├── DESCRIPTION
├── NAMESPACE
├── R
    ├── bclust.R
    ├── bdist.R
    ├── binomix.R
    ├── blasting.R
    ├── domainclust.R
    ├── entrez.R
    ├── extern.R
    ├── extractPanGenes.R
    ├── genomedistances.R
    ├── hmmer3.R
    ├── micropan.R
    ├── panmat.R
    ├── panpca.R
    ├── panprep.R
    ├── powerlaw.R
    ├── rarefaction.R
    ├── xmpl.R
    └── xz.R
├── Readme.Rmd
├── Readme.md
├── Readme_files
    ├── figure-gfm
    │   ├── unnamed-chunk-10-1.png
    │   ├── unnamed-chunk-15-1.png
    │   ├── unnamed-chunk-16-1.png
    │   ├── unnamed-chunk-20-1.png
    │   ├── unnamed-chunk-21-1.png
    │   ├── unnamed-chunk-26-1.png
    │   ├── unnamed-chunk-27-1.png
    │   ├── unnamed-chunk-28-1.png
    │   ├── unnamed-chunk-29-1.png
    │   ├── unnamed-chunk-30-1.png
    │   ├── unnamed-chunk-31-1.png
    │   ├── unnamed-chunk-32-1.png
    │   ├── unnamed-chunk-8-1.png
    │   └── unnamed-chunk-9-1.png
    └── figure-markdown_github
    │   ├── unnamed-chunk-10-1.png
    │   ├── unnamed-chunk-14-1.png
    │   ├── unnamed-chunk-16-1.png
    │   ├── unnamed-chunk-18-1.png
    │   ├── unnamed-chunk-20-1.png
    │   ├── unnamed-chunk-21-1.png
    │   ├── unnamed-chunk-25-1.png
    │   ├── unnamed-chunk-26-1.png
    │   ├── unnamed-chunk-27-1.png
    │   ├── unnamed-chunk-28-1.png
    │   ├── unnamed-chunk-29-1.png
    │   ├── unnamed-chunk-30-1.png
    │   ├── unnamed-chunk-31-1.png
    │   ├── unnamed-chunk-32-1.png
    │   ├── unnamed-chunk-7-1.png
    │   ├── unnamed-chunk-8-1.png
    │   └── unnamed-chunk-9-1.png
├── data
    ├── xmpl.bclst.rda
    ├── xmpl.bdist.rda
    └── xmpl.panmat.rda
├── inst
    └── extdata
    │   ├── GID1_vs_GID1_.txt.xz
    │   ├── GID1_vs_microfam.hmm.txt.xz
    │   ├── GID2_vs_GID1_.txt.xz
    │   ├── GID2_vs_GID2_.txt.xz
    │   ├── GID2_vs_microfam.hmm.txt.xz
    │   ├── GID3_vs_GID1_.txt.xz
    │   ├── GID3_vs_GID2_.txt.xz
    │   ├── GID3_vs_GID3_.txt.xz
    │   ├── GID3_vs_microfam.hmm.txt.xz
    │   ├── microfam.hmm.h3f.xz
    │   ├── microfam.hmm.h3i.xz
    │   ├── microfam.hmm.h3m.xz
    │   ├── microfam.hmm.h3p.xz
    │   ├── xmpl.faa.xz
    │   ├── xmpl_GID1.faa.xz
    │   ├── xmpl_GID2.faa.xz
    │   └── xmpl_GID3.faa.xz
├── man
    ├── bClust.Rd
    ├── bDist.Rd
    ├── binomixEstimate.Rd
    ├── blastpAllAll.Rd
    ├── chao.Rd
    ├── dClust.Rd
    ├── distJaccard.Rd
    ├── distManhattan.Rd
    ├── entrezDownload.Rd
    ├── extractPanGenes.Rd
    ├── fluidity.Rd
    ├── geneFamilies2fasta.Rd
    ├── geneWeights.Rd
    ├── getAccessions.Rd
    ├── heaps.Rd
    ├── hmmerCleanOverlap.Rd
    ├── hmmerScan.Rd
    ├── isOrtholog.Rd
    ├── micropan.Rd
    ├── panMatrix.Rd
    ├── panPca.Rd
    ├── panPrep.Rd
    ├── rarefaction.Rd
    ├── readBlastSelf.Rd
    ├── readHmmer.Rd
    ├── xmpl.Rd
    └── xz.Rd
└── vignettes
    ├── vignette-concordance.tex
    ├── vignette.Rnw
    ├── vignette.aux
    ├── vignette.log
    ├── vignette.pdf
    └── vignette.tex


/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: micropan
 2 | Type: Package
 3 | Title: Microbial Pan-Genome Analysis
 4 | Version: 2.2.1
 5 | Date: 2022-04-12
 6 | Author: Lars Snipen and Kristian Hovde Liland
 7 | Maintainer: Lars Snipen <lars.snipen@nmbu.no>
 8 | Description: A collection of functions for computations and visualizations of microbial pan-genomes.
 9 | License: GPL-2
10 | Depends:
11 |     R (>= 4.0.0),
12 |     microseq,
13 |     dplyr,
14 |     stringr,
15 |     igraph
16 | Imports:
17 |     tibble,
18 |     rlang
19 | LazyData: FALSE
20 | ZipData: TRUE
21 | RoxygenNote: 7.1.1
22 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(bClust)
 4 | export(bDist)
 5 | export(binomixEstimate)
 6 | export(blastpAllAll)
 7 | export(chao)
 8 | export(dClust)
 9 | export(distJaccard)
10 | export(distManhattan)
11 | export(entrezDownload)
12 | export(extractPanGenes)
13 | export(fluidity)
14 | export(geneFamilies2fasta)
15 | export(geneWeights)
16 | export(getAccessions)
17 | export(heaps)
18 | export(hmmerCleanOverlap)
19 | export(hmmerScan)
20 | export(isOrtholog)
21 | export(panMatrix)
22 | export(panPca)
23 | export(panPrep)
24 | export(rarefaction)
25 | export(readBlastPair)
26 | export(readBlastSelf)
27 | export(readHmmer)
28 | export(xzcompress)
29 | export(xzuncompress)
30 | importFrom(dplyr,"%>%")
31 | importFrom(dplyr,arrange)
32 | importFrom(dplyr,bind_cols)
33 | importFrom(dplyr,bind_rows)
34 | importFrom(dplyr,desc)
35 | importFrom(dplyr,distinct)
36 | importFrom(dplyr,filter)
37 | importFrom(dplyr,group_by)
38 | importFrom(dplyr,mutate)
39 | importFrom(dplyr,n)
40 | importFrom(dplyr,rename)
41 | importFrom(dplyr,right_join)
42 | importFrom(dplyr,select)
43 | importFrom(dplyr,slice)
44 | importFrom(dplyr,summarize)
45 | importFrom(igraph,components)
46 | importFrom(igraph,degree)
47 | importFrom(igraph,graph_from_edgelist)
48 | importFrom(microseq,readFasta)
49 | importFrom(microseq,writeFasta)
50 | importFrom(rlang,.data)
51 | importFrom(stats,as.dendrogram)
52 | importFrom(stats,as.dist)
53 | importFrom(stats,constrOptim)
54 | importFrom(stats,cutree)
55 | importFrom(stats,dendrapply)
56 | importFrom(stats,dist)
57 | importFrom(stats,hclust)
58 | importFrom(stats,is.leaf)
59 | importFrom(stats,optim)
60 | importFrom(stats,prcomp)
61 | importFrom(stats,sd)
62 | importFrom(stringr,str_c)
63 | importFrom(stringr,str_detect)
64 | importFrom(stringr,str_extract)
65 | importFrom(stringr,str_extract_all)
66 | importFrom(stringr,str_length)
67 | importFrom(stringr,str_remove)
68 | importFrom(stringr,str_remove_all)
69 | importFrom(stringr,str_replace)
70 | importFrom(stringr,str_replace_all)
71 | importFrom(stringr,str_split)
72 | importFrom(stringr,str_sub)
73 | importFrom(tibble,as_tibble)
74 | importFrom(tibble,tibble)
75 | importFrom(utils,read.table)
76 | 


--------------------------------------------------------------------------------
/R/bclust.R:
--------------------------------------------------------------------------------
  1 | #' @name bClust
  2 | #' @title Clustering sequences based on pairwise distances
  3 | #' 
  4 | #' @description Sequences are clustered by hierarchical clustering based on a set of pariwise distances.
  5 | #' The distances must take values between 0.0 and 1.0, and all pairs \emph{not} listed are assumed to
  6 | #' have distance 1.0.
  7 | #' 
  8 | #' @param dist.tbl A \code{tibble} with pairwise distances. 
  9 | #' @param linkage A text indicating what type of clustering to perform, either \samp{complete} (default),
 10 | #' \samp{average} or \samp{single}.
 11 | #' @param threshold Specifies the maximum size of a cluster. Must be a distance, i.e. a number between
 12 | #' 0.0 and 1.0.
 13 | #' @param verbose Logical, turns on/off text output during computations.
 14 | #' 
 15 | #' @details  Computing clusters (gene families) is an essential step in many comparative studies.
 16 | #' \code{\link{bClust}} will assign sequences into gene families by a hierarchical clustering approach.
 17 | #' Since the number of sequences may be huge, a full all-against-all distance matrix will be impossible
 18 | #' to handle in memory. However, most sequence pairs will have an \sQuote{infinite} distance between them,
 19 | #' and only the pairs with a finite (smallish) distance need to be considered.
 20 | #'
 21 | #' This function takes as input the distances in \code{dist.tbl} where only the relevant distances are
 22 | #' listed. The columns \samp{Query} and \samp{Hit} contain tags identifying pairs of sequences. The column
 23 | #' \samp{Distance} contains the distances, always a number from 0.0 to 1.0. Typically, this is the output
 24 | #' from \code{\link{bDist}}. All pairs of sequences \emph{not} listed are assumed to have distance 1.0,
 25 | #' which is considered the \sQuote{infinite} distance.
 26 | #' All sequences must be listed at least once in ceither column \samp{Query} or \samp{Hit} of the \code{dist.tbl}.
 27 | #' This should pose no problem, since all sequences must have distance 0.0 to themselves, and should be listed
 28 | #' with this distance once (\samp{Query} and \samp{Hit} containing the same tag). 
 29 | #' 
 30 | #' The \samp{linkage} defines the type of clusters produced. The \samp{threshold} indicates the size of
 31 | #' the clusters. A \samp{single} linkage clustering means all members of a cluster have at least one other
 32 | #' member of the same cluster within distance \samp{threshold} from itself. An \samp{average} linkage means
 33 | #' all members of a cluster are within the distance \samp{threshold} from the center of the cluster. A
 34 | #' \samp{complete} linkage means all members of a cluster are no more than the distance \samp{threshold}
 35 | #' away from any other member of the same cluster. 
 36 | #' 
 37 | #' Typically, \samp{single} linkage produces big clusters where members may differ a lot, since they are
 38 | #' only required to be close to something, which is close to something,...,which is close to some other
 39 | #' member. On the other extreme, \samp{complete} linkage will produce small and tight clusters, since all
 40 | #' must be similar to all. The \samp{average} linkage is between, but closer to \samp{complete} linkage. If
 41 | #' you want the \samp{threshold} to specify directly the maximum distance tolerated between two members of
 42 | #' the same gene family, you must use \samp{complete} linkage. The \samp{single} linkage is the fastest
 43 | #' alternative to compute. Using the default setting of \samp{single} linkage and maximum \samp{threshold}
 44 | #' (1.0) will produce the largest and fewest clusters possible.
 45 | #' 
 46 | #' @return The function returns a vector of integers, indicating the cluster membership of every unique
 47 | #' sequence from the \samp{Query} or \samp{Hit} columns of the input \samp{dist.tbl}. The name
 48 | #' of each element indicates the sequence. The numerical values have no meaning as such, they are simply
 49 | #' categorical indicators of cluster membership.
 50 | #' 
 51 | #' @author Lars Snipen and Kristian Hovde Liland.
 52 | #' 
 53 | #' @seealso \code{\link{bDist}}, \code{\link{hclust}}, \code{\link{dClust}}, \code{\link{isOrtholog}}.
 54 | #' 
 55 | #' @examples
 56 | #' # Loading example BLAST distances
 57 | #' data(xmpl.bdist)
 58 | #' 
 59 | #' # Clustering with default settings
 60 | #' clst <- bClust(xmpl.bdist)
 61 | #' # Other settings, and verbose
 62 | #' clst <- bClust(xmpl.bdist, linkage = "average", threshold = 0.5, verbose = TRUE)
 63 | #' 
 64 | #' @importFrom igraph graph_from_edgelist components degree
 65 | #' @importFrom stats hclust as.dist cutree
 66 | #' @importFrom dplyr filter %>% 
 67 | #' @importFrom rlang .data
 68 | #' 
 69 | #' @export bClust
 70 | #' 
 71 | bClust <- function(dist.tbl, linkage = "complete", threshold = 0.75, verbose = TRUE){
 72 |   if(verbose) cat("bClust:\n")
 73 |   linknum <- grep(linkage, c("single", "average", "complete"))
 74 |   dist.tbl %>% 
 75 |     filter(.data$Distance < threshold) -> dist.tbl
 76 |   utag <- sort(unique(c(dist.tbl$Dbase, dist.tbl$Query))) # Important to sort here!
 77 |     
 78 |   if(verbose) cat("...constructing graph with", length(utag), "sequences (nodes) and", nrow(dist.tbl), "distances (edges)\n")
 79 |   M <- matrix(as.numeric(factor(c(dist.tbl$Dbase, dist.tbl$Query), levels = utag)), ncol = 2, byrow = F)
 80 |   g <- graph_from_edgelist(M, directed = F)
 81 |   cls <- components(g)
 82 |   if(verbose) cat("...found", cls$no, "single linkage clusters\n")
 83 |   tibble(cluster = cls$membership,
 84 |          tag = utag) -> cls.tbl
 85 |   
 86 |   if(linknum > 1){
 87 |     ucls <- sort(unique(cls.tbl$cluster))
 88 |     incomplete <- which(sapply(ucls, function(j){
 89 |       v <- which(cls$membership == j)
 90 |       degg <- degree(g, v)
 91 |       return(min(degg) < (length(degg) + 1))
 92 |     }))
 93 |     if(verbose) cat("...found", length(incomplete), "incomplete clusters\n")
 94 |     if(length(incomplete) > 0){
 95 |       cls.tbl %>% 
 96 |         filter(.data$cluster %in% incomplete) %>% 
 97 |         mutate(cluster = .data$cluster * 1000) -> inc.tbl
 98 |       cls.tbl %>% 
 99 |         filter(!(.data$cluster %in% incomplete)) -> cls.tbl
100 |       ucls.c <- unique(inc.tbl$cluster)
101 |       for(i in 1:length(ucls.c)){   # for each incomplete cluster
102 |         inc.tbl %>% 
103 |           filter(.data$cluster == ucls.c[i]) -> tbl
104 |         D <- matrix(1, nrow = nrow(tbl), ncol = nrow(tbl))
105 |         rownames(D) <- colnames(D) <- tbl$tag
106 |         dist.tbl %>% 
107 |           filter(.data$Dbase %in% tbl$tag | .data$Query %in% tbl$tag) %>% 
108 |           mutate(Dbase = factor(.data$Dbase, levels = tbl$tag)) %>% 
109 |           mutate(Query = factor(.data$Query, levels = tbl$tag)) -> d.tbl
110 |         M <- matrix(c(as.integer(d.tbl$Dbase), as.integer(d.tbl$Query)), ncol = 2, byrow = F)
111 |         D[M] <- d.tbl$Distance
112 |         D[M[,c(2,1)]] <- d.tbl$Distance
113 |         if(linknum == 2){
114 |           clst <- hclust(as.dist(D), method = "average")
115 |         } else {
116 |           clst <- hclust(as.dist(D), method = "complete")
117 |         }
118 |         tbl %>% 
119 |           mutate(cluster = .data$cluster + cutree(clst, h = threshold)) %>% 
120 |           bind_rows(cls.tbl) -> cls.tbl
121 |         if(verbose) cat(i, "/", length(ucls.c), "\r")
122 |       }
123 |     }
124 |   }
125 |   clustering <- as.integer(factor(cls.tbl$cluster))  # to get values 1,2,3,...
126 |   names(clustering) <- cls.tbl$tag
127 |   if(verbose) cat("\n...ended with", length(unique(clustering)),
128 |                   "clusters, largest cluster has",
129 |                   max(table(clustering)), "members\n")
130 |   return(sort(clustering))
131 | }
132 | 
133 | 
134 | #' @name isOrtholog
135 | #' @title Identifies orthologs in gene clusters
136 | #' 
137 | #' @description Finds the ortholog sequences in every cluster based on pairwise distances.
138 | #' 
139 | #' @param clustering A vector of integers indicating the cluster for every sequence. Sequences with
140 | #' the same number belong to the same cluster. The name of each element is the tag identifying the sequence.
141 | #' @param dist.tbl A \code{tibble} with pairwise distances. The columns \samp{Query} and
142 | #' \samp{Hit} contain tags identifying pairs of sequences. The column \samp{Distance} contains the
143 | #' distances, always a number from 0.0 to 1.0.
144 | #' 
145 | #' @details The input \code{clustering} is typically produced by \code{\link{bClust}}. The input
146 | #' \code{dist.tbl} is typically produced by \code{\link{bDist}}.
147 | #' 
148 | #' The concept of orthologs is difficult for prokaryotes, and this function finds orthologs in a
149 | #' simplistic way. For a given cluster, with members from many genomes, there is one ortholog from every
150 | #' genome. In cases where a genome has two or more members in the same cluster, only one of these is an
151 | #' ortholog, the rest are paralogs.
152 | #' 
153 | #' Consider all sequences from the same genome belonging to the same cluster. The ortholog is defined as
154 | #' the one having the smallest sum of distances to all other members of the same cluster, i.e. the one
155 | #' closest to the \sQuote{center} of the cluster.
156 | #' 
157 | #' Note that the status as ortholog or paralog depends greatly on how clusters are defined in the first
158 | #' place. If you allow large and diverse (and few) clusters, many sequences will be paralogs. If you define
159 | #' tight and homogenous (and many) clusters, almost all sequences will be orthologs.
160 | #' 
161 | #' @return A vector of logicals with the same number of elements as the input \samp{clustering}, indicating
162 | #' if the corresponding sequence is an ortholog (\code{TRUE}) or not (\code{FALSE}). The name of each
163 | #' element is copied from \samp{clustering}.
164 | #' 
165 | #' @author Lars Snipen and Kristian Hovde Liland.
166 | #' 
167 | #' @seealso \code{\link{bDist}}, \code{\link{bClust}}.
168 | #' 
169 | #' @examples
170 | #' \dontrun{
171 | #' # Loading distance data and their clustering results
172 | #' data(list = c("xmpl.bdist","xmpl.bclst"))
173 | #' 
174 | #' # Finding orthologs
175 | #' is.ortholog <- isOrtholog(xmpl.bclst, xmpl.bdist)
176 | #' # The orthologs are
177 | #' which(is.ortholog)
178 | #' }
179 | #' 
180 | #' @export isOrtholog
181 | #' 
182 | isOrtholog <- function(clustering, dist.tbl){
183 |   uclst <- unique(clustering)
184 |   tags <- names(clustering)
185 |   is.ortholog <- rep(F, length(clustering))
186 |   names(is.ortholog) <- tags
187 |   for(i in 1:length(uclst)){
188 |     idx <- which(clustering == uclst[i])
189 |     idd <- which((dist.tbl$Query %in% tags[idx]) & (dist.tbl$Hit %in% tags[idx]))
190 |     gidz <- str_extract(tags[idx], "GID[0-9]+")
191 |     if(max(table(gidz)) > 1){
192 |       D <- matrix(1, nrow = length(idx), ncol = length(idx))
193 |       a <- as.numeric(factor(dist.tbl$Query[idd], levels = tags[idx]))
194 |       b <- as.numeric(factor(dist.tbl$Hit[idd], levels = tags[idx]))
195 |       D[matrix(c(a,b), ncol = 2, byrow = F)] <- dist.tbl$Distance[idd]
196 |       D[matrix(c(b,a), ncol = 2, byrow = F)] <- dist.tbl$Distance[idd]
197 |       ixx <- order(rowSums(D))
198 |       ixd <- which(!duplicated(gidz[ixx]))
199 |       is.ortholog[idx[ixx[ixd]]] <- T
200 |     } else {
201 |       is.ortholog[idx] <- T
202 |     }
203 |     cat(i, "/", length(uclst), "\r")
204 |   }
205 |   return(is.ortholog)
206 | }
207 |    
208 |     
209 |     
210 | 


--------------------------------------------------------------------------------
/R/bdist.R:
--------------------------------------------------------------------------------
  1 | #' @name bDist
  2 | #' @title Computes distances between sequences
  3 | #' 
  4 | #' @description Computes distance between all sequences based on the BLAST bit-scores.
  5 | #' 
  6 | #' @param blast.files A text vector of BLAST result filenames.
  7 | #' @param blast.tbl A table with BLAST results.
  8 | #' @param e.value A threshold E-value to immediately discard (very) poor BLAST alignments.
  9 | #' @param verbose Logical, indicating if textual output should be given to monitor the progress.
 10 | #' 
 11 | #' @details The essential input is either a vector of BLAST result filenames (\code{blast.files}) or a
 12 | #' table of the BLAST results (\code{blast.tbl}). It is no point in providing both, then \code{blast.tbl} is ignored.
 13 | #' 
 14 | #' For normal sized data sets (e.g. less than 100 genomes), you would provide the BLAST filenames as the argument
 15 | #' \code{blast.files} to this function.
 16 | #' Then results are read, and distances are computed. Only if you have huge data sets, you may find it more efficient to 
 17 | #' read the files using \code{\link{readBlastSelf}} and \code{\link{readBlastPair}} separately, and then provide as the
 18 | #' argument \code{blast.tbl]} the table you get from binding these results. In all cases, the BLAST result files must
 19 | #' have been produced by \code{\link{blastpAllAll}}.
 20 | #' 
 21 | #' Setting a small \samp{e.value} threshold can speed up the computation and reduce the size of the
 22 | #' output, but you may loose some alignments that could produce smallish distances for short sequences.
 23 | #' 
 24 | #' The distance computed is based on alignment bitscores. Assume the alignment of query A against hit B
 25 | #' has a bitscore of S(A,B). The distance is D(A,B)=1-2*S(A,B)/(S(A,A)+S(B,B)) where S(A,A) and S(B,B) are
 26 | #' the self-alignment bitscores, i.e. the scores of aligning against itself. A distance of
 27 | #' 0.0 means A and B are identical. The maximum possible distance is 1.0, meaning there is no BLAST between A and B.
 28 | #'
 29 | #' This distance should not be interpreted as lack of identity! A distance of 0.0 means 100\% identity,
 30 | #' but a distance of 0.25 does \emph{not} mean 75\% identity. It has some resemblance to an evolutinary
 31 | #' (raw) distance, but since it is based on protein alignments, the type of mutations plays a significant
 32 | #' role, not only the number of mutations.
 33 | #' 
 34 | #' @return The function returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore}
 35 | #' and \samp{Distance}. Each row corresponds to a pair of sequences (Dbase and Query sequences) having at least
 36 | #' one BLAST hit between
 37 | #' them. All pairs \emph{not} listed in the output have distance 1.0 between them.
 38 | #' 
 39 | #' @author Lars Snipen and Kristian Hovde Liland.
 40 | #' 
 41 | #' @seealso \code{\link{blastpAllAll}}, \code{\link{readBlastSelf}}, \code{\link{readBlastPair}},
 42 | #' \code{\link{bClust}}, \code{\link{isOrtholog}}.
 43 | #' 
 44 | #' @examples
 45 | #' # Using BLAST result files in this package...
 46 | #' prefix <- c("GID1_vs_GID1_",
 47 | #'             "GID2_vs_GID1_",
 48 | #'             "GID3_vs_GID1_",
 49 | #'             "GID2_vs_GID2_",
 50 | #'             "GID3_vs_GID2_",
 51 | #'             "GID3_vs_GID3_")
 52 | #' bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz"))
 53 | #' 
 54 | #' # We need to uncompress them first...
 55 | #' blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz")
 56 | #' ok <- file.copy(from = bf, to = blast.files)
 57 | #' blast.files <- unlist(lapply(blast.files, xzuncompress))
 58 | #' 
 59 | #' # Computing pairwise distances
 60 | #' blast.dist <- bDist(blast.files)
 61 | #' 
 62 | #' # Read files separately, then use bDist
 63 | #' self.tbl <- readBlastSelf(blast.files)
 64 | #' pair.tbl <- readBlastPair(blast.files)
 65 | #' blast.dist <- bDist(blast.tbl = bind_rows(self.tbl, pair.tbl))
 66 | #' 
 67 | #' # ...and cleaning...
 68 | #' ok <- file.remove(blast.files)
 69 | #' 
 70 | #' # See also example for blastpAl
 71 | #' 
 72 | #' @importFrom tibble tibble
 73 | #' @importFrom stringr str_extract_all
 74 | #' @importFrom dplyr %>% rename filter arrange mutate distinct select bind_rows desc
 75 | #' @importFrom utils read.table
 76 | #' @importFrom rlang .data
 77 | #' 
 78 | #' @export bDist
 79 | #' 
 80 | bDist <- function(blast.files = NULL, blast.tbl = NULL, e.value = 1, verbose = TRUE){
 81 |   if(!is.null(blast.files)){
 82 |     readBlastSelf(blast.files, e.value = e.value, verbose = verbose) %>% 
 83 |       filter(.data$Evalue <= e.value) %>% 
 84 |       arrange(desc(.data$Bitscore)) %>% 
 85 |       mutate(Pair = sortPaste(.data$Dbase, .data$Query)) %>% 
 86 |       distinct(.data$Pair, .keep_all = TRUE) %>% 
 87 |       select(.data$Dbase, .data$Query, .data$Bitscore) -> self.tbl
 88 |     readBlastPair(blast.files, e.value = e.value, verbose = verbose) %>% 
 89 |       filter(.data$Evalue <= e.value) %>% 
 90 |       select(.data$Dbase, .data$Query, .data$Bitscore) %>% 
 91 |       arrange(desc(.data$Bitscore)) %>% 
 92 |       distinct(.data$Dbase, .data$Query, .keep_all = TRUE) %>% 
 93 |       bind_rows(self.tbl) -> blast.tbl
 94 |   } else if(is.null(blast.tbl)){
 95 |     stop("Needs either blast.files or blast.tbl as input")
 96 |   } else {
 97 |     blast.tbl %>% 
 98 |       filter(.data$Evalue <= e.value) %>% 
 99 |       select(-.data$Evalue) %>% 
100 |       arrange(desc(.data$Bitscore)) %>% 
101 |       mutate(GIDd = str_extract(.data$Dbase, "GID[0-9]+")) %>% 
102 |       mutate(GIDq = str_extract(.data$Query, "GID[0-9]+")) -> blast.tbl
103 |     blast.tbl %>% 
104 |       filter(.data$GIDd == .data$GIDq) %>% 
105 |       mutate(Pair = sortPaste(.data$Dbase, .data$Query)) %>% 
106 |       distinct(.data$Pair, .keep_all = TRUE) %>% 
107 |       select(-.data$Pair) -> self.tbl
108 |     blast.tbl %>% 
109 |       filter(.data$GIDd != .data$GIDq) %>% 
110 |       bind_rows(self.tbl) %>% 
111 |       select(-.data$GIDd, -.data$GIDq) %>% 
112 |       distinct(.data$Dbase, .data$Query, .keep_all = TRUE) -> blast.tbl
113 |   }
114 |   if(verbose) cat("bDist:\n   ...found", nrow(blast.tbl), "alignments...\n")
115 |   blast.tbl %>% 
116 |     filter(.data$Dbase == .data$Query) -> self.tbl
117 |   if(verbose) cat("   ...where", nrow(self.tbl), "are self-alignments...\n")
118 |   idx.d <- match(blast.tbl$Dbase, self.tbl$Dbase)
119 |   idd <- which(is.na(idx.d))
120 |   if(length(idd) > 0) stop("No self-alignment for sequences: ", str_c(unique(blast.tbl$Dbase[idd]), collapse = ","))
121 |   idx.q <- match(blast.tbl$Query, self.tbl$Query)
122 |   idd <- which(is.na(idx.q))
123 |   if(length(idd) > 0) stop("No self-alignment for sequences: ", str_c(unique(blast.tbl$Query[idd]), collapse = ","))
124 |   
125 |   blast.tbl %>%
126 |     mutate(Distance = 1 - (2 * .data$Bitscore) / (self.tbl$Bitscore[idx.d] + self.tbl$Bitscore[idx.q])) %>% 
127 |     arrange(.data$Dbase, .data$Query) -> dist.tbl
128 |   return(dist.tbl)
129 | }
130 | 
131 | 
132 | # Local function
133 | sortPaste <- function(q, h){
134 |   M <- matrix(c(q, h), ncol = 2, byrow = F)
135 |   pp <- apply(M, 1, function(x){paste(sort(x), collapse = ":")})
136 |   return(pp)
137 | }
138 | 
139 | 
140 | 
141 | #' @name readBlastSelf
142 | #' @aliases readBlastSelf readBlastPair
143 | #' @title Reads BLAST result files
144 | #' 
145 | #' @description Reads files from a search with blastpAllAll
146 | #' 
147 | #' @param blast.files A text vector of filenames.
148 | #' @param e.value A threshold E-value to immediately discard (very) poor BLAST alignments.
149 | #' @param verbose Logical, indicating if textual output should be given to monitor the progress.
150 | #' 
151 | #' @details The filenames given as input must refer to BLAST result files produced by \code{\link{blastpAllAll}}.
152 | #' 
153 | #' With \code{readBlastSelf} you only read the self-alignment results, i.e. blasting a genome against itself. With
154 | #' \code{readBlastPair} you read all the other files, i.e. different genomes compared. You may use all blast file
155 | #' names as input to both, they will select the proper files based on their names, e.g. GID1_vs_GID1.txt is read
156 | #' by \code{readBlastSelf} while GID2_vs_GID1.txt is read by \code{readBlastPair}.
157 | #' 
158 | #' Setting a small \samp{e.value} threshold will filter the alignment, and may speed up this and later processing,
159 | #' but you may also loose some important alignments for short sequences.
160 | #' 
161 | #' Both these functions are used by \code{\link{bDist}}. The reason we provide them separately is to allow the user
162 | #' to complete this file reading before calling \code{\link{bDist}}. If you have a huge number of files, a
163 | #' skilled user may utilize parallell processing to speed up the reading. For normal size data sets (e.g. less than 100 genomes)
164 | #' you should probably use \code{\link{bDist}} directly.
165 | #' 
166 | #' @return The functions returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore}
167 | #' and \samp{Distance}. Each row corresponds to a pair of sequences (a Dbase and a Query sequence) having at least
168 | #' one BLAST hit between
169 | #' them. All pairs \emph{not} listed have distance 1.0 between them. You should normally bind the output from 
170 | #' \code{readBlastSelf} to the ouptut from \code{readBlastPair} and use the result as input to \code{\link{bDist}}.
171 | #' 
172 | #' @author Lars Snipen.
173 | #' 
174 | #' @seealso \code{\link{bDist}}, \code{\link{blastpAllAll}}.
175 | #' 
176 | #' @examples
177 | #' # Using BLAST result files in this package...
178 | #' prefix <- c("GID1_vs_GID1_",
179 | #'             "GID2_vs_GID1_",
180 | #'             "GID3_vs_GID1_",
181 | #'             "GID2_vs_GID2_",
182 | #'             "GID3_vs_GID2_",
183 | #'             "GID3_vs_GID3_")
184 | #' bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz"))
185 | #' 
186 | #' # We need to uncompress them first...
187 | #' blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz")
188 | #' ok <- file.copy(from = bf, to = blast.files)
189 | #' blast.files <- unlist(lapply(blast.files, xzuncompress))
190 | #' 
191 | #' # Reading self-alignment files, then the other files
192 | #' self.tbl <- readBlastSelf(blast.files)
193 | #' pair.tbl <- readBlastPair(blast.files)
194 | #' 
195 | #' # ...and cleaning...
196 | #' ok <- file.remove(blast.files)
197 | #' 
198 | #' # See also examples for bDist
199 | #' 
200 | #' @importFrom stringr str_extract_all
201 | #' @importFrom dplyr %>% rename filter bind_rows
202 | #' @importFrom utils read.table
203 | #' @importFrom rlang .data
204 | #' 
205 | #' @export readBlastSelf readBlastPair
206 | #' 
207 | readBlastSelf <- function(blast.files, e.value = 1, verbose = TRUE){
208 |   blast.files <- normalizePath(blast.files)
209 |   if(verbose) cat("readBlastSelf:\n   ...received", length(blast.files), "blast-files...\n")
210 |   gids <- str_extract_all(blast.files, "GID[0-9]+", simplify = TRUE)
211 |   self.idx <- which(gids[,1] == gids[,2])
212 |   if(verbose) cat("   ...found", length(self.idx), "self-alignment files...\n")
213 |   lapply(blast.files[self.idx], read.table, header = FALSE, sep = "\t", strip.white = TRUE, stringsAsFactors = FALSE) %>% 
214 |     bind_rows() %>% 
215 |     rename(Dbase = .data$V1, Query = .data$V2, Evalue = .data$V3, Bitscore = .data$V4) %>% 
216 |     filter(.data$Evalue <= e.value) -> self.tbl
217 |   if(verbose) cat("   ...returns", nrow(self.tbl), "alignment results\n")
218 |   return(self.tbl)
219 | }
220 | readBlastPair <- function(blast.files, e.value = 1, verbose = TRUE){
221 |   blast.files <- normalizePath(blast.files)
222 |   if(verbose) cat("readBlastPairs:\n   ...received", length(blast.files), "blast-files...\n")
223 |   gids <- str_extract_all(blast.files, "GID[0-9]+", simplify = TRUE)
224 |   pair.idx <- which(gids[,1] != gids[,2])
225 |   if(verbose) cat("   ...found", length(pair.idx), "alignment files who are NOT self-alignments...\n")
226 |   lapply(blast.files[pair.idx], read.table, header = FALSE, sep = "\t", strip.white = TRUE, stringsAsFactors = FALSE) %>% 
227 |     bind_rows() %>% 
228 |     rename(Dbase = .data$V1, Query = .data$V2, Evalue = .data$V3, Bitscore = .data$V4) %>% 
229 |     filter(.data$Evalue <= e.value) -> pair.tbl
230 |   if(verbose) cat("   ...returns", nrow(pair.tbl), "alignment results\n")
231 |   return(pair.tbl)
232 | }
233 | 
234 | 


--------------------------------------------------------------------------------
/R/binomix.R:
--------------------------------------------------------------------------------
  1 | #' @name binomixEstimate
  2 | #' @aliases binomixEstimate
  3 | #' 
  4 | #' @title Binomial mixture model estimates
  5 | #' 
  6 | #' @description Fits binomial mixture models to the data given as a pan-matrix. From the fitted models
  7 | #' both estimates of pan-genome size and core-genome size are available.
  8 | #' 
  9 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
 10 | #' @param K.range The range of model complexities to explore. The vector of integers specify the number
 11 | #' of binomial densities to combine in the mixture models.
 12 | #' @param core.detect.prob The detection probability of core genes. This should almost always be 1.0,
 13 | #' since a core gene is by definition always present in all genomes, but can be set fractionally smaller.
 14 | #' @param verbose Logical indicating if textual output should be given to monitor the progress of the
 15 | #' computations.
 16 | #' 
 17 | #' @details  A binomial mixture model can be used to describe the distribution of gene clusters across
 18 | #' genomes in a pan-genome. The idea and the details of the computations are given in Hogg et al (2007),
 19 | #' Snipen et al (2009) and Snipen & Ussery (2012).
 20 | #' 
 21 | #' Central to the concept is the idea that every gene has a detection probability, i.e. a probability of
 22 | #' being present in a genome. Genes who are always present in all genomes are called core genes, and these
 23 | #' should have a detection probability of 1.0. Other genes are only present in a subset of the genomes, and
 24 | #' these have smaller detection probabilities. Some genes are only present in one single genome, denoted
 25 | #' ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated
 26 | #' is large these latter must have a very small detection probability. 
 27 | #' 
 28 | #' A binomial mixture model with \samp{K} components estimates \samp{K} detection probabilities from the
 29 | #' data. The more components you choose, the better you can fit the (present) data, at the cost of less
 30 | #' precision in the estimates due to less degrees of freedom. \code{\link{binomixEstimate}} allows you to
 31 | #' fit several models, and the input \samp{K.range} specifies which values of \samp{K} to try out. There no
 32 | #' real point using \samp{K} less than 3, and the default is \samp{K.range=3:5}. In general, the more genomes
 33 | #' you have the larger you can choose \samp{K} without overfitting.  Computations will be slower for larger
 34 | #' values of \samp{K}. In order to choose the optimal value for \samp{K}, \code{\link{binomixEstimate}}
 35 | #' computes the BIC-criterion, see below.
 36 | #' 
 37 | #' As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a
 38 | #' \samp{K}-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet
 39 | #' observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer
 40 | #' core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene
 41 | #' clusters, i.e. those still left after having observed \sQuote{infinite} many genomes.
 42 | #' 
 43 | #' The detection probability of core genes should be 1.0, but can at times be set fractionally smaller.
 44 | #' This means you accept that even core genes are not always detected in every genome, e.g. they may be
 45 | #' there, but your gene prediction has missed them. Notice that setting the \samp{core.detect.prob} to less
 46 | #' than 1.0 may affect the core gene size estimate dramatically.
 47 | #' 
 48 | #' @return \code{\link{binomixEstimate}} returns a \code{list} with two components, the \samp{BIC.tbl}
 49 | #' and \samp{Mix.tbl}.
 50 | #' 
 51 | #' The \samp{BIC.tbl} is a \code{tibble} listing, in each row, the results for each number of components
 52 | #' used, given by the input \samp{K.range}. The column \samp{Core.size} is the estimated number of
 53 | #' core gene families, the column \samp{Pan.size} is the estimated pan-genome size. The column
 54 | #' \samp{BIC} is the Bayesian Information Criterion (Schwarz, 1978) that should be used to choose the
 55 | #' optimal component number (\samp{K}). The number of components where \samp{BIC} is minimized is the
 56 | #' optimum. If minimum \samp{BIC} is reached for the largest \samp{K} value you should extend the
 57 | #' \samp{K.range} to larger values and re-fit. The function will issue
 58 | #' a \code{warning} to remind you of this.
 59 | #' 
 60 | #' The \samp{Mix.tbl} is a \code{tibble} with estimates from the mixture models. The column \samp{Component}
 61 | #' indicates the model, i.e. all rows where \samp{Component} has the same value are from the same model.
 62 | #' There will be 3 rows for 3-component model, 4 rows for 4-component, etc. The column \samp{Detection.prob}
 63 | #' contain the estimated detection probabilities for each component of the mixture models. A 
 64 | #' \samp{Mixing.proportion} is the proportion of the gene clusters having the corresponding \samp{Detection.prob},
 65 | #' i.e. if core genes have \samp{Detection.prob} 1.0, the corresponding \samp{Mixing.proportion} (same row)
 66 | #' indicates how large fraction of the gene families are core genes.
 67 | #' 
 68 | #' @references
 69 | #' Hogg, J.S., Hu, F.Z, Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J.C., Ehrlich, G.D. (2007).
 70 | #' Characterization and modeling of the Haemophilus influenzae core- and supra-genomes based on the
 71 | #' complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biology, 8:R103.
 72 | #' 
 73 | #' Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial
 74 | #' mixture models. BMC Genomics, 10:385.
 75 | #' 
 76 | #' Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to
 77 | #' Escherichia coli. F1000 Research, 1:19.
 78 | #' 
 79 | #' Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464.
 80 | #' 
 81 | #' @author Lars Snipen and Kristian Hovde Liland.
 82 | #' 
 83 | #' @seealso \code{\link{panMatrix}}, \code{\link{chao}}.
 84 | #' 
 85 | #' @examples
 86 | #' # Loading an example pan-matrix
 87 | #' data(xmpl.panmat)
 88 | #' 
 89 | #' # Estimating binomial mixture models
 90 | #' binmix.lst <- binomixEstimate(xmpl.panmat, K.range = 3:8)
 91 | #' print(binmix.lst$BIC.tbl) # minimum BIC at 3 components
 92 | #' 
 93 | #' \dontrun{
 94 | #' # The pan-genome gene distribution as a pie-chart
 95 | #' library(ggplot2)
 96 | #' ncomp <- 3
 97 | #' binmix.lst$Mix.tbl %>% 
 98 | #'   filter(Components == ncomp) %>% 
 99 | #'   ggplot() +
100 | #'   geom_col(aes(x = "", y = Mixing.proportion, fill = Detection.prob)) +
101 | #'   coord_polar(theta = "y") +
102 | #'   labs(x = "", y = "", title = "Pan-genome gene distribution") +
103 | #'   scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue"))
104 | #'   
105 | #' # The distribution in an average genome
106 | #' binmix.lst$Mix.tbl %>% 
107 | #'   filter(Components == ncomp) %>% 
108 | #'   mutate(Single = Mixing.proportion * Detection.prob) %>%
109 | #'   ggplot() +
110 | #'   geom_col(aes(x = "", y = Single, fill = Detection.prob)) +
111 | #'   coord_polar(theta = "y") +
112 | #'   labs(x = "", y = "", title = "Average genome gene distribution") +
113 | #'   scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue"))
114 | #' }
115 | #' 
116 | #' @importFrom stringr str_c
117 | #' @importFrom tibble tibble
118 | #' 
119 | #' @export binomixEstimate
120 | #' 
121 | binomixEstimate <- function(pan.matrix, K.range = 3:5, core.detect.prob = 1.0, verbose = TRUE){
122 |   pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
123 |   y <- table(factor(colSums(pan.matrix), levels = 1:nrow(pan.matrix)))
124 |   bic.mat <- matrix(c(K.range, rep(0, 3*length(K.range))), ncol = 4)
125 |   colnames(bic.mat) <- c("K.range", "Core.size", "Pan.size", "BIC")
126 |   mix.tbl <- NULL
127 |   for(i in 1:length(K.range)){
128 |     if(verbose) cat("binomixEstimate: Fitting", K.range[i], "component model...\n")
129 |     lst <- binomixMachine(y, K.range[i], core.detect.prob)
130 |     bic.mat[i,-1] <- lst[[1]]
131 |     mix.tbl <- bind_rows(mix.tbl, lst[[2]])
132 |   }
133 |   bic.tbl <- as_tibble(bic.mat)
134 |   if(bic.tbl[length(K.range),3] == min(bic.tbl[,3])) warning("Minimum BIC at maximum K, increase upper limit of K.range")
135 |   return(list(BIC.tbl = bic.tbl, Mix.tbl = mix.tbl))
136 | }
137 | 
138 | 
139 | #' @importFrom stats constrOptim as.dendrogram dendrapply dist is.leaf optim prcomp sd
140 | #' @importFrom tibble tibble
141 | binomixMachine <- function(y, K, core.detect.prob = 1.0){
142 |   n <- sum(y)
143 |   G <- length(y)
144 |   ctr <- list(maxit = 300, reltol = 1e-6)
145 |   np <- K - 1
146 |     
147 |   pmix0 <- rep(1, np)/K             # flat mixture proportions
148 |   pdet0 <- (1:np)/(np+1)            # "all" possible detection probabilities
149 |   p.initial <- c(pmix0, pdet0)      # initial values for parameters
150 |   # the inequality constraints...
151 |   A <- rbind(c( rep(1, np), rep(0, np)), c(rep(-1, np), rep(0, np)), diag(np+np), -1*diag(np+np))
152 |   b <- c(0, -1, rep(0, np+np), rep(-1, np+np))
153 |   
154 |   # The estimation, maximizing the negative truncated log-likelihood function
155 |   est <- constrOptim(theta = p.initial, f = negTruncLogLike, grad = NULL, method = "Nelder-Mead", control = ctr, ui = A, ci = b,
156 |                      y = y, core.p = core.detect.prob)
157 |   
158 |   estimates <- numeric(3)
159 |   names(estimates) <- c("Core.size", "Pan.size", "BIC")
160 |   estimates[3] <- 2*est$value + log(n)*(np+K)                       # the BIC-criterion
161 |   p.mix <- c(1 - sum(est$par[1:np]), est$par[1:np])                 # the mixing proportions
162 |   p.det <- c(core.detect.prob, est$par[(np+1):length( est$par )])   # the detection probabilities
163 |   ixx <- order(p.det)
164 |   p.det <- p.det[ixx]
165 |   p.mix <- p.mix[ixx]
166 |     
167 |   theta_0 <- choose(G, 0) * sum(p.mix * (1-p.det)^G)
168 |   y_0 <- n * theta_0/(1-theta_0)
169 |   estimates[2] <- n + round(y_0)
170 |   ixx <- which(p.det >= core.detect.prob)
171 |   estimates[1] <- round(estimates[2] * sum(p.mix[ixx]))
172 |   mix.tbl <- tibble(Components = rep(K, length(p.det)),
173 |                     Detection.prob = p.det,
174 |                     Mixing.proportion = p.mix)
175 |   return(list(estimates, mix.tbl))
176 | }
177 | 
178 | negTruncLogLike <- function(p, y, core.p){
179 |   np <- length(p)/2
180 |   p.det <- c(core.p, p[(np+1):length(p)])
181 |   p.mix <- c(1-sum(p[1:np]), p[1:np])
182 |   G <- length(y)
183 |   K <- length(p.mix)
184 |   n <- sum(y)
185 |     
186 |   theta_0 <- choose(G, 0) * sum(p.mix * (1-p.det)^G)
187 |   L <- -n * log(1 - theta_0)
188 |   for(g in 1:G){
189 |     theta_g <- choose(G, g) * sum(p.mix * p.det^g * (1-p.det)^(G-g))
190 |     L <- L + y[g] * log(theta_g)
191 |   }
192 |   return(-L)
193 | }
194 | 


--------------------------------------------------------------------------------
/R/blasting.R:
--------------------------------------------------------------------------------
  1 | #' @name blastpAllAll
  2 | #' @title Making BLAST search all against all genomes
  3 | #' 
  4 | #' @description Runs a reciprocal all-against-all BLAST search to look for similarity of proteins
  5 | #' within and across genomes.
  6 | #' 
  7 | #' @param prot.files A vector with FASTA filenames.
  8 | #' @param out.folder The folder where the result files should end up.
  9 | #' @param e.value The chosen E-value threshold in BLAST.
 10 | #' @param job An integer to separate multiple jobs.
 11 | #' @param start.at An integer to specify where in the file-list to start BLASTing.
 12 | #' @param threads The number of CPU's to use.
 13 | #' @param verbose Logical, if \code{TRUE} some text output is produced to monitor the progress.
 14 | #' 
 15 | #' @details A basic step in pangenomics and many other comparative studies is to cluster proteins into
 16 | #' groups or families. One commonly used approach is based on BLASTing. This function uses the
 17 | #' \samp{blast+} software available for free from NCBI (Camacho et al, 2009). More precisely, the blastp
 18 | #' algorithm with the BLOSUM45 scoring matrix and all composition based statistics turned off.
 19 | #' 
 20 | #' A vector listing FASTA files of protein sequences is given as input in \samp{prot.files}. These files
 21 | #' must have the genome_id in the first token of every header, and in their filenames as well, i.e. all input
 22 | #' files should first be prepared by \code{\link{panPrep}} to ensure this. Note that only protein sequences
 23 | #' are considered here. If your coding genes are stored as DNA, please translate them to protein prior to
 24 | #' using this function, see \code{\link[microseq]{translate}}.
 25 | #' 
 26 | #' In the first version of this package we used reciprocal BLASTing, i.e. we computed both genome A against
 27 | #' B and B against A. This may sometimes produce slightly different results, but in reality this is too
 28 | #' costly compared to its gain, and we now only make one of the above searches. This basically halves the
 29 | #' number of searches. This step is still very time consuming for larger number of genomes. Note that the 
 30 | #' protein files are sorted by the genome_id (part of filename) inside this function. This is to ensure a 
 31 | #' consistent ordering irrespective of how they are enterred.
 32 | #' 
 33 | #' For every pair of genomes a result file is produced. If two genomes have genome_id's \samp{GID111},
 34 | #' and \samp{GID222} then the result file \samp{GID222_vs_GID111.txt} will
 35 | #' be found in \samp{out.folder} after the completion of this search. The last of the two genome_id is always
 36 | #' the first in alphabetical order of the two.
 37 | #' 
 38 | #' The \samp{out.folder} is scanned for already existing result files, and \code{\link{blastpAllAll}} never
 39 | #' overwrites an existing result file. If a file with the name \samp{GID111_vs_GID222.txt} already exists in
 40 | #' the \samp{out.folder}, this particular search is skipped. This makes it possible to run multiple jobs in
 41 | #' parallell, writing to the same \samp{out.folder}. It also makes it possible to add new genomes, and only
 42 | #' BLAST the new combinations without repeating previous comparisons. 
 43 | #' 
 44 | #' This search can be slow if the genomes contain many proteins and it scales quadratically in the number of
 45 | #' input files. It is best suited for the study of a smaller number of genomes. By
 46 | #' starting multiple R sessions, you can speed up the search by running \code{\link{blastpAllAll}} from each R
 47 | #' session, using the same \samp{out.folder} but different integers for the \code{job} option. At the same
 48 | #' time you may also want to start the BLASTing at different places in the file-list, by giving larger values
 49 | #' to the argument \code{start.at}. This is 1 by default, i.e. the BLASTing starts at the first protein file.
 50 | #' If you are using a multicore computer you can also increase the number of CPUs by increasing \code{threads}.
 51 | #' 
 52 | #' The result files are tab-separated text files, and can be read into R, but more
 53 | #' commonly they are used as input to \code{\link{bDist}} to compute distances between sequences for subsequent
 54 | #' clustering.
 55 | #' 
 56 | #' @return The function produces a result file for each pair of files listed in \samp{prot.files}.
 57 | #' These result files are located in \code{out.folder}. Existing files are never overwritten by
 58 | #' \code{\link{blastpAllAll}}, if you want to re-compute something, delete the corresponding result files first.
 59 | #' 
 60 | #' @references Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L.
 61 | #' (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421.
 62 | #' 
 63 | #' @note The \samp{blast+} software must be installed on the system for this function to work, i.e. the command
 64 | #' \samp{system("makeblastdb -help")} must be recognized as valid commands if you
 65 | #' run them in the Console window.
 66 | #' 
 67 | #' @author Lars Snipen and Kristian Hovde Liland.
 68 | #' 
 69 | #' @seealso \code{\link{panPrep}}, \code{\link{bDist}}.
 70 | #' 
 71 | #' @examples 
 72 | #' \dontrun{
 73 | #' # This example requires the external BLAST+ software
 74 | #' # Using protein files in this package
 75 | #' pf <- file.path(path.package("micropan"), "extdata",
 76 | #'                 str_c("xmpl_GID", 1:3, ".faa.xz"))
 77 | #' 
 78 | #' # We need to uncompress them first...
 79 | #' prot.files <- tempfile(fileext = c("_GID1.faa.xz","_GID2.faa.xz","_GID3.faa.xz"))
 80 | #' ok <- file.copy(from = pf, to = prot.files)
 81 | #' prot.files <- unlist(lapply(prot.files, xzuncompress))
 82 | #' 
 83 | #' # Blasting all versus all
 84 | #' out.dir <- "."
 85 | #' blastpAllAll(prot.files, out.folder = out.dir)
 86 | #' 
 87 | #' # Reading results, and computing blast.distances
 88 | #' blast.files <- list.files(out.dir, pattern = "GID[0-9]+_vs_GID[0-9]+.txt")
 89 | #' blast.distances <- bDist(file.path(out.dir, blast.files))
 90 | #' 
 91 | #' # ...and cleaning...
 92 | #' ok <- file.remove(prot.files)
 93 | #' ok <- file.remove(file.path(out.dir, blast.files))
 94 | #' }
 95 | #' 
 96 | #' @importFrom stringr str_extract str_c
 97 | #' 
 98 | #' @export blastpAllAll
 99 | blastpAllAll <- function(prot.files, out.folder, e.value = 1, job = 1,
100 |                          threads = 1, start.at = 1, verbose = TRUE){
101 |   if(available.external("blast+")){
102 |     N <- length(prot.files)
103 |     genome_id <- str_extract(prot.files, "GID[0-9]+")
104 |     prot.files <- prot.files[order(genome_id)]
105 |     genome_id <- genome_id[order(genome_id)]
106 |     prot.files <- prot.files[start.at:N]
107 |     genome_id <- genome_id[start.at:N]
108 |     N <- length(prot.files)
109 |     if(.Platform$OS.type == "windows"){
110 |       outfmt <- "\"6 qseqid sseqid evalue bitscore\""
111 |     } else {
112 |       outfmt <- "'6 qseqid sseqid evalue bitscore'"
113 |     }
114 |     #out.folder <- normalizePath(out.folder)  # do NOT normalize this path!
115 |     file.tbl <- data.frame(Dbase = rep("", (N^2 + N)/2),
116 |                            Query = rep("", (N^2 + N)/2),
117 |                            Res.file = rep("", (N^2 + N)/2),
118 |                            stringsAsFactors = FALSE)
119 |     cc <- 1
120 |     for(i in 1:N){
121 |       for(j in i:N){
122 |         file.tbl$Dbase[cc] <- prot.files[i]
123 |         file.tbl$Query[cc] <- prot.files[j]
124 |         file.tbl$Res.file[cc] <- str_c(genome_id[j], "_vs_", genome_id[i], ".txt")
125 |         cc <- cc + 1
126 |       }
127 |     }
128 |     existing.files <- list.files(out.folder, pattern = "txt$")
129 |     file.tbl %>% 
130 |       filter(!(.data$Res.file %in% existing.files)) -> file.tbl
131 |     if(nrow(file.tbl) > 0){
132 |       dbases <- unique(file.tbl$Dbase)
133 |       for(i in 1:length(dbases)){
134 |         log.fil <- file.path(out.folder, str_c("log", job, ".txt"))
135 |         db.fil <- file.path(out.folder, str_c("blastDB", job))
136 |         if(verbose) cat("blastpAllAll: Making BLAST database of", dbases[i], "\n")
137 |         system(paste("makeblastdb -logfile", log.fil, "-dbtype prot -out", db.fil, "-in", dbases[i]))
138 |         file.tbl %>% 
139 |           filter(.data$Dbase == dbases[i]) -> tbl
140 |         for(j in 1:nrow(tbl)){
141 |           out.file <- file.path(out.folder, tbl$Res.file[j])
142 |           if(!file.exists(out.file)){
143 |             if(verbose) cat("   ", tbl$Res.file[j], "\n")
144 |             cmd <- paste("blastp",
145 |                          "-matrix BLOSUM45",
146 |                          "-evalue", e.value,
147 |                          "-num_threads", threads,
148 |                          "-comp_based_stats", "F",
149 |                          "-num_alignments", 1000,
150 |                          "-outfmt", outfmt,
151 |                          "-query", tbl$Query[j],
152 |                          "-db", db.fil,
153 |                          "-out", out.file)
154 |             system(cmd)
155 |           }
156 |         }
157 |       }
158 |       ok <- file.remove(list.files(out.folder, pattern = str_c("blastDB", job), full.names = T))
159 |       ok <- file.remove(log.fil)
160 |       if(file.exists(str_c(log.fil, ".perf"))) ok <- file.remove(str_c(log.fil, ".perf"))
161 |     }
162 |     invisible(TRUE)
163 |   }
164 | }
165 | 


--------------------------------------------------------------------------------
/R/domainclust.R:
--------------------------------------------------------------------------------
  1 | #' @name dClust
  2 | #' @title Clustering sequences based on domain sequence
  3 | #' 
  4 | #' @description Proteins are clustered by their sequence of protein domains. A domain sequence is the
  5 | #' ordered sequence of domains in the protein. All proteins having identical domain sequence are assigned
  6 | #' to the same cluster.
  7 | #' 
  8 | #' @param hmmer.tbl A \code{tibble} of results from a \code{\link{hmmerScan}} against a domain database.
  9 | #' 
 10 | #' @details A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins
 11 | #' contain known domains, but those who do will have from one to several domains, and these can be ordered
 12 | #' forming a sequence. Since domains can be more or less conserved, two proteins can be quite different in
 13 | #' their amino acid sequence, and still share the same domains. Describing, and grouping, proteins by their
 14 | #' domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise
 15 | #' alignments, see \code{\link{bClust}}. Domain sequence clusters are less influenced by gene prediction errors.
 16 | #' 
 17 | #' The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}. Typically, it is the
 18 | #' result of scanning proteins (using \code{\link{hmmerScan}}) against Pfam-A or any other HMMER3 database
 19 | #' of protein domains. It is highly reccomended that you remove overlapping hits in \samp{hmmer.tbl} before
 20 | #' you pass it as input to \code{\link{dClust}}. Use the function \code{\link{hmmerCleanOverlap}} for this.
 21 | #' Overlapping hits are in some cases real hits, but often the poorest of them are artifacts.
 22 | #' 
 23 | #' @return The output is a numeric vector with one element for each unique sequence in the \samp{Query}
 24 | #' column of the input \samp{hmmer.tbl}. Sequences with identical number belong to the same cluster. The
 25 | #' name of each element identifies the sequence.
 26 | #' 
 27 | #' This vector also has an attribute called \samp{cluster.info} which is a character vector containing the
 28 | #' domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc.
 29 | #' In this way you can, in addition to clustering the sequences, also see which domains the sequences of a
 30 | #' particular cluster share.
 31 | #' 
 32 | #' @references Snipen, L. Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications
 33 | #' to Escherichia coli. F1000 Research, 1:19.
 34 | #' 
 35 | #' @author Lars Snipen and Kristian Hovde Liland.
 36 | #' 
 37 | #' @seealso \code{\link{panPrep}}, \code{\link{hmmerScan}}, \code{\link{readHmmer}},
 38 | #' \code{\link{hmmerCleanOverlap}}, \code{\link{bClust}}.
 39 | #' 
 40 | #' @examples 
 41 | #' # HMMER3 result files in this package
 42 | #' hf <- file.path(path.package("micropan"), "extdata", 
 43 | #'                 str_c("GID", 1:3, "_vs_microfam.hmm.txt.xz"))
 44 | #' 
 45 | #' # We need to uncompress them first...
 46 | #' hmm.files <- tempfile(fileext = rep(".xz", length(hf)))
 47 | #' ok <- file.copy(from = hf, to = hmm.files)
 48 | #' hmm.files <- unlist(lapply(hmm.files, xzuncompress))
 49 | #' 
 50 | #' # Reading the HMMER3 results, cleaning overlaps...
 51 | #' hmmer.tbl <- NULL
 52 | #' for(i in 1:3){
 53 | #'   readHmmer(hmm.files[i]) %>% 
 54 | #'     hmmerCleanOverlap() %>% 
 55 | #'     bind_rows(hmmer.tbl) -> hmmer.tbl
 56 | #' }
 57 | #' 
 58 | #' # The clustering
 59 | #' clst <- dClust(hmmer.tbl)
 60 | #' 
 61 | #' # ...and cleaning...
 62 | #' ok <- file.remove(hmm.files)
 63 | #' 
 64 | #' @importFrom dplyr arrange group_by summarize mutate
 65 | #' @importFrom rlang .data
 66 | #' 
 67 | #' @export dClust
 68 | #' 
 69 | dClust <- function(hmmer.tbl){
 70 |   hmmer.tbl %>% 
 71 |     arrange(.data$Start) %>% 
 72 |     group_by(.data$Query) %>% 
 73 |     summarize(Dom.seq = str_c(.data$Hit, collapse = ",")) %>% 
 74 |     mutate(Cluster = as.integer(factor(.data$Dom.seq, levels = unique(.data$Dom.seq)))) -> tbl
 75 | 
 76 |   dsc <- tbl$Cluster
 77 |   names(dsc) <- tbl$Query
 78 |   attr(dsc, "cluster.info") <- unique(tbl$Dom.seq)
 79 |   return(dsc)
 80 | }
 81 | 
 82 | 
 83 | #' @name hmmerCleanOverlap
 84 | #' @title Removing overlapping hits from HMMER3 scans
 85 | #' 
 86 | #' @description Removing hits to avoid overlapping HMMs on the same protein sequence.
 87 | #' 
 88 | #' @param hmmer.tbl A table (\code{tibble}) with \code{\link{hmmerScan}} results, see \code{\link{readHmmer}}.
 89 | #' 
 90 | #' @details  When scanning sequences against a profile HMM database using \code{\link{hmmerScan}}, we
 91 | #' often find that several patterns (HMMs) match in the same region of the query sequence, i.e. we have
 92 | #' overlapping hits. The function \code{\link{hmmerCleanOverlap}} will remove the poorest overlapping hit
 93 | #' in a recursive way such that all overlaps are eliminated.
 94 | #' 
 95 | #' The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}.
 96 | #' 
 97 | #' @return A \code{tibble} which is a subset of the input, where some rows may have been deleted to
 98 | #' avoid overlapping hits.
 99 | #' 
100 | #' @author Lars Snipen and Kristian Hovde Liland.
101 | #' 
102 | #' @seealso \code{\link{hmmerScan}}, \code{\link{readHmmer}}, \code{\link{dClust}}.
103 | #' 
104 | #' @examples # See the example in the Help-file for dClust.
105 | #' 
106 | #' @importFrom dplyr filter select %>% slice
107 | #' @importFrom rlang .data
108 | #' 
109 | #' @export hmmerCleanOverlap
110 | #' 
111 | hmmerCleanOverlap <- function(hmmer.tbl){
112 |   qt <- table(hmmer.tbl$Query)
113 |   if(max(qt) > 1){
114 |     multi <- names(qt[qt > 1])
115 |     hmmer.tbl$Keep <- TRUE
116 |     for(i in 1:length(multi)){
117 |       idx <- which(hmmer.tbl$Query == multi[i])
118 |       hmmer.tbl$Keep[idx] <- keeper(hmmer.tbl[idx,])
119 |     }
120 |     hmmer.tbl %>% 
121 |       filter(.data$Keep) %>% 
122 |       select(-.data$Keep) -> hmmer.tbl
123 |   }
124 |   return(hmmer.tbl)
125 | } 
126 | 
127 | 
128 | 
129 | # Local functions
130 | keeper<- function(hmmer.tbl){
131 |   hmmer.tbl$Overlaps <- overlapper(hmmer.tbl)
132 |   while((sum(hmmer.tbl$Keep) > 1) & (sum(hmmer.tbl$Overlaps) > 0)){
133 |     idx <- which(hmmer.tbl$Overlaps)
134 |     idd <- which(hmmer.tbl$Evalue[idx] == max(hmmer.tbl$Evalue[idx]))
135 |     hmmer.tbl$Keep[idx[idd[1]]] <- F
136 |     hmmer.tbl$Overlaps <- overlapper(hmmer.tbl)
137 |   }
138 |   return(hmmer.tbl$Keep)
139 | }
140 | overlapper <- function(hmmer.tbl){
141 |   olaps <- rep(FALSE, nrow(hmmer.tbl))
142 |   idx <- which(hmmer.tbl$Keep)
143 |   if(length(idx) > 1){
144 |     ht <- slice(hmmer.tbl, idx)
145 |     for(i in 1:nrow(ht)){
146 |       ovr <- ((ht$Start[i] <= ht$Stop[-i]) & (ht$Start[i] >= ht$Start[-i])) |
147 |         ((ht$Stop[i] <= ht$Stop[-i]) & (ht$Stop[i] >= ht$Start[-i])) |
148 |         ((ht$Start[i] <= ht$Start[-i]) & (ht$Stop[i] >= ht$Stop[-i]))
149 |       olaps[idx[i]] <- (length(which(ovr)) > 0)
150 |     }
151 |   }
152 |   return(olaps)
153 | }
154 | 


--------------------------------------------------------------------------------
/R/entrez.R:
--------------------------------------------------------------------------------
  1 | #' @name entrezDownload
  2 | #' @title Downloading genome data
  3 | #' 
  4 | #' @description Retrieving genomes from NCBI using the Entrez programming utilities.
  5 | #' 
  6 | #' @param accession A character vector containing a set of valid accession numbers at the NCBI
  7 | #' Nucleotide database.
  8 | #' @param out.file Name of the file where downloaded sequences should be written in FASTA format.
  9 | #' @param verbose Logical indicating if textual output should be given during execution, to monitor
 10 | #' the download progress.
 11 | #' 
 12 | #' @details The Entrez programming utilities is a toolset for automatic download of data from the
 13 | #' NCBI databases, see \href{https://www.ncbi.nlm.nih.gov/books/NBK25500/}{E-utilities Quick Start}
 14 | #' for details. \code{\link{entrezDownload}} can be used to download genomes from the NCBI Nucleotide
 15 | #' database through these utilities.
 16 | #' 
 17 | #' The argument \samp{accession} must be a set of valid accession numbers at NCBI Nucleotide, typically
 18 | #' all accession numbers related to a genome (chromosomes, plasmids, contigs, etc). For completed genomes,
 19 | #' where the number of sequences is low, \samp{accession} is typically a single text listing all accession
 20 | #' numbers separated by commas. In the case of some draft genomes having a large number of contigs, the
 21 | #' accession numbers must be split into several comma-separated texts. The reason for this is that Entrez
 22 | #' will not accept too many queries in one chunk. 
 23 | #' 
 24 | #' The downloaded sequences are saved in \samp{out.file} on your system. This will be a FASTA formatted file.
 25 | #' Note that all downloaded sequences end up in this file. If you want to download multiple genomes,
 26 | #' you call \code{\link{entrezDownload}} multiple times and store in multiple files.
 27 | #' 
 28 | #' @return The name of the resulting FASTA file is returned (same as \code{out.file}), but the real result of
 29 | #' this function is the creation of the file itself.
 30 | #' 
 31 | #' @author Lars Snipen and Kristian Liland.
 32 | #' 
 33 | #' @seealso \code{\link{getAccessions}}, \code{\link[microseq]{readFasta}}.
 34 | #' 
 35 | #' @examples 
 36 | #' \dontrun{
 37 | #' # Accession numbers for the chromosome and plasmid of Buchnera aphidicola, strain APS
 38 | #' acc <- "BA000003.2,AP001071.1"
 39 | #' genome.file <- tempfile(pattern = "Buchnera_aphidicola", fileext = ".fna")
 40 | #' txt <- entrezDownload(acc, out.file = genome.file)
 41 | #' 
 42 | #' # ...cleaning...
 43 | #' ok <- file.remove(genome.file)
 44 | #' }
 45 | #' 
 46 | #' @importFrom stringr str_c
 47 | #' 
 48 | #' @export entrezDownload
 49 | #' 
 50 | entrezDownload <- function(accession, out.file, verbose = TRUE){
 51 |   if(verbose) cat("Downloading genome...")
 52 |   connect <- file(out.file, open = "w")
 53 |   for(j in 1:length(accession)){
 54 |     adr <- str_c("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide",
 55 |                  "&id=", accession[j],
 56 |                  "&retmode=text",
 57 |                  "&rettype=fasta")
 58 |     entrez <- url(adr, open = "rt")
 59 |     if(isOpen(entrez)){
 60 |       lines <- readLines(entrez)
 61 |       writeLines(lines, con = connect)
 62 |       close(entrez)
 63 |     } else {
 64 |       cat("Download failed: Could not open connection\n")
 65 |     }
 66 |   }
 67 |   close(connect)
 68 |   if(verbose) cat("...sequences saved in", out.file, "\n")
 69 |   return(out.file)
 70 | }
 71 | 
 72 | 
 73 | #' @name getAccessions
 74 | #' @title Collecting contig accession numbers
 75 | #' 
 76 | #' @description Retrieving the accession numbers for all contigs from a master record GenBank file.
 77 | #' 
 78 | #' @param master.record.accession The accession number (single text) to a master record GenBank file having
 79 | #' the WGS entry specifying the accession numbers to all contigs of the WGS genome.
 80 | #' @param chunk.size The maximum number of accession numbers returned in one text.
 81 | #' 
 82 | #' @details In order to download a WGS genome (draft genome) using \code{\link{entrezDownload}} you will
 83 | #' need the accession number of every contig. This is found in the master record GenBank file, which is
 84 | #' available for every WGS genome. \code{\link{getAccessions}} will extract these from the GenBank file and
 85 | #' return them in the apropriate way to be used by \code{\link{entrezDownload}}.
 86 | #' 
 87 | #' The download API at NCBI will not tolerate too many accessions per query, and for this reason you need
 88 | #' to split the accessions for many contigs into several texts using \code{chunk.size}.
 89 | #' 
 90 | #' @return A character vector where each element is a text listing the accession numbers separated by comma.
 91 | #' Each vector element will contain no more than \code{chunk.size} accession numbers, see 
 92 | #' \code{\link{entrezDownload}} for details on this. The vector returned by \code{\link{getAccessions}}
 93 | #' is typically used as input to \code{\link{entrezDownload}}.
 94 | #' 
 95 | #' @author Lars Snipen and Kristian Liland.
 96 | #' 
 97 | #' @seealso \code{\link{entrezDownload}}.
 98 | #' 
 99 | #' @examples
100 | #' \dontrun{
101 | #' # The master record accession for the WGS genome Mycoplasma genitalium, strain G37
102 | #' acc <- getAccessions("AAGX00000000")
103 | #' # Then we use this to download all contigs and save them
104 | #' genome.file <- tempfile(fileext = ".fna")
105 | #' txt <- entrezDownload(acc, out.file = genome.file)
106 | #' 
107 | #' # ...cleaning...
108 | #' ok <- file.remove(genome.file)
109 | #' }
110 | #' 
111 | #' @importFrom stringr str_c str_detect str_extract str_sub str_remove str_split
112 | #' 
113 | #' @export getAccessions
114 | #' 
115 | getAccessions <- function(master.record.accession, chunk.size = 99){
116 |   adrId <- str_c("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore",
117 |                   "&term=", master.record.accession)
118 |   idSearch <- url(adrId, open = "rt")
119 |   if(isOpen(idSearch)){
120 |     idDoc <- readLines(idSearch)
121 |     idLine <- which(str_detect(idDoc, "<Id>+"))[1]
122 |     id <- str_sub(idDoc[idLine], 5, -6)
123 |     close(idSearch)
124 |   } else {
125 |     cat("Download failed: Could not open connection\n")
126 |     close(idSearch)
127 |     return(NULL)
128 |   }
129 | 
130 |   adr <- str_c("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore",
131 |                "&id=", id,
132 |                "&retmode=text",
133 |                "&rettype=gb")
134 |   entrez <- url(adr, open = "rt")
135 |   accessions <- ""
136 |   if(isOpen(entrez)){
137 |     lines <- readLines(entrez)
138 |     close(entrez)
139 |     wgs.line <- str_remove(lines[str_detect(lines, pattern = "WGS   ")], "WGS[ ]+")
140 |     ss <- str_split(wgs.line, pattern = "-", simplify = T)
141 |     head <- str_extract(ss[1], "[A-Z]+[0]+")
142 |     ss.num <- as.numeric(str_remove(ss, "^[A-Z]+[0]+"))
143 |     if(length(ss.num) > 1){
144 |       range <- ss.num[1]:ss.num[2]
145 |     } else {
146 |       range <- ss.num
147 |     }
148 |     ns <- ceiling(length(range)/chunk.size)
149 |     accessions <- character(ns)
150 |     for(j in 1:ns){
151 |       s1 <- (j-1) * chunk.size + 1
152 |       s2 <- min(j * chunk.size, length(range))
153 |       accessions[j] <- str_c(str_c(head, range[s1]:range[s2]), collapse = ",")
154 |     }
155 |   } else {
156 |     close(entrez)
157 |     cat("Download failed: Could not open connection\n")
158 |   }
159 |   return(accessions)
160 | }
161 | 


--------------------------------------------------------------------------------
/R/extern.R:
--------------------------------------------------------------------------------
 1 | 
 2 | 
 3 | ## Non-exported function to gracefully fail when external dependencies are missing.
 4 | available.external <- function(what){
 5 |   if(what == "hmmer"){
 6 |     chr <- NULL
 7 |     try(chr <- system('hmmscan -h', intern = TRUE), silent = TRUE)
 8 |     if(is.null(chr)){
 9 |       stop(paste('hmmer was not found by R.',
10 |                  'Please install hmmer from: http://hmmer.org/download.html',
11 |                  'After installation, re-start R and make sure the hmmer softwares can be run from R by',
12 |                  'the command \'system("hmmscan -h")\'.', sep = '\n'))
13 |       return(FALSE)
14 |     } else {
15 |       return(TRUE)
16 |     }
17 |   } else if(what == "blast+"){
18 |     chr <- NULL
19 |     try(chr <- system('makeblastdb -help', intern = TRUE), silent = TRUE)
20 |     if(is.null(chr)){
21 |       stop(paste('blast+ was not found by R.',
22 |                  'Please install blast+ from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/',
23 |                  'After installation, re-start R and make sure the blast+ softwares can be run from R by',
24 |                  'the command \'system("makeblastdb -help")\'.', sep = '\n'))
25 |       return(FALSE)
26 |     } else {
27 |       return(TRUE)
28 |     }
29 |   }
30 | }
31 | 


--------------------------------------------------------------------------------
/R/extractPanGenes.R:
--------------------------------------------------------------------------------
  1 | #' @name extractPanGenes
  2 | #' @title Extracting genes of same prevalence
  3 | #' 
  4 | #' @description Based on a clustering of genes, this function extracts the genes
  5 | #' occurring in the same number of genomes.
  6 | #' 
  7 | #' @param clustering Named vector of clustering 
  8 | #' @param N.genomes Vector specifying the number of genomes the genes should be in
  9 | #' 
 10 | #' @details Pan-genome studies focus on the gene families obtained by some clustering,
 11 | #' see \code{\link{bClust}} or \code{\link{dClust}}. This function will extract the individual genes from
 12 | #' each genome belonging to gene families found in \code{N.genomes} genomes specified by the user.
 13 | #' Only the sequence tag for each gene is extracted, but the sequences can be added easily, see examples
 14 | #' below.
 15 | #' 
 16 | #' @return A table with columns
 17 | #' \itemize{
 18 | #'   \item cluster. The gene family (integer)
 19 | #'   \item seq_tag. The sequence tag identifying each sequence (text)
 20 | #'   \item N_genomes. The number of genomes in which it is found (integer)
 21 | #' }
 22 | #' 
 23 | #' @author Lars Snipen.
 24 | #' 
 25 | #' @seealso \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{geneFamilies2fasta}}.
 26 | #' 
 27 | #' @examples
 28 | #' # Loading clustering data in this package
 29 | #' data(xmpl.bclst)
 30 | #' 
 31 | #' # Finding genes in 5 genomes
 32 | #' core.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 5)
 33 | #' #...or in a single genome
 34 | #' orfan.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1)
 35 | #' 
 36 | #' \dontrun{
 37 | #' # To add the sequences, assume all protein fasta files are in a folder named faa:
 38 | #' lapply(list.files("faa", full.names = T), readFasta) %>% 
 39 | #'   bind_rows() %>% 
 40 | #'   mutate(seq_tag = word(Header, 1, 1)) %>% 
 41 | #'   right_join(orfan.tbl, by = "seq_tag") -> orfan.tbl
 42 | #'   # The resulting table can be written to fasta file directly using writeFasta()
 43 | #'   # See also geneFamilies2fasta()
 44 | #' }
 45 | #' 
 46 | #' @importFrom dplyr distinct group_by summarize filter %>% mutate select arrange
 47 | #' @importFrom tibble tibble
 48 | #' @importFrom stringr str_extract
 49 | #' @importFrom rlang .data
 50 | #' 
 51 | #' @export extractPanGenes
 52 | #' 
 53 | extractPanGenes <- function(clustering, N.genomes = 1:2){
 54 |   tibble(cluster = clustering,
 55 |          seq_tag = names(clustering),
 56 |          genome_id = str_extract(names(clustering), "GID[0-9]+")) -> tbl
 57 |   names(tbl$cluster) <- NULL
 58 |   tbl %>% 
 59 |     distinct(cluster, genome_id) %>% 
 60 |     group_by(cluster) %>% 
 61 |     summarize(n.genomes = n()) %>% 
 62 |     filter(n.genomes %in% N.genomes) -> trg.tbl
 63 |   
 64 |   out.tbl <- NULL
 65 |   for(i in 1:length(N.genomes)){
 66 |     idx <- which(trg.tbl$n.genomes == N.genomes[i])
 67 |     tbl %>% 
 68 |       filter(cluster %in% trg.tbl$cluster[idx]) %>% 
 69 |       mutate(N_genomes = N.genomes[i]) %>% 
 70 |       select(cluster, seq_tag, N_genomes) %>% 
 71 |       arrange(cluster, seq_tag) %>% 
 72 |       bind_rows(out.tbl) -> out.tbl
 73 |   }
 74 |   return(out.tbl)
 75 | }
 76 | 
 77 | 
 78 | 
 79 | 
 80 | 
 81 | 
 82 | 
 83 | 
 84 | #' @name geneFamilies2fasta
 85 | #' @title Write gene families to files
 86 | #' 
 87 | #' @description Writes specified gene families to separate fasta files.
 88 | #' 
 89 | #' @param pangene.tbl A table listing gene families (clusters). 
 90 | #' @param fasta.folder The folder containing the fasta files with all sequences.
 91 | #' @param out.folder The folder to write to.
 92 | #' @param file.ext The file extension to recognize the fasta files in \code{fasta.folder}.
 93 | #' @param verbose Logical to allow text ouput during processing
 94 | #' 
 95 | #' @details The argument \code{pangene.tbl} should be produced by \code{\link{extractPanGenes}} in order to
 96 | #' contain the columns \code{cluster}, \code{seq_tag} and \code{N_genomes} required by this function. The
 97 | #' files in \code{fasta.folder} must have been prepared by \code{\link{panPrep}} in order to have the proper
 98 | #' sequence tag information. They may contain protein sequences or DNA sequences.
 99 | #' 
100 | #' If you already added the \code{Header} and \code{Sequence} information to \code{pangene.tbl} these will be
101 | #' used instead of reading the files in \code{fasta.folder}, but a warning is issued.
102 | #' 
103 | #' @author Lars Snipen.
104 | #' 
105 | #' @seealso \code{\link{extractPanGenes}}, \code{\link{writeFasta}}.
106 | #' 
107 | #' @examples
108 | #' # Loading clustering data in this package
109 | #' data(xmpl.bclst)
110 | #' 
111 | #' # Finding genes in 1,..,5 genomes (all genes)
112 | #' all.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1:5)
113 | #' 
114 | #' \dontrun{
115 | #' # All protein fasta files are in a folder named faa, and we write to the current folder:
116 | #' clusters2fasta(all.tbl, fasta.folder = "faa", out.folder = ".")
117 | #' 
118 | #' # use pipe, write to folder "orfans"
119 | #' extractPanGenes(xmpl.bclst, N.genomes = 1)) %>% 
120 | #'   geneFamilies2fasta(fasta.folder = "faa", out.folder = "orfans")
121 | #' }
122 | #' 
123 | #' @importFrom dplyr bind_rows distinct filter %>% mutate select right_join
124 | #' @importFrom microseq writeFasta readFasta
125 | #' @importFrom stringr str_extract
126 | #' @importFrom rlang .data
127 | #' 
128 | #' @export geneFamilies2fasta
129 | #' 
130 | geneFamilies2fasta <- function(pangene.tbl, fasta.folder, out.folder, file.ext = "fasta$|faa$|fna$|fa$",
131 |                            verbose = TRUE){
132 |   has.Header <- exists("Header", pangene.tbl)
133 |   has.Sequence <- exists("Sequence", pangene.tbl)
134 |   if(has.Header & has.Sequence){
135 |     warning("pangene.tbl already has sequences, ignoring fasta.folder")
136 |   } else {
137 |     if(has.Header) pangene.tbl %>% select(-Header) -> pangene.tbl
138 |     if(has.Sequence) pangene.tbl %>% select(-Sequence) -> pangene.tbl
139 |     fasta.files <- list.files(fasta.folder, pattern = file.ext, full.names = T)
140 |     if(length(fasta.files) == 0) stop("Found no fasta files in fasta.folder")
141 |     if(verbose) cat("clusters2fasta:\n   found", length(fasta.files), "fasta files...\n")
142 |     lapply(fasta.files, readFasta) %>% 
143 |       bind_rows() %>% 
144 |       mutate(seq_tag = word(Header, 1, 1)) %>% 
145 |       right_join(pangene.tbl, by = "seq_tag") -> pangene.tbl
146 |     if(sum(is.na(pangene.tbl$Sequence))) stop("Cannot bind these sequences to the supplied genes, mismatching sequence tags")
147 |   }
148 |   pangene.tbl %>% 
149 |     distinct(cluster, N_genomes) %>% 
150 |     by(1:nrow(.), function(x){str_c("Genome=", x[2], "_Cluster=", x[1], ".fasta")}) %>% 
151 |     as.character() -> filenames
152 |   ucls <- unique(pangene.tbl$cluster)
153 |   if(verbose) cat("   writing", length(ucls), "clusters to files...\n")
154 |   for(i in 1:length(ucls)){
155 |     pangene.tbl %>% 
156 |       filter(cluster == ucls[i]) %>% 
157 |       writeFasta(out.file = file.path(out.folder, filenames[i]))
158 |   }
159 | }


--------------------------------------------------------------------------------
/R/genomedistances.R:
--------------------------------------------------------------------------------
  1 | #' @name fluidity
  2 | #' @title Computing genomic fluidity for a pan-genome
  3 | #' 
  4 | #' @description Computes the genomic fluidity, which is a measure of population diversity.
  5 | #' 
  6 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
  7 | #' @param n.sim An integer specifying the number of random samples to use in the computations.
  8 | #' 
  9 | #' @details  The genomic fluidity between two genomes is defined as the number of unique gene
 10 | #' families divided by the total number of gene families (Kislyuk et al, 2011). This is averaged
 11 | #' over \samp{n.sim} random pairs of genomes to obtain a population estimate.
 12 | #' 
 13 | #' The genomic fluidity between two genomes describes their degree of overlap with respect to gene
 14 | #' cluster content. If the fluidity is 0.0, the two genomes contain identical gene clusters. If it
 15 | #' is 1.0 the two genomes are non-overlapping. The difference between a Jaccard distance (see
 16 | #' \code{\link{distJaccard}}) and genomic fluidity is small, they both measure overlap between
 17 | #' genomes, but fluidity is computed for the population by averaging over many pairs, while Jaccard
 18 | #' distances are computed for every pair. Note that only presence/absence of gene clusters are
 19 | #' considered, not multiple occurrences.
 20 | #' 
 21 | #' The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}.
 22 | #' 
 23 | #' @return A vector with two elements, the mean fluidity and its sample standard deviation over
 24 | #' the \samp{n.sim} computed values.
 25 | #' 
 26 | #' @references Kislyuk, A.O., Haegeman, B., Bergman, N.H., Weitz, J.S. (2011). Genomic fluidity:
 27 | #' an integrative view of gene diversity within microbial populations. BMC Genomics, 12:32.
 28 | #' 
 29 | #' @author Lars Snipen and Kristian Hovde Liland.
 30 | #' 
 31 | #' @seealso \code{\link{panMatrix}}, \code{\link{distJaccard}}.
 32 | #' 
 33 | #' @examples 
 34 | #' # Loading a pan-matrix in this package
 35 | #' data(xmpl.panmat)
 36 | #' 
 37 | #' # Fluidity based on this pan-matrix
 38 | #' fluid <- fluidity(xmpl.panmat)
 39 | #' 
 40 | #' @importFrom stats sd
 41 | #' 
 42 | #' @export fluidity
 43 | #' 
 44 | fluidity <- function(pan.matrix, n.sim = 10){
 45 |   pan.matrix[which(pan.matrix > 0, arr.ind=T)] <- 1
 46 |   flu <- rep(0, n.sim)
 47 |   for(i in 1:n.sim){
 48 |     ii <- sample(nrow(pan.matrix), 2)
 49 |     flu[i] <- (sum(pan.matrix[ii[1],] > 0 & pan.matrix[ii[2],] == 0)
 50 |                + sum(pan.matrix[ii[1],] == 0 & pan.matrix[ii[2],] > 0)) / (sum(pan.matrix[ii[1],]) + sum(pan.matrix[ii[2],]))
 51 |   }
 52 |   flu.vec <- c(Mean = mean(flu), Std = sd(flu))
 53 |   return(flu.vec)
 54 | }
 55 | 
 56 | 
 57 | #' @name distJaccard
 58 | #' @title Computing Jaccard distances between genomes
 59 | #' 
 60 | #' @description Computes the Jaccard distances between all pairs of genomes.
 61 | #' 
 62 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
 63 | #' 
 64 | #' @details The Jaccard index between two sets is defined as the size of the intersection of
 65 | #' the sets divided by the size of the union. The Jaccard distance is simply 1 minus the Jaccard index.
 66 | #' 
 67 | #' The Jaccard distance between two genomes describes their degree of overlap with respect to gene
 68 | #' cluster content. If the Jaccard distance is 0.0, the two genomes contain identical gene clusters.
 69 | #' If it is 1.0 the two genomes are non-overlapping. The difference between a genomic fluidity (see
 70 | #' \code{\link{fluidity}}) and a Jaccard distance is small, they both measure overlap between genomes,
 71 | #' but fluidity is computed for the population by averaging over many pairs, while Jaccard distances are
 72 | #' computed for every pair. Note that only presence/absence of gene clusters are considered, not multiple
 73 | #' occurrences.
 74 | #' 
 75 | #' The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}.
 76 | #' 
 77 | #' @return A \code{dist} object (see \code{\link{dist}}) containing all pairwise Jaccard distances
 78 | #' between genomes.
 79 | #' 
 80 | #' @author Lars Snipen and Kristian Hovde Liland.
 81 | #' 
 82 | #' @seealso \code{\link{panMatrix}}, \code{\link{fluidity}}, \code{\link{dist}}.
 83 | #' 
 84 | #' @examples
 85 | #' # Loading a pan-matrix in this package
 86 | #' data(xmpl.panmat)
 87 | #' 
 88 | #' # Jaccard distances
 89 | #' Jdist <- distJaccard(xmpl.panmat)
 90 | #' 
 91 | #' # Making a dendrogram based on the distances,
 92 | #' # see example for distManhattan
 93 | #' 
 94 | #' @export distJaccard
 95 | #' 
 96 | distJaccard <- function(pan.matrix){
 97 |   pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
 98 |   D <- matrix(0, nrow = nrow(pan.matrix), ncol = nrow(pan.matrix))
 99 |   rownames(D) <- colnames(D) <- rownames(pan.matrix)
100 |   for(i in 1:(nrow(pan.matrix) - 1)){
101 |     for(j in (i+1):nrow(pan.matrix)){
102 |       cs <- pan.matrix[i,] + pan.matrix[j,]
103 |       D[j,i] <- D[i,j] <- 1 - sum(cs > 1)/sum(cs > 0)
104 |     }
105 |   }
106 |   return(as.dist(D))
107 | }
108 | 
109 | 
110 | #' @name distManhattan
111 | #' @title Computing Manhattan distances between genomes
112 | #' 
113 | #' @description Computes the (weighted) Manhattan distances beween all pairs of genomes.
114 | #' 
115 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
116 | #' @param scale An optional scale to control how copy numbers should affect the distances.
117 | #' @param weights Vector of optional weights of gene clusters.
118 | #' 
119 | #' @details The Manhattan distance is defined as the sum of absolute elementwise differences between
120 | #' two vectors. Each genome is represented as a vector (row) of integers in \samp{pan.matrix}. The
121 | #' Manhattan distance between two genomes is the sum of absolute difference between these rows. If
122 | #' two rows (genomes) of the \samp{pan.matrix} are identical, the corresponding Manhattan distance
123 | #' is \samp{0.0}.
124 | #' 
125 | #' The \samp{scale} can be used to control how copy number differences play a role in the distances
126 | #' computed. Usually we assume that going from 0 to 1 copy of a gene is the big change of the genome,
127 | #' and going from 1 to 2 (or more) copies is less. Prior to computing the Manhattan distance, the
128 | #' \samp{pan.matrix} is transformed according to the following affine mapping: If the original value in
129 | #' \samp{pan.matrix} is \samp{x}, and \samp{x} is not 0, then the transformed value is \samp{1 + (x-1)*scale}.
130 | #' Note that with \samp{scale=0.0} (default) this will result in 1 regardless of how large \samp{x} was.
131 | #' In this case the Manhattan distance only distinguish between presence and absence of gene clusters.
132 | #' If \samp{scale=1.0} the value \samp{x} is left untransformed. In this case the difference between 1
133 | #' copy and 2 copies is just as big as between 1 copy and 0 copies. For any \samp{scale} between 0.0 and
134 | #' 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still
135 | #' present. In this way you can decide if the distances between genomes should be affected, and to what
136 | #' degree, by differences in copy numbers beyond 1. Notice that as long as \samp{scale=0.0} (and no
137 | #' weighting) the Manhattan distance has a nice interpretation, namely the number of gene clusters that
138 | #' differ in present/absent status between two genomes.
139 | #' 
140 | #' When summing the difference across gene clusters we can also up- or downweight some clusters compared
141 | #' to others. The vector \samp{weights} must contain one value for each column in \samp{pan.matrix}. The
142 | #' default is to use flat weights, i.e. all clusters count equal. See \code{\link{geneWeights}} for
143 | #' alternative weighting strategies.
144 | #' 
145 | #' @return A \code{dist} object (see \code{\link{dist}}) containing all pairwise Manhattan distances
146 | #' between genomes.
147 | #' 
148 | #' @author Lars Snipen and Kristian Hovde Liland.
149 | #' 
150 | #' @seealso \code{\link{panMatrix}}, \code{\link{distJaccard}}, \code{\link{geneWeights}}.
151 | #' 
152 | #' @examples 
153 | #' # Loading a pan-matrix in this package
154 | #' data(xmpl.panmat)
155 | #' 
156 | #' # Manhattan distances between genomes
157 | #' Mdist <- distManhattan(xmpl.panmat)
158 | #' 
159 | #' \dontrun{
160 | #' # Making a dendrogram based on shell-weighted distances
161 | #' library(ggdendro)
162 | #' weights <- geneWeights(xmpl.panmat, type = "shell")
163 | #' Mdist <- distManhattan(xmpl.panmat, weights = weights)
164 | #' ggdendrogram(dendro_data(hclust(Mdist, method = "average")),
165 | #'   rotate = TRUE, theme_dendro = FALSE) +
166 | #'   labs(x = "Genomes", y = "Shell-weighted Manhattan distance", title = "Pan-genome dendrogram")
167 | #' }
168 | #' 
169 | #' @importFrom stats dist
170 | #' 
171 | #' @export distManhattan
172 | #' 
173 | distManhattan <- function(pan.matrix, scale = 0.0, weights = rep(1, ncol(pan.matrix))){
174 |   if((scale > 1) | (scale < 0)){
175 |     warning( "scale should be between 0.0 and 1.0, using scale = 0.0" )
176 |     scale <- 0.0
177 |   }
178 |   idx <- which(pan.matrix > 0, arr.ind = T)
179 |   pan.matrix[idx] <- 1 + (pan.matrix[idx] - 1) * scale
180 |   pan.matrix <- t(t(pan.matrix) * weights)
181 |   return(dist(pan.matrix, method = "manhattan"))
182 | }
183 | 
184 | 
185 | #' @name geneWeights
186 | #' @title Gene cluster weighting
187 | #' 
188 | #' @description This function computes weights for gene cluster according to their distribution in a pan-genome.
189 | #' 
190 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
191 | #' @param type A text indicating the weighting strategy.
192 | #' 
193 | #' @details When computing distances between genomes or a PCA, it is possible to give weights to the
194 | #' different gene clusters, emphasizing certain aspects.
195 | #' 
196 | #' As proposed by Snipen & Ussery (2010), we have implemented two types of weighting: The default
197 | #' \samp{"shell"} type means gene families occuring frequently in the genomes, denoted shell-genes, are
198 | #' given large weight (close to 1) while those occurring rarely are given small weight (close to 0).
199 | #' The opposite is the \samp{"cloud"} type of weighting. Genes observed in a minority of the genomes are
200 | #' referred to as cloud-genes. Presumeably, the \samp{"shell"} weighting will give distances/PCA reflecting
201 | #' a more long-term evolution, since emphasis is put on genes who have just barely diverged away from the
202 | #' core. The \samp{"cloud"} weighting emphasizes those gene clusters seen rarely. Genomes with similar
203 | #' patterns among these genes may have common recent history. A \samp{"cloud"} weighting typically gives
204 | #' a more erratic or \sQuote{noisy} picture than the \samp{"shell"} weighting.
205 | #' 
206 | #' @return A vector of weights, one for each column in \code{pan.matrix}.
207 | #' 
208 | #' @references Snipen, L., Ussery, D.W. (2010). Standard operating procedure for computing pangenome
209 | #' trees. Standards in Genomic Sciences, 2:135-141.
210 | #' 
211 | #' @author Lars Snipen and Kristian Hovde Liland.
212 | #' 
213 | #' @seealso \code{\link{panMatrix}}, \code{\link{distManhattan}}.
214 | #' 
215 | #' @examples 
216 | #' # See examples for distManhattan
217 | #' 
218 | #' @export geneWeights
219 | #' 
220 | geneWeights <- function(pan.matrix, type = c("shell", "cloud")){
221 |   ng <- dim( pan.matrix )[1]
222 |   nf <- dim( pan.matrix )[2]
223 |   pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
224 |   cs <- colSums(pan.matrix)
225 |   
226 |   midx <- grep(type[1], c("shell", "cloud"))
227 |   if(length(midx) == 0){
228 |     warning("Unknown weighting:", type, ", using shell weights")
229 |     midx <- 1
230 |   }
231 |   W <- rep(1, ncol(pan.matrix))
232 |   x <- 1:nrow(pan.matrix)
233 |   ww <- 1 / (1 + exp(((x - 1) - (max(x) - 1)/2) / ((max(x) - 1) / 10)))
234 |   if(midx == 1) ww <- 1 - ww
235 |   for(i in x) W[cs == i] <- ww[i]
236 |   return(W)
237 | }
238 | 


--------------------------------------------------------------------------------
/R/hmmer3.R:
--------------------------------------------------------------------------------
  1 | #' @name hmmerScan
  2 | #' @title Scanning a profile Hidden Markov Model database
  3 | #' 
  4 | #' @description Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3
  5 | #' software.
  6 | #' 
  7 | #' @param in.files A character vector of file names.
  8 | #' @param dbase The full path-name of the database to scan (text).
  9 | #' @param out.folder The name of the folder to put the result files.
 10 | #' @param threads Number of CPU's to use.
 11 | #' @param verbose Logical indicating if textual output should be given to monitor the progress.
 12 | #' 
 13 | #' @details The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM)
 14 | #' describing patterns in biological sequences (Eddy, 2008). This function will make calls to the
 15 | #' HMMER3 software to scan FASTA files of proteins against a pHMM database. 
 16 | #' 
 17 | #' The files named in \samp{in.files} must contain FASTA formatted protein sequences. These files
 18 | #' should be prepared by \code{\link{panPrep}} to make certain each sequence, as well as the file name,
 19 | #' has a GID-tag identifying their genome. The database named in \samp{db} must be a HMMER3 formatted
 20 | #' database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see
 21 | #' the HMMER3 documentation for help.
 22 | #' 
 23 | #' \code{\link{hmmerScan}} will query every input file against the named database. The database contains
 24 | #' profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every
 25 | #' input file is scanned to see if some of the patterns can be matched to some degree. Each input file
 26 | #' results in an output file with the same GID-tag in the name. The result files give tabular output, and
 27 | #' are plain text files. See \code{\link{readHmmer}} for how to read the results into R.
 28 | #' 
 29 | #' Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set
 30 | #' up to use only 1 cpu per scan by default. By increasing \code{threads} you can utilize multiple CPUs, typically
 31 | #' on a computing cluster.
 32 | #' Our experience is that from a multi-core laptop it is better to start this function in default mode
 33 | #' from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel
 34 | #' sessions can write results to the same folder.
 35 | #' 
 36 | #' @return This function produces files in the folder specified by \samp{out.folder}. Existing files are
 37 | #' never overwritten by \code{\link{hmmerScan}}, if you want to re-compute something, delete the
 38 | #' corresponding result files first.
 39 | #' 
 40 | #' @references Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies
 41 | #' Statistical Significance Estimation. PLoS Computational Biology, 4(5).
 42 | #' 
 43 | #' @note The HMMER3 software must be installed on the system for this function to work, i.e. the command
 44 | #' \samp{system("hmmscan -h")} must be recognized as a valid command if you run it in the Console window.
 45 | #' 
 46 | #' @author Lars Snipen and Kristian Hovde Liland.
 47 | #' 
 48 | #' @seealso \code{\link{panPrep}}, \code{\link{readHmmer}}.
 49 | #' 
 50 | #' @examples
 51 | #' \dontrun{
 52 | #' # This example require the external software HMMER
 53 | #' # Using example files in this package
 54 | #' pf <- file.path(path.package("micropan"), "extdata", "xmpl_GID1.faa.xz")
 55 | #' dbf <- file.path(path.package("micropan"), "extdata",
 56 | #'                  str_c("microfam.hmm", c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz")))
 57 | #' 
 58 | #' # We need to uncompress them first...
 59 | #' prot.file <- tempfile(pattern = "GID1.faa", fileext=".xz")
 60 | #' ok <- file.copy(from = pf, to = prot.file)
 61 | #' prot.file <- xzuncompress(prot.file)
 62 | #' db.files <- str_c(tempfile(), c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"))
 63 | #' ok <- file.copy(from = dbf, to = db.files)
 64 | #' db.files <- unlist(lapply(db.files, xzuncompress))
 65 | #' db.name <- str_remove(db.files[1], "\\.[a-z0-9]+$")
 66 | #' 
 67 | #' # Scanning the FASTA file against microfam.hmm...
 68 | #' hmmerScan(in.files = prot.file, dbase = db.name, out.folder = ".")
 69 | #'
 70 | #' # Reading results
 71 | #' hmm.file <- file.path(".", str_c("GID1_vs_", basename(db.name), ".txt"))
 72 | #' hmm.tbl <- readHmmer(hmm.file)
 73 | #' 
 74 | #' # ...and cleaning...
 75 | #' ok <- file.remove(prot.file)
 76 | #' ok <- file.remove(str_remove(db.files, ".xz"))
 77 | #' }
 78 | #' 
 79 | #' @importFrom stringr str_c str_extract
 80 | #' 
 81 | #' @export hmmerScan
 82 | #' 
 83 | hmmerScan <- function(in.files, dbase, out.folder, threads = 0, verbose = TRUE){
 84 |   if(length(dbase) > 1){
 85 |     stop("Argument dbase must be a single text")
 86 |   }
 87 |   if(available.external("hmmer")){
 88 |     log.fil <- file.path(out.folder, "log.txt")
 89 |     basic <- paste("hmmscan -o", log.fil,"--cut_ga --noali --cpu", threads)
 90 |     rbase <- str_c("_vs_", basename(dbase), ".txt")
 91 |     gids <- str_extract(in.files, "GID[0-9]+")
 92 |     in.files <- normalizePath(in.files)
 93 |     for(i in 1:length(in.files)){
 94 |       rname <- str_c(gids[i], rbase)
 95 |       res.files <- list.files(out.folder)
 96 |       if(!(rname %in% res.files)){
 97 |         if(verbose) cat("hmmerScan: Scanning file", i, "out of", length(in.files), "...\r")
 98 |         cmd <- paste(basic,
 99 |                      "--domtblout", file.path(out.folder, rname),
100 |                      dbase,
101 |                      in.files[i])
102 |         system(cmd)
103 |         ok <- file.remove(log.fil)
104 |       }
105 |     }
106 |   }
107 | }
108 | 
109 | 
110 | 
111 | #' @name readHmmer
112 | #' @title Reading results from a HMMER3 scan
113 | #' 
114 | #' @description Reading a text file produced by \code{\link{hmmerScan}}.
115 | #' 
116 | #' @param hmmer.file The name of a \code{\link{hmmerScan}} result file.
117 | #' @param e.value Numeric threshold, hits with E-value above this are ignored (default is 1.0).
118 | #' @param use.acc Logical indicating if accession numbers should be used to identify the hits.
119 | #' 
120 | #' @details The function reads a text file produced by \code{\link{hmmerScan}}. By specifying a smaller
121 | #' \samp{e.value} you filter out poorer hits, and fewer results are returned. The option \samp{use.acc}
122 | #' should be turned off (FALSE) if you scan against your own database where accession numbers are lacking.
123 | #' 
124 | #' @return The results are returned in a \samp{tibble} with columns \samp{Query}, \samp{Hit},
125 | #' \samp{Evalue}, \samp{Score}, \samp{Start}, \samp{Stop} and \samp{Description}. \samp{Query} is the tag
126 | #' identifying each query sequence. \samp{Hit} is the name or accession number for a pHMM in the database
127 | #' describing patterns. The \samp{Evalue} is the \samp{ievalue} in the HMMER3 terminology. The \samp{Score}
128 | #' is the HMMER3 score for the match between \samp{Query} and \samp{Hit}. The \samp{Start} and \samp{Stop}
129 | #' are the positions within the \samp{Query} where the \samp{Hit} (pattern) starts and stops.
130 | #' \samp{Description} is the description of the \samp{Hit}. There is one line for each hit. 
131 | #' 
132 | #' @author Lars Snipen and Kristian Hovde Liland.
133 | #' 
134 | #' @seealso \code{\link{hmmerScan}}, \code{\link{hmmerCleanOverlap}}, \code{\link{dClust}}.
135 | #' 
136 | #' @examples
137 | #' # See the examples in the Help-files for dClust and hmmerScan.
138 | #' 
139 | #' @importFrom stringr str_detect str_split str_c str_replace_all
140 | #' @importFrom tibble tibble
141 | #' @importFrom dplyr %>% 
142 | #' @importFrom rlang .data
143 | #' 
144 | #' @export readHmmer
145 | #' 
146 | readHmmer <- function(hmmer.file, e.value = 1, use.acc = TRUE){
147 |   hmmer.file <- normalizePath(hmmer.file)
148 |   lines <- readLines(hmmer.file)
149 |   subset(lines, !str_detect(lines, "^\\#")) %>%
150 |     str_replace_all("[ ]+", " ") %>% 
151 |     str_split(pattern = " ") -> lst
152 |   if(use.acc){
153 |     hit <- sapply(lst, function(x){x[2]})
154 |   } else {
155 |     hit <- sapply(lst, function(x){x[1]})
156 |   }
157 |   tibble(Query  = sapply(lst, function(x){x[4]}),
158 |          Hit    = hit,
159 |          Evalue = as.numeric(sapply(lst, function(x){x[13]})),
160 |          Score  = as.numeric(sapply(lst, function(x){x[14]})),
161 |          Start  = as.numeric(sapply(lst, function(x){x[18]})),
162 |          Stop   = as.numeric(sapply(lst, function(x){x[19]})),
163 |          Description = sapply(lst, function(x){str_c(x[23:length(x)], collapse = " ")})) %>% 
164 |     filter(.data$Evalue <= e.value) -> hmmer.tbl
165 |   return(hmmer.tbl)
166 | }
167 | 
168 | 


--------------------------------------------------------------------------------
/R/micropan.R:
--------------------------------------------------------------------------------
 1 | #' @name micropan
 2 | #' @aliases micropan-package
 3 | #' @title Microbial Pan-Genome Analysis
 4 | #' 
 5 | #' @description A collection of functions for computations and visualizations of microbial pan-genomes.
 6 | #' Some of the functions make use of external software that needs to be installed on the system, see the
 7 | #' package vignette for more details on this.
 8 | #' 
 9 | #' 
10 | #' @author Lars Snipen and Kristian Hovde Liland.
11 | #' 
12 | #' Maintainer: Lars Snipen <lars.snipen@nmbu.no>
13 | #' 
14 | #' @references
15 | #' Snipen, L., Liland, KH. (2015). micropan: an R-package for microbial pan-genomics. BMC Bioinformatics, 16:79.
16 | #' 
17 | NULL


--------------------------------------------------------------------------------
/R/panmat.R:
--------------------------------------------------------------------------------
 1 | #' @name panMatrix
 2 | #' @title Computing the pan-matrix for a set of gene clusters
 3 | #' 
 4 | #' @description A pan-matrix has one row for each genome and one column for each gene cluster, and
 5 | #' cell \samp{[i,j]} indicates how many members genome \samp{i} has in gene family \samp{j}.
 6 | #' 
 7 | #' @param clustering A named vector of integers.
 8 | #' 
 9 | #' @details The pan-matrix is a central data structure for pan-genomic analysis. It is a matrix with
10 | #' one row for each genome in the study, and one column for each gene cluster. Cell \samp{[i,j]}
11 | #' contains an integer indicating how many members genome \samp{i} has in cluster \samp{j}.
12 | #' 
13 | #' The input \code{clustering} must be a named integer vector with one element for each sequence in the study,
14 | #' typically produced by either \code{\link{bClust}} or \code{\link{dClust}}. The name of each element
15 | #' is a text identifying every sequence. The value of each element indicates the cluster, i.e. those
16 | #' sequences with identical values are in the same cluster. IMPORTANT: The name of each sequence must
17 | #' contain the \samp{genome_id} for each genome, i.e. they must of the form \samp{GID111_seq1}, \samp{GID111_seq2},...
18 | #' where the \samp{GIDxxx} part indicates which genome the sequence belongs to. See \code{\link{panPrep}}
19 | #' for details.
20 | #' 
21 | #' The rows of the pan-matrix is named by the \samp{genome_id} for every genome. The columns are just named
22 | #' \samp{Cluster_x} where \samp{x} is an integer copied from \samp{clustering}.
23 | #' 
24 | #' @return An integer matrix with a row for each genome and a column for each sequence cluster.
25 | #' The input vector \samp{clustering} is attached as the attribute \samp{clustering}.
26 | #' 
27 | #' @author Lars Snipen and Kristian Hovde Liland.
28 | #' 
29 | #' @seealso \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{distManhattan}},
30 | #' \code{\link{distJaccard}}, \code{\link{fluidity}}, \code{\link{chao}},
31 | #' \code{\link{binomixEstimate}}, \code{\link{heaps}}, \code{\link{rarefaction}}.
32 | #' 
33 | #' @examples 
34 | #' # Loading clustering data in this package
35 | #' data(xmpl.bclst)
36 | #' 
37 | #' # Pan-matrix based on the clustering
38 | #' panmat <- panMatrix(xmpl.bclst)
39 | #' 
40 | #' \dontrun{
41 | #' # Plotting cluster distribution
42 | #' library(ggplot2)
43 | #' tibble(Clusters = as.integer(table(factor(colSums(panmat > 0), levels = 1:nrow(panmat)))),
44 | #'        Genomes = 1:nrow(panmat)) %>% 
45 | #' ggplot(aes(x = Genomes, y = Clusters)) +
46 | #' geom_col()
47 | #' }
48 | #' 
49 | #' @importFrom stringr str_extract str_c
50 | #' 
51 | #' @export panMatrix
52 | #' 
53 | panMatrix <- function(clustering){
54 |   gids <- str_extract(names(clustering), "GID[0-9]+")
55 |   ugids <- sort(unique(gids))
56 |   uclst <- sort(unique(clustering))
57 |   pan.matrix <- matrix(0, nrow = length(ugids), ncol = length(uclst))
58 |   rownames(pan.matrix) <- ugids
59 |   colnames(pan.matrix) <- str_c("Cluster", uclst)
60 |   for(i in 1:length(ugids)){
61 |     tb <- table(clustering[gids == ugids[i]])
62 |     idd <- as.numeric(names(tb))
63 |     idx <- which(uclst %in% idd)
64 |     pan.matrix[i,idx] <- tb
65 |   }
66 |   attr(pan.matrix, "clustering") <- clustering
67 |   return(pan.matrix)
68 | }
69 | 
70 | 


--------------------------------------------------------------------------------
/R/panpca.R:
--------------------------------------------------------------------------------
 1 | #' @name panPca
 2 | #' @title Principal component analysis of a pan-matrix
 3 | #' 
 4 | #' @description Computes a principal component decomposition of a pan-matrix, with possible
 5 | #' scaling and weightings.
 6 | #' 
 7 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
 8 | #' @param scale An optional scale to control how copy numbers should affect the distances.
 9 | #' @param weights Vector of optional weights of gene clusters.
10 | #' 
11 | #' @details A principal component analysis (PCA) can be computed for any matrix, also a pan-matrix.
12 | #' The principal components will in this case be linear combinations of the gene clusters. One major
13 | #' idea behind PCA is to truncate the space, e.g. instead of considering the genomes as points in a
14 | #' high-dimensional space spanned by all gene clusters, we look for a few \sQuote{smart} combinations
15 | #' of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions.
16 | #' 
17 | #' The \samp{scale} can be used to control how copy number differences play a role in the PCA. Usually
18 | #' we assume that going from 0 to 1 copy of a gene is the big change of the genome, and going from 1 to
19 | #' 2 (or more) copies is less. Prior to computing the PCA, the \samp{pan.matrix} is transformed according
20 | #' to the following affine mapping: If the original value in \samp{pan.matrix} is \samp{x}, and \samp{x}
21 | #' is not 0, then the transformed value is \samp{1 + (x-1)*scale}. Note that with \samp{scale=0.0}
22 | #' (default) this will result in 1 regardless of how large \samp{x} was. In this case the PCA only
23 | #' distinguish between presence and absence of gene clusters. If \samp{scale=1.0} the value \samp{x} is
24 | #' left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between
25 | #' 1 copy and 0 copies. For any \samp{scale} between 0.0 and 1.0 the transformed value is shrunk towards
26 | #' 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA
27 | #' should be affected, and to what degree, by differences in copy numbers beyond 1.
28 | #' 
29 | #' The PCA may also up- or downweight some clusters compared to others. The vector \samp{weights} must
30 | #' contain one value for each column in \samp{pan.matrix}. The default is to use flat weights, i.e. all
31 | #' clusters count equal. See \code{\link{geneWeights}} for alternative weighting strategies.
32 | #' 
33 | #' @return A \code{list} with three tables:
34 | #' 
35 | #' \samp{Evar.tbl} has two columns, one listing the component number and one listing the relative 
36 | #' explained variance for each component. The relative explained variance always sums to 1.0 over
37 | #' all components. This value indicates the importance of each component, and it is always in
38 | #' descending order, the first component being the most important.
39 | #' This is typically the first result you look at after a PCA has been computed, as it indicates
40 | #' how many components (directions) you need to capture the bulk of the total variation in the data.
41 | #' 
42 | #' \samp{Scores.tbl} has a column listing the \samp{GID.tag} for each genome, and then one column for each
43 | #' principal component. The columns are ordered corresponding to the elements in \samp{Evar}. The
44 | #' scores are the coordinates of each genome in the principal component space.
45 | #' 
46 | #' \samp{Loadings.tbl} is similar to \samp{Scores.tbl} but contain values for each gene cluster
47 | #' instead of each genome. The columns are ordered corresponding to the elements in \samp{Evar}.
48 | #' The loadings are the contributions from each gene cluster to the principal component directions.
49 | #' NOTE: Only gene clusters having a non-zero variance is used in a PCA. Gene clusters with the
50 | #' same value for every genome have no impact and are discarded from the \samp{Loadings}.
51 | #' 
52 | #' @author Lars Snipen and Kristian Hovde Liland.
53 | #' 
54 | #' @seealso \code{\link{distManhattan}}, \code{\link{geneWeights}}.
55 | #' 
56 | #' @examples 
57 | #' # Loading a pan-matrix in this package
58 | #' data(xmpl.panmat)
59 | #' 
60 | #' # Computing panPca
61 | #' ppca <- panPca(xmpl.panmat)
62 | #' 
63 | #' \dontrun{
64 | #' # Plotting explained variance
65 | #' library(ggplot2)
66 | #' ggplot(ppca$Evar.tbl) +
67 | #'   geom_col(aes(x = Component, y = Explained.variance))
68 | #' # Plotting scores
69 | #' ggplot(ppca$Scores.tbl) +
70 | #'   geom_text(aes(x = PC1, y = PC2, label = GID.tag))
71 | #' # Plotting loadings
72 | #' ggplot(ppca$Loadings.tbl) +
73 | #'   geom_text(aes(x = PC1, y = PC2, label = Cluster))
74 | #' }
75 | #' 
76 | #' @importFrom tibble as_tibble tibble
77 | #' 
78 | #' @export panPca
79 | #' 
80 | panPca <- function(pan.matrix, scale = 0.0, weights = rep(1, ncol(pan.matrix))){
81 |   if((scale > 1) | (scale < 0)){
82 |     warning("scale should be between 0.0 and 1.0, using scale=0.0")
83 |     scale <- 0.0
84 |   }
85 |   idx <- which(pan.matrix > 0, arr.ind = T)
86 |   pan.matrix[idx] <- 1 + (pan.matrix[idx] - 1) * scale
87 |   pan.matrix <- t(t(pan.matrix) * weights)
88 |   X <- pan.matrix[,which(apply(pan.matrix, 2, sd) > 0)]
89 |   pca <- prcomp(X)
90 |   pca.lst <- list(Evar.tbl     = tibble(Component = 1:length(pca$sdev),
91 |                                         Explained.variance = pca$sdev^2/sum(pca$sdev^2)),
92 |                   Scores.tbl   = as_tibble(pca$x, rownames = "GID.tag"),
93 |                   Loadings.tbl = as_tibble(pca$rotation, rownames = "Cluster"))
94 |   return(pca.lst)
95 | }
96 | 
97 | 


--------------------------------------------------------------------------------
/R/panprep.R:
--------------------------------------------------------------------------------
 1 | #' @name panPrep
 2 | #' @title Preparing FASTA files for pan-genomics
 3 | #' 
 4 | #' @description Preparing a FASTA file before starting comparisons of sequences.
 5 | #' 
 6 | #' @usage panPrep(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = "")
 7 | #' 
 8 | #' @param in.file The name of a FASTA formatted file with protein or nucleotide sequences for coding
 9 | #' genes in a genome.
10 | #' @param genome_id The Genome Identifier, see below.
11 | #' @param out.file Name of file where the prepared sequences will be written.
12 | #' @param protein Logical, indicating if the \samp{in.file} contains protein (\code{TRUE}) or
13 | #' nucleotide (\code{FALSE}) sequences.
14 | #' @param min.length Minimum sequence length
15 | #' @param discard A text, a regular expression, and sequences having a match against this in their
16 | #' headerline will be discarded.
17 | #' 
18 | #' @details This function will read the \code{in.file} and produce another, slightly modified, FASTA file
19 | #' which is prepared for the comparisons using \code{\link{blastpAllAll}}, \code{\link{hmmerScan}}
20 | #' or any other method.
21 | #' 
22 | #' The main purpose of \code{\link{panPrep}} is to make certain every sequence is labeled with a tag
23 | #' called a \samp{genome_id} identifying the genome from which it comes. This text contains the text
24 | #' \dQuote{GID} followed by an integer. This integer can be any integer as long as it is unique to every
25 | #' genome in the study. If a genome has the text \dQuote{GID12345} as identifier, then the
26 | #' sequences in the file produced by \code{\link{panPrep}} will have headerlines starting with
27 | #' \dQuote{GID12345_seq1}, \dQuote{GID12345_seq2}, \dQuote{GID12345_seq3}...etc. This makes it possible
28 | #' to quickly identify which genome every sequence belongs to.
29 | #' 
30 | #' The \samp{genome_id} is also added to the file name specified in \samp{out.file}. For this reason the
31 | #' \samp{out.file} must have a file extension containing letters only. By convention, we expect FASTA
32 | #' files to have one of the extensions \samp{.fsa}, \samp{.faa}, \samp{.fa} or \samp{.fasta}.
33 | #' 
34 | #' \code{\link{panPrep}} will also remove sequences shorter than \code{min.length}, removing stop codon
35 | #' symbols (\samp{*}), replacing alien characters with \samp{X} and converting all sequences to upper-case.
36 | #' If the input \samp{discard} contains a regular expression, any sequences having a match to this in their
37 | #' headerline are also removed. Example: If we use the \code{prodigal} software (see \code{\link[microseq]{findGenes}})
38 | #' to find proteins in a genome, partially predicted genes will have the text \samp{partial=10} or
39 | #' \samp{partial=01} in their headerline. Using \samp{discard= "partial=01|partial=10"} will remove
40 | #' these from the data set.
41 | #' 
42 | #' @return This function produces a FASTA formatted sequence file, and returns the name of this file.
43 | #' 
44 | #' @author Lars Snipen and Kristian Liland.
45 | #' 
46 | #' @seealso \code{\link{hmmerScan}}, \code{\link{blastpAllAll}}.
47 | #' 
48 | #' @examples
49 | #' # Using a protein file in this package
50 | #' # We need to uncompress it first...
51 | #' pf <- file.path(path.package("micropan"),"extdata","xmpl.faa.xz")
52 | #' prot.file <- tempfile(fileext = ".xz")
53 | #' ok <- file.copy(from = pf, to = prot.file)
54 | #' prot.file <- xzuncompress(prot.file)
55 | #' 
56 | #' # Prepping it, using the genome_id "GID123"
57 | #' prepped.file <- panPrep(prot.file, genome_id = "GID123", out.file = tempfile(fileext = ".faa"))
58 | #' 
59 | #' # Reading the prepped file
60 | #' prepped <- readFasta(prepped.file)
61 | #' head(prepped)
62 | #' 
63 | #' # ...and cleaning...
64 | #' ok <- file.remove(prot.file, prepped.file)
65 | #' 
66 | #' @importFrom microseq readFasta writeFasta
67 | #' @importFrom dplyr mutate filter %>% n
68 | #' @importFrom stringr str_remove_all str_length str_c str_extract str_replace
69 | #' @importFrom rlang .data
70 | #' 
71 | #' @export panPrep
72 | #' 
73 | panPrep <- function(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = ""){
74 |   if(protein){
75 |     alien <- "[^ARNDCQEGHILKMFPSTWYV]"
76 |   } else {
77 |     alien <- "[^ACGT]"
78 |   }
79 |   readFasta(normalizePath(in.file)) %>% 
80 |     mutate(Sequence = toupper(.data$Sequence)) %>% 
81 |     mutate(Sequence = str_remove_all(.data$Sequence, "\\*")) %>% 
82 |     mutate(Length = str_length(.data$Sequence)) %>% 
83 |     filter(.data$Length >= min.length) %>% 
84 |     mutate(Sequence = str_replace_all(.data$Sequence, pattern = alien, "X")) %>% 
85 |     mutate(Header = str_c(genome_id, "_seq", 1:n(), " ", .data$Header)) -> fdta
86 |   if(str_length(discard) > 0){
87 |     fdta %>%
88 |       filter(!str_detect(.data$Header, pattern = discard)) -> fdta
89 |   }
90 |   out.file <- file.path(normalizePath(dirname(out.file)),
91 |                         basename(out.file))
92 |   fext <- str_extract(out.file, "\\.[a-zA-Z]+$")
93 |   out.file <- str_replace(out.file, str_c(fext, "$"), str_c("_", genome_id, fext))
94 |   writeFasta(fdta, out.file = out.file)
95 |   return(out.file)
96 | }


--------------------------------------------------------------------------------
/R/powerlaw.R:
--------------------------------------------------------------------------------
  1 | #' @name chao
  2 | #' @title The Chao lower bound estimate of pan-genome size
  3 | #' 
  4 | #' @description Computes the Chao lower bound estimated number of gene clusters in a pan-genome.
  5 | #' 
  6 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
  7 | #' 
  8 | #' @details The size of a pan-genome is the number of gene clusters in it, both those observed and those
  9 | #' not yet observed.
 10 | #' 
 11 | #' The input \samp{pan.matrix} is a a matrix with one row for each
 12 | #' genome and one column for each observed gene cluster in the pan-genome. See \code{\link{panMatrix}}
 13 | #' for how to construct this.
 14 | #' 
 15 | #' The number of observed gene clusters is simply the number of columns in \samp{pan.matrix}. The
 16 | #' number of gene clusters not yet observed is estimated by the Chao lower bound estimator (Chao, 1987).
 17 | #' This is based solely on the number of clusters observed in 1 and 2 genomes. It is a very simple and
 18 | #' conservative estimator, i.e. it is more likely to be too small than too large. 
 19 | #' 
 20 | #' @return The function returns an integer, the estimated pan-genome size. This includes both the number
 21 | #' of gene clusters observed so far, as well as the estimated number not yet seen.
 22 | #' 
 23 | #' @references Chao, A. (1987). Estimating the population size for capture-recapture data with unequal
 24 | #' catchability. Biometrics, 43:783-791.
 25 | #' 
 26 | #' @author Lars Snipen and Kristian Hovde Liland.
 27 | #' 
 28 | #' @seealso \code{\link{panMatrix}}, \code{\link{binomixEstimate}}.
 29 | #' 
 30 | #' @examples 
 31 | #' # Loading a pan-matrix in this package
 32 | #' data(xmpl.panmat)
 33 | #' 
 34 | #' # Estimating the pan-genome size using the Chao estimator
 35 | #' chao.pansize <- chao(xmpl.panmat)
 36 | #' 
 37 | #' @export chao
 38 | #' 
 39 | chao <- function(pan.matrix){  
 40 |   y <- table(factor(colSums(pan.matrix > 0), levels = 1:nrow(pan.matrix)))
 41 |   if(y[2] == 0){
 42 |     stop( "Cannot compute Chao estimate since there are 0 gene clusters observed in 2 genomes!\n" )
 43 |   } else {
 44 |     pan.size <- round(sum(y) + y[1]^2/(2*y[2]))
 45 |     names(pan.size) <- NULL
 46 |     return(pan.size)
 47 |   }
 48 | }
 49 | 
 50 | 
 51 | #' @name heaps
 52 | #' @title Heaps law estimate
 53 | #' 
 54 | #' @description Estimating if a pan-genome is open or closed based on a Heaps law model.
 55 | #' 
 56 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
 57 | #' @param n.perm The number of random permutations of genome ordering.
 58 | #' 
 59 | #' @details An open pan-genome means there will always be new gene clusters observed as long as new genomes
 60 | #' are being sequenced. This may sound controversial, but in a pragmatic view, an open pan-genome indicates
 61 | #' that the number of new gene clusters to be observed in future genomes is \sQuote{large} (but not literally
 62 | #' infinite). Opposite, a closed pan-genome indicates we are approaching the end of new gene clusters. 
 63 | #' 
 64 | #' This function is based on a Heaps law approach suggested by Tettelin et al (2008). The Heaps law model
 65 | #' is fitted to the number of new gene clusters observed when genomes are ordered in a random way. The model
 66 | #' has two parameters, an intercept and a decay parameter called \samp{alpha}. If \samp{alpha>1.0} the
 67 | #' pan-genome is closed, if \samp{alpha<1.0} it is open.
 68 | #' 
 69 | #' The number of permutations, \samp{n.perm}, should be as large as possible, limited by computation time.
 70 | #' The default value of 100 is certainly a minimum.
 71 | #' 
 72 | #' Word of caution: The Heaps law assumes independent sampling. If some of the genomes in the data set
 73 | #' form distinct sub-groups in the population, this may affect the results of this analysis severely.
 74 | #' 
 75 | #' @return A vector of two estimated parameters: The \samp{Intercept} and the decay parameter \samp{alpha}.
 76 | #' If \samp{alpha<1.0} the pan-genome is open, if \samp{alpha>1.0} it is closed.
 77 | #' 
 78 | #' @references Tettelin, H., Riley, D., Cattuto, C., Medini, D. (2008). Comparative genomics: the
 79 | #' bacterial pan-genome. Current Opinions in Microbiology, 12:472-477.
 80 | #' 
 81 | #' @author Lars Snipen and Kristian Hovde Liland.
 82 | #' 
 83 | #' @seealso \code{\link{binomixEstimate}}, \code{\link{chao}}, \code{\link{rarefaction}}.
 84 | #' 
 85 | #' @examples 
 86 | #' # Loading a pan-matrix in this package 
 87 | #' data(xmpl.panmat)
 88 | #' 
 89 | #' # Estimating population openness
 90 | #' h.est <- heaps(xmpl.panmat, n.perm = 500)
 91 | #' print(h.est)
 92 | #' # If alpha < 1 it indicates an open pan-genome
 93 | #' 
 94 | #' @importFrom stats optim
 95 | #' 
 96 | #' @export
 97 | heaps <- function(pan.matrix, n.perm = 100){
 98 |   pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
 99 |   ng <- dim(pan.matrix)[1]
100 |   nmat <- matrix(0, nrow = nrow(pan.matrix) - 1, ncol = n.perm)
101 |   for(i in 1:n.perm){
102 |     cm <- apply(pan.matrix[sample(nrow(pan.matrix)),], 2, cumsum)
103 |     nmat[,i] <- rowSums((cm == 1)[2:ng,] & (cm == 0)[1:(ng-1),])
104 |     cat(i, "/", n.perm, "\r")
105 |   }
106 |   x <- rep((2:nrow(pan.matrix)), times = n.perm)
107 |   y <- as.numeric(nmat)
108 |   p0 <- c(mean(y[which(x == 2)] ), 1)
109 |   fit <- optim(p0, objectFun, gr = NULL, x, y, method = "L-BFGS-B", lower = c(0, 0), upper = c(10000, 2))
110 |   p.hat <- fit$par
111 |   names(p.hat) <- c("Intercept", "alpha")
112 |   return(p.hat)
113 | }
114 | 
115 | objectFun <- function(p, x, y){
116 |   y.hat <- p[1] * x^(-p[2])
117 |   J <- sqrt(sum((y - y.hat)^2))/length(x)
118 |   return(J)
119 | }
120 | 


--------------------------------------------------------------------------------
/R/rarefaction.R:
--------------------------------------------------------------------------------
 1 | #' @name rarefaction
 2 | #' @title Rarefaction curves for a pan-genome
 3 | #' 
 4 | #' @description Computes rarefaction curves for a number of random permutations of genomes.
 5 | #' 
 6 | #' @param pan.matrix A pan-matrix, see \code{\link{panMatrix}} for details.
 7 | #' @param n.perm The number of random genome orderings to use. If \samp{n.perm=1} the fixed order of
 8 | #' the genomes in \samp{pan.matrix} is used.
 9 | #' 
10 | #' @details A rarefaction curve is simply the cumulative number of unique gene clusters we observe as
11 | #' more and more genomes are being considered. The shape of this curve will depend on the order of the
12 | #' genomes. This function will typically compute rarefaction curves for a number of (\samp{n.perm})
13 | #' orderings. By using a large number of permutations, and then averaging over the results, the effect
14 | #' of any particular ordering is smoothed.
15 | #' 
16 | #' The averaged curve illustrates how many new gene clusters we observe for each new genome. If this
17 | #' levels out and becomes flat, it means we expect few, if any, new gene clusters by sequencing more
18 | #' genomes. The function \code{\link{heaps}} can be used to estimate population openness based on this
19 | #' principle.
20 | #' 
21 | #' @return A table with the curves in the columns. The first column is the number of genomes, while 
22 | #' all other columns are the cumulative number of clusters, one column for each permutation.
23 | #' 
24 | #' @author Lars Snipen and Kristian Hovde Liland.
25 | #' 
26 | #' @seealso \code{\link{heaps}}, \code{\link{panMatrix}}.
27 | #' 
28 | #' @examples 
29 | #' # Loading a pan-matrix in this package 
30 | #' data(xmpl.panmat)
31 | #' 
32 | #' # Rarefaction
33 | #' rar.tbl <- rarefaction(xmpl.panmat, n.perm = 1000)
34 | #' 
35 | #' \dontrun{
36 | #' # Plotting
37 | #' library(ggplot2)
38 | #' library(tidyr)
39 | #' rar.tbl %>% 
40 | #'   gather(key = "Permutation", value = "Clusters", -Genome) %>% 
41 | #'   ggplot(aes(x = Genome, y = Clusters, group = Permutation)) +
42 | #'     geom_line()
43 | #' }
44 | #' 
45 | #' @importFrom dplyr %>% bind_cols
46 | #' @importFrom tibble tibble as_tibble
47 | #' 
48 | #' @export rarefaction
49 | #' 
50 | rarefaction <- function(pan.matrix, n.perm = 1){
51 |   pan.matrix[which(pan.matrix > 0, arr.ind = T)] <- 1
52 |   nmat <- matrix(0, nrow = nrow(pan.matrix), ncol = n.perm)
53 |   cm <- apply(pan.matrix, 2, cumsum)
54 |   nmat[,1] <- rowSums(cm > 0)
55 |   if(n.perm > 1){
56 |     for(i in 2:n.perm){
57 |       cm <- apply(pan.matrix[sample(nrow(pan.matrix)),], 2, cumsum)
58 |       nmat[,i] <- rowSums(cm > 0)
59 |       cat(i, "/", n.perm, "\r")
60 |     }
61 |   }
62 |   nmat <- rbind(rep(0, ncol(nmat)), nmat)
63 |   tibble(Genomes = 0:nrow(pan.matrix)) %>% 
64 |     bind_cols(as_tibble(nmat, .name_repair = "minimal")) -> rtbl
65 |   colnames(rtbl) <- c("Genome", str_c("Perm", 1:n.perm))
66 |   return(rtbl)
67 | }
68 | 
69 | 


--------------------------------------------------------------------------------
/R/xmpl.R:
--------------------------------------------------------------------------------
 1 | #' @name xmpl
 2 | #' @aliases xmpl xmpl.bdist xmpl.bclst xmpl.panmat
 3 | #' @docType data
 4 | #' @title Data sets for use in examples
 5 | #' 
 6 | #' @description This data set contains several files with various objects used in examples
 7 | #' in some of the functions in the \code{micropan} package.
 8 | #' 
 9 | #' @usage 
10 | #' data(xmpl.bdist)
11 | #' data(xmpl.bclst)
12 | #' data(xmpl.panmat)
13 | #' 
14 | #' @details 
15 | #' \samp{xmpl.bdist} is a \code{tibble} with 4 columns holding all
16 | #' BLAST distances between pairs of proteins in an example with 10 small genomes.
17 | #' 
18 | #' \samp{xmpl.bclst} is a clustering vector of all proteins in the
19 | #' genomes from \samp{xmpl.bdist}.
20 | #' 
21 | #' \samp{xmpl.panmat} is a pan-matrix with 10 rows and 1210 columns
22 | #' computed from \samp{xmpl.bclst}.
23 | #' 
24 | #' @author Lars Snipen and Kristian Hovde Liland.
25 | #' 
26 | #' @examples 
27 | #' 
28 | #' # BLAST distances, only the first 20 are displayed
29 | #' data(xmpl.bdist)
30 | #' head(xmpl.bdist)
31 | #' 
32 | #' # Clustering vector
33 | #' data(xmpl.bclst)
34 | #' print(xmpl.bclst[1:30])
35 | #' 
36 | #' # Pan-matrix
37 | #' data(xmpl.panmat)
38 | #' head(xmpl.panmat)
39 | #' 
40 | NULL


--------------------------------------------------------------------------------
/R/xz.R:
--------------------------------------------------------------------------------
  1 | #' @rdname xz
  2 | #' @name xzcompress
  3 | #' @title Compressing and uncompressing text files
  4 | #' 
  5 | #' @description These functions are adapted from the \code{R.utils} package from gzip to xz. Internally
  6 | #' \code{xzfile()} (see connections) is used to read (write) chunks to (from) the xz file. If the
  7 | #' process is interrupted before completed, the partially written output file is automatically removed.
  8 | #' 
  9 | #' @param filename Path name of input file.
 10 | #' @param destname Pathname of output file.
 11 | #' @param temporary If TRUE, the output file is created in a temporary directory.
 12 | #' @param skip If TRUE and the output file already exists, the output file is returned as is.
 13 | #' @param overwrite If TRUE and the output file already exists, the file is silently overwritting,
 14 | #' otherwise an exception is thrown (unless skip is TRUE).
 15 | #' @param remove If TRUE, the input file is removed afterward, otherwise not.
 16 | #' @param BFR.SIZE The number of bytes read in each chunk.
 17 | #' @param compression The compression level used (1-9).
 18 | #' @param ... Not used.
 19 | #' 
 20 | #' @return Returns the pathname of the output file. The number of bytes processed is returned as an attribute. 
 21 | #' 
 22 | #' @author Kristian Hovde Liland.
 23 | #' 
 24 | #' @examples
 25 | #' # Creating small file
 26 | #' tf <- tempfile()
 27 | #' cat(file=tf, "Hello world!")
 28 | #' 
 29 | #' # Compressing
 30 | #' tf.xz <- xzcompress(tf)
 31 | #' print(file.info(tf.xz))
 32 | #' 
 33 | #' # Uncompressing
 34 | #' tf <- xzuncompress(tf.xz)
 35 | #' print(file.info(tf))
 36 | #' file.remove(tf)
 37 | #' 
 38 | #' @export xzcompress
 39 | #' 
 40 | xzcompress <- function(filename, destname = sprintf("%s.xz", filename), temporary = FALSE, 
 41 |           skip = FALSE, overwrite = FALSE, remove = TRUE, BFR.SIZE = 1e+07, compression = 6,
 42 |           ...){
 43 |   if(!file.exists(filename)){
 44 |     stop("No such file: ", filename)
 45 |   }
 46 |   if(temporary){
 47 |     destname <- file.path(tempdir(), basename(destname))
 48 |   }
 49 |   attr(destname, "temporary") <- temporary
 50 |   if(filename == destname) 
 51 |     stop(sprintf("Argument 'filename' and 'destname' are identical: %s", filename))
 52 |   if(file.exists(destname)){
 53 |     if(skip){
 54 |       return(destname)
 55 |     } else if(!overwrite){
 56 |       stop(sprintf("File already exists: %s", destname))
 57 |     }
 58 |   }
 59 |   destpath <- dirname(destname)
 60 |   if(!file.info(destpath)$isdir) 
 61 |     dir.create(destpath)
 62 |   inn <- file(filename, open = "rb")
 63 |   on.exit(if(!is.null(inn)) close(inn))
 64 |   outComplete <- FALSE
 65 |   out <- xzfile(destname, open = "wb", compression = compression, ...)
 66 |   on.exit({
 67 |     close(out)
 68 |     if(!outComplete){
 69 |       file.remove(destname)
 70 |     }
 71 |   }, add = TRUE)
 72 |   nbytes <- 0L
 73 |   repeat{
 74 |     bfr <- readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE)
 75 |     n <- length(bfr)
 76 |     if(n == 0L) 
 77 |       break
 78 |     nbytes <- nbytes + n
 79 |     writeBin(bfr, con = out, size = 1L)
 80 |     bfr <- NULL
 81 |   }
 82 |   outComplete <- TRUE
 83 |   if(remove){
 84 |     close(inn)
 85 |     inn <- NULL
 86 |     file.remove(filename)
 87 |   }
 88 |   attr(destname, "nbrOfBytes") <- nbytes
 89 |   invisible(destname)
 90 | }
 91 | #' @rdname xz
 92 | #' @export xzuncompress
 93 | xzuncompress <- function(filename, destname = gsub("[.]xz$", "", filename, ignore.case = TRUE), 
 94 |           temporary = FALSE, skip = FALSE, overwrite = FALSE, remove = TRUE, 
 95 |           BFR.SIZE = 1e+07, ...){
 96 |   if(!file.exists(filename)){
 97 |     stop("No such file: ", filename)
 98 |   }
 99 |   if(temporary){
100 |     destname <- file.path(tempdir(), basename(destname))
101 |   }
102 |   attr(destname, "temporary") <- temporary
103 |   if(filename == destname){
104 |     stop(sprintf("Argument 'filename' and 'destname' are identical: %s", filename))
105 |   }
106 |   if(file.exists(destname)){
107 |     if(skip){
108 |       return(destname)
109 |     } else if(!overwrite){
110 |       stop(sprintf("File already exists: %s", destname))
111 |     }
112 |   }
113 |   destpath <- dirname(destname)
114 |   if(!file.info(destpath)$isdir) 
115 |     dir.create(destpath)
116 |   inn <- xzfile(filename, open = "rb")
117 |   on.exit(if(!is.null(inn)) close(inn))
118 |   outComplete <- FALSE
119 |   out <- file(destname, open = "wb")
120 |   on.exit({
121 |     close(out)
122 |     if (!outComplete) {
123 |       file.remove(destname)
124 |     }
125 |   }, add = TRUE)
126 |   nbytes <- 0L
127 |   repeat{
128 |     bfr <- readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE)
129 |     n <- length(bfr)
130 |     if(n == 0L) 
131 |       break
132 |     nbytes <- nbytes + n
133 |     writeBin(bfr, con = out, size = 1L)
134 |     bfr <- NULL
135 |   }
136 |   outComplete <- TRUE
137 |   if(remove){
138 |     close(inn)
139 |     inn <- NULL
140 |     file.remove(filename)
141 |   }
142 |   attr(destname, "nbrOfBytes") <- nbytes
143 |   invisible(destname)
144 | }
145 | 


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-10-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-10-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-15-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-15-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-16-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-16-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-20-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-20-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-21-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-21-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-26-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-26-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-27-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-27-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-28-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-28-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-29-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-29-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-30-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-30-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-31-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-31-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-32-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-32-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-8-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-8-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-gfm/unnamed-chunk-9-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-gfm/unnamed-chunk-9-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-10-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-10-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-14-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-14-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-16-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-16-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-18-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-18-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-20-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-20-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-21-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-21-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-25-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-25-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-26-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-26-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-27-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-27-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-28-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-28-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-29-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-29-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-30-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-30-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-31-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-31-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-32-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-32-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-7-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-7-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-8-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-8-1.png


--------------------------------------------------------------------------------
/Readme_files/figure-markdown_github/unnamed-chunk-9-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/Readme_files/figure-markdown_github/unnamed-chunk-9-1.png


--------------------------------------------------------------------------------
/data/xmpl.bclst.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/data/xmpl.bclst.rda


--------------------------------------------------------------------------------
/data/xmpl.bdist.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/data/xmpl.bdist.rda


--------------------------------------------------------------------------------
/data/xmpl.panmat.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/data/xmpl.panmat.rda


--------------------------------------------------------------------------------
/inst/extdata/GID1_vs_GID1_.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID1_vs_GID1_.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID1_vs_microfam.hmm.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID1_vs_microfam.hmm.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID2_vs_GID1_.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID2_vs_GID1_.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID2_vs_GID2_.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID2_vs_GID2_.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID2_vs_microfam.hmm.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID2_vs_microfam.hmm.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID3_vs_GID1_.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_GID1_.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID3_vs_GID2_.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_GID2_.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID3_vs_GID3_.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_GID3_.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/GID3_vs_microfam.hmm.txt.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/GID3_vs_microfam.hmm.txt.xz


--------------------------------------------------------------------------------
/inst/extdata/microfam.hmm.h3f.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3f.xz


--------------------------------------------------------------------------------
/inst/extdata/microfam.hmm.h3i.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3i.xz


--------------------------------------------------------------------------------
/inst/extdata/microfam.hmm.h3m.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3m.xz


--------------------------------------------------------------------------------
/inst/extdata/microfam.hmm.h3p.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/microfam.hmm.h3p.xz


--------------------------------------------------------------------------------
/inst/extdata/xmpl.faa.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl.faa.xz


--------------------------------------------------------------------------------
/inst/extdata/xmpl_GID1.faa.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl_GID1.faa.xz


--------------------------------------------------------------------------------
/inst/extdata/xmpl_GID2.faa.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl_GID2.faa.xz


--------------------------------------------------------------------------------
/inst/extdata/xmpl_GID3.faa.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/inst/extdata/xmpl_GID3.faa.xz


--------------------------------------------------------------------------------
/man/bClust.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/bclust.R
 3 | \name{bClust}
 4 | \alias{bClust}
 5 | \title{Clustering sequences based on pairwise distances}
 6 | \usage{
 7 | bClust(dist.tbl, linkage = "complete", threshold = 0.75, verbose = TRUE)
 8 | }
 9 | \arguments{
10 | \item{dist.tbl}{A \code{tibble} with pairwise distances.}
11 | 
12 | \item{linkage}{A text indicating what type of clustering to perform, either \samp{complete} (default),
13 | \samp{average} or \samp{single}.}
14 | 
15 | \item{threshold}{Specifies the maximum size of a cluster. Must be a distance, i.e. a number between
16 | 0.0 and 1.0.}
17 | 
18 | \item{verbose}{Logical, turns on/off text output during computations.}
19 | }
20 | \value{
21 | The function returns a vector of integers, indicating the cluster membership of every unique
22 | sequence from the \samp{Query} or \samp{Hit} columns of the input \samp{dist.tbl}. The name
23 | of each element indicates the sequence. The numerical values have no meaning as such, they are simply
24 | categorical indicators of cluster membership.
25 | }
26 | \description{
27 | Sequences are clustered by hierarchical clustering based on a set of pariwise distances.
28 | The distances must take values between 0.0 and 1.0, and all pairs \emph{not} listed are assumed to
29 | have distance 1.0.
30 | }
31 | \details{
32 | Computing clusters (gene families) is an essential step in many comparative studies.
33 | \code{\link{bClust}} will assign sequences into gene families by a hierarchical clustering approach.
34 | Since the number of sequences may be huge, a full all-against-all distance matrix will be impossible
35 | to handle in memory. However, most sequence pairs will have an \sQuote{infinite} distance between them,
36 | and only the pairs with a finite (smallish) distance need to be considered.
37 | 
38 | This function takes as input the distances in \code{dist.tbl} where only the relevant distances are
39 | listed. The columns \samp{Query} and \samp{Hit} contain tags identifying pairs of sequences. The column
40 | \samp{Distance} contains the distances, always a number from 0.0 to 1.0. Typically, this is the output
41 | from \code{\link{bDist}}. All pairs of sequences \emph{not} listed are assumed to have distance 1.0,
42 | which is considered the \sQuote{infinite} distance.
43 | All sequences must be listed at least once in ceither column \samp{Query} or \samp{Hit} of the \code{dist.tbl}.
44 | This should pose no problem, since all sequences must have distance 0.0 to themselves, and should be listed
45 | with this distance once (\samp{Query} and \samp{Hit} containing the same tag). 
46 | 
47 | The \samp{linkage} defines the type of clusters produced. The \samp{threshold} indicates the size of
48 | the clusters. A \samp{single} linkage clustering means all members of a cluster have at least one other
49 | member of the same cluster within distance \samp{threshold} from itself. An \samp{average} linkage means
50 | all members of a cluster are within the distance \samp{threshold} from the center of the cluster. A
51 | \samp{complete} linkage means all members of a cluster are no more than the distance \samp{threshold}
52 | away from any other member of the same cluster. 
53 | 
54 | Typically, \samp{single} linkage produces big clusters where members may differ a lot, since they are
55 | only required to be close to something, which is close to something,...,which is close to some other
56 | member. On the other extreme, \samp{complete} linkage will produce small and tight clusters, since all
57 | must be similar to all. The \samp{average} linkage is between, but closer to \samp{complete} linkage. If
58 | you want the \samp{threshold} to specify directly the maximum distance tolerated between two members of
59 | the same gene family, you must use \samp{complete} linkage. The \samp{single} linkage is the fastest
60 | alternative to compute. Using the default setting of \samp{single} linkage and maximum \samp{threshold}
61 | (1.0) will produce the largest and fewest clusters possible.
62 | }
63 | \examples{
64 | # Loading example BLAST distances
65 | data(xmpl.bdist)
66 | 
67 | # Clustering with default settings
68 | clst <- bClust(xmpl.bdist)
69 | # Other settings, and verbose
70 | clst <- bClust(xmpl.bdist, linkage = "average", threshold = 0.5, verbose = TRUE)
71 | 
72 | }
73 | \seealso{
74 | \code{\link{bDist}}, \code{\link{hclust}}, \code{\link{dClust}}, \code{\link{isOrtholog}}.
75 | }
76 | \author{
77 | Lars Snipen and Kristian Hovde Liland.
78 | }
79 | 


--------------------------------------------------------------------------------
/man/bDist.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/bdist.R
 3 | \name{bDist}
 4 | \alias{bDist}
 5 | \title{Computes distances between sequences}
 6 | \usage{
 7 | bDist(blast.files = NULL, blast.tbl = NULL, e.value = 1, verbose = TRUE)
 8 | }
 9 | \arguments{
10 | \item{blast.files}{A text vector of BLAST result filenames.}
11 | 
12 | \item{blast.tbl}{A table with BLAST results.}
13 | 
14 | \item{e.value}{A threshold E-value to immediately discard (very) poor BLAST alignments.}
15 | 
16 | \item{verbose}{Logical, indicating if textual output should be given to monitor the progress.}
17 | }
18 | \value{
19 | The function returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore}
20 | and \samp{Distance}. Each row corresponds to a pair of sequences (Dbase and Query sequences) having at least
21 | one BLAST hit between
22 | them. All pairs \emph{not} listed in the output have distance 1.0 between them.
23 | }
24 | \description{
25 | Computes distance between all sequences based on the BLAST bit-scores.
26 | }
27 | \details{
28 | The essential input is either a vector of BLAST result filenames (\code{blast.files}) or a
29 | table of the BLAST results (\code{blast.tbl}). It is no point in providing both, then \code{blast.tbl} is ignored.
30 | 
31 | For normal sized data sets (e.g. less than 100 genomes), you would provide the BLAST filenames as the argument
32 | \code{blast.files} to this function.
33 | Then results are read, and distances are computed. Only if you have huge data sets, you may find it more efficient to 
34 | read the files using \code{\link{readBlastSelf}} and \code{\link{readBlastPair}} separately, and then provide as the
35 | argument \code{blast.tbl]} the table you get from binding these results. In all cases, the BLAST result files must
36 | have been produced by \code{\link{blastpAllAll}}.
37 | 
38 | Setting a small \samp{e.value} threshold can speed up the computation and reduce the size of the
39 | output, but you may loose some alignments that could produce smallish distances for short sequences.
40 | 
41 | The distance computed is based on alignment bitscores. Assume the alignment of query A against hit B
42 | has a bitscore of S(A,B). The distance is D(A,B)=1-2*S(A,B)/(S(A,A)+S(B,B)) where S(A,A) and S(B,B) are
43 | the self-alignment bitscores, i.e. the scores of aligning against itself. A distance of
44 | 0.0 means A and B are identical. The maximum possible distance is 1.0, meaning there is no BLAST between A and B.
45 | 
46 | This distance should not be interpreted as lack of identity! A distance of 0.0 means 100\% identity,
47 | but a distance of 0.25 does \emph{not} mean 75\% identity. It has some resemblance to an evolutinary
48 | (raw) distance, but since it is based on protein alignments, the type of mutations plays a significant
49 | role, not only the number of mutations.
50 | }
51 | \examples{
52 | # Using BLAST result files in this package...
53 | prefix <- c("GID1_vs_GID1_",
54 |             "GID2_vs_GID1_",
55 |             "GID3_vs_GID1_",
56 |             "GID2_vs_GID2_",
57 |             "GID3_vs_GID2_",
58 |             "GID3_vs_GID3_")
59 | bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz"))
60 | 
61 | # We need to uncompress them first...
62 | blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz")
63 | ok <- file.copy(from = bf, to = blast.files)
64 | blast.files <- unlist(lapply(blast.files, xzuncompress))
65 | 
66 | # Computing pairwise distances
67 | blast.dist <- bDist(blast.files)
68 | 
69 | # Read files separately, then use bDist
70 | self.tbl <- readBlastSelf(blast.files)
71 | pair.tbl <- readBlastPair(blast.files)
72 | blast.dist <- bDist(blast.tbl = bind_rows(self.tbl, pair.tbl))
73 | 
74 | # ...and cleaning...
75 | ok <- file.remove(blast.files)
76 | 
77 | # See also example for blastpAl
78 | 
79 | }
80 | \seealso{
81 | \code{\link{blastpAllAll}}, \code{\link{readBlastSelf}}, \code{\link{readBlastPair}},
82 | \code{\link{bClust}}, \code{\link{isOrtholog}}.
83 | }
84 | \author{
85 | Lars Snipen and Kristian Hovde Liland.
86 | }
87 | 


--------------------------------------------------------------------------------
/man/binomixEstimate.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/binomix.R
  3 | \name{binomixEstimate}
  4 | \alias{binomixEstimate}
  5 | \title{Binomial mixture model estimates}
  6 | \usage{
  7 | binomixEstimate(
  8 |   pan.matrix,
  9 |   K.range = 3:5,
 10 |   core.detect.prob = 1,
 11 |   verbose = TRUE
 12 | )
 13 | }
 14 | \arguments{
 15 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
 16 | 
 17 | \item{K.range}{The range of model complexities to explore. The vector of integers specify the number
 18 | of binomial densities to combine in the mixture models.}
 19 | 
 20 | \item{core.detect.prob}{The detection probability of core genes. This should almost always be 1.0,
 21 | since a core gene is by definition always present in all genomes, but can be set fractionally smaller.}
 22 | 
 23 | \item{verbose}{Logical indicating if textual output should be given to monitor the progress of the
 24 | computations.}
 25 | }
 26 | \value{
 27 | \code{\link{binomixEstimate}} returns a \code{list} with two components, the \samp{BIC.tbl}
 28 | and \samp{Mix.tbl}.
 29 | 
 30 | The \samp{BIC.tbl} is a \code{tibble} listing, in each row, the results for each number of components
 31 | used, given by the input \samp{K.range}. The column \samp{Core.size} is the estimated number of
 32 | core gene families, the column \samp{Pan.size} is the estimated pan-genome size. The column
 33 | \samp{BIC} is the Bayesian Information Criterion (Schwarz, 1978) that should be used to choose the
 34 | optimal component number (\samp{K}). The number of components where \samp{BIC} is minimized is the
 35 | optimum. If minimum \samp{BIC} is reached for the largest \samp{K} value you should extend the
 36 | \samp{K.range} to larger values and re-fit. The function will issue
 37 | a \code{warning} to remind you of this.
 38 | 
 39 | The \samp{Mix.tbl} is a \code{tibble} with estimates from the mixture models. The column \samp{Component}
 40 | indicates the model, i.e. all rows where \samp{Component} has the same value are from the same model.
 41 | There will be 3 rows for 3-component model, 4 rows for 4-component, etc. The column \samp{Detection.prob}
 42 | contain the estimated detection probabilities for each component of the mixture models. A 
 43 | \samp{Mixing.proportion} is the proportion of the gene clusters having the corresponding \samp{Detection.prob},
 44 | i.e. if core genes have \samp{Detection.prob} 1.0, the corresponding \samp{Mixing.proportion} (same row)
 45 | indicates how large fraction of the gene families are core genes.
 46 | }
 47 | \description{
 48 | Fits binomial mixture models to the data given as a pan-matrix. From the fitted models
 49 | both estimates of pan-genome size and core-genome size are available.
 50 | }
 51 | \details{
 52 | A binomial mixture model can be used to describe the distribution of gene clusters across
 53 | genomes in a pan-genome. The idea and the details of the computations are given in Hogg et al (2007),
 54 | Snipen et al (2009) and Snipen & Ussery (2012).
 55 | 
 56 | Central to the concept is the idea that every gene has a detection probability, i.e. a probability of
 57 | being present in a genome. Genes who are always present in all genomes are called core genes, and these
 58 | should have a detection probability of 1.0. Other genes are only present in a subset of the genomes, and
 59 | these have smaller detection probabilities. Some genes are only present in one single genome, denoted
 60 | ORFan genes, and an unknown number of genes have yet to be observed. If the number of genomes investigated
 61 | is large these latter must have a very small detection probability. 
 62 | 
 63 | A binomial mixture model with \samp{K} components estimates \samp{K} detection probabilities from the
 64 | data. The more components you choose, the better you can fit the (present) data, at the cost of less
 65 | precision in the estimates due to less degrees of freedom. \code{\link{binomixEstimate}} allows you to
 66 | fit several models, and the input \samp{K.range} specifies which values of \samp{K} to try out. There no
 67 | real point using \samp{K} less than 3, and the default is \samp{K.range=3:5}. In general, the more genomes
 68 | you have the larger you can choose \samp{K} without overfitting.  Computations will be slower for larger
 69 | values of \samp{K}. In order to choose the optimal value for \samp{K}, \code{\link{binomixEstimate}}
 70 | computes the BIC-criterion, see below.
 71 | 
 72 | As the number of genomes grow, we tend to observe an increasing number of gene clusters. Once a
 73 | \samp{K}-component binomial mixture has been fitted, we can estimate the number of gene clusters not yet
 74 | observed, and thereby the pan-genome size. Also, as the number of genomes grows we tend to observe fewer
 75 | core genes. The fitted binomial mixture model also gives an estimate of the final number of core gene
 76 | clusters, i.e. those still left after having observed \sQuote{infinite} many genomes.
 77 | 
 78 | The detection probability of core genes should be 1.0, but can at times be set fractionally smaller.
 79 | This means you accept that even core genes are not always detected in every genome, e.g. they may be
 80 | there, but your gene prediction has missed them. Notice that setting the \samp{core.detect.prob} to less
 81 | than 1.0 may affect the core gene size estimate dramatically.
 82 | }
 83 | \examples{
 84 | # Loading an example pan-matrix
 85 | data(xmpl.panmat)
 86 | 
 87 | # Estimating binomial mixture models
 88 | binmix.lst <- binomixEstimate(xmpl.panmat, K.range = 3:8)
 89 | print(binmix.lst$BIC.tbl) # minimum BIC at 3 components
 90 | 
 91 | \dontrun{
 92 | # The pan-genome gene distribution as a pie-chart
 93 | library(ggplot2)
 94 | ncomp <- 3
 95 | binmix.lst$Mix.tbl \%>\% 
 96 |   filter(Components == ncomp) \%>\% 
 97 |   ggplot() +
 98 |   geom_col(aes(x = "", y = Mixing.proportion, fill = Detection.prob)) +
 99 |   coord_polar(theta = "y") +
100 |   labs(x = "", y = "", title = "Pan-genome gene distribution") +
101 |   scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue"))
102 |   
103 | # The distribution in an average genome
104 | binmix.lst$Mix.tbl \%>\% 
105 |   filter(Components == ncomp) \%>\% 
106 |   mutate(Single = Mixing.proportion * Detection.prob) \%>\%
107 |   ggplot() +
108 |   geom_col(aes(x = "", y = Single, fill = Detection.prob)) +
109 |   coord_polar(theta = "y") +
110 |   labs(x = "", y = "", title = "Average genome gene distribution") +
111 |   scale_fill_gradientn(colors = c("pink", "orange", "green", "cyan", "blue"))
112 | }
113 | 
114 | }
115 | \references{
116 | Hogg, J.S., Hu, F.Z, Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J.C., Ehrlich, G.D. (2007).
117 | Characterization and modeling of the Haemophilus influenzae core- and supra-genomes based on the
118 | complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biology, 8:R103.
119 | 
120 | Snipen, L., Almoy, T., Ussery, D.W. (2009). Microbial comparative pan-genomics using binomial
121 | mixture models. BMC Genomics, 10:385.
122 | 
123 | Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to
124 | Escherichia coli. F1000 Research, 1:19.
125 | 
126 | Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464.
127 | }
128 | \seealso{
129 | \code{\link{panMatrix}}, \code{\link{chao}}.
130 | }
131 | \author{
132 | Lars Snipen and Kristian Hovde Liland.
133 | }
134 | 


--------------------------------------------------------------------------------
/man/blastpAllAll.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/blasting.R
  3 | \name{blastpAllAll}
  4 | \alias{blastpAllAll}
  5 | \title{Making BLAST search all against all genomes}
  6 | \usage{
  7 | blastpAllAll(
  8 |   prot.files,
  9 |   out.folder,
 10 |   e.value = 1,
 11 |   job = 1,
 12 |   threads = 1,
 13 |   start.at = 1,
 14 |   verbose = TRUE
 15 | )
 16 | }
 17 | \arguments{
 18 | \item{prot.files}{A vector with FASTA filenames.}
 19 | 
 20 | \item{out.folder}{The folder where the result files should end up.}
 21 | 
 22 | \item{e.value}{The chosen E-value threshold in BLAST.}
 23 | 
 24 | \item{job}{An integer to separate multiple jobs.}
 25 | 
 26 | \item{threads}{The number of CPU's to use.}
 27 | 
 28 | \item{start.at}{An integer to specify where in the file-list to start BLASTing.}
 29 | 
 30 | \item{verbose}{Logical, if \code{TRUE} some text output is produced to monitor the progress.}
 31 | }
 32 | \value{
 33 | The function produces a result file for each pair of files listed in \samp{prot.files}.
 34 | These result files are located in \code{out.folder}. Existing files are never overwritten by
 35 | \code{\link{blastpAllAll}}, if you want to re-compute something, delete the corresponding result files first.
 36 | }
 37 | \description{
 38 | Runs a reciprocal all-against-all BLAST search to look for similarity of proteins
 39 | within and across genomes.
 40 | }
 41 | \details{
 42 | A basic step in pangenomics and many other comparative studies is to cluster proteins into
 43 | groups or families. One commonly used approach is based on BLASTing. This function uses the
 44 | \samp{blast+} software available for free from NCBI (Camacho et al, 2009). More precisely, the blastp
 45 | algorithm with the BLOSUM45 scoring matrix and all composition based statistics turned off.
 46 | 
 47 | A vector listing FASTA files of protein sequences is given as input in \samp{prot.files}. These files
 48 | must have the genome_id in the first token of every header, and in their filenames as well, i.e. all input
 49 | files should first be prepared by \code{\link{panPrep}} to ensure this. Note that only protein sequences
 50 | are considered here. If your coding genes are stored as DNA, please translate them to protein prior to
 51 | using this function, see \code{\link[microseq]{translate}}.
 52 | 
 53 | In the first version of this package we used reciprocal BLASTing, i.e. we computed both genome A against
 54 | B and B against A. This may sometimes produce slightly different results, but in reality this is too
 55 | costly compared to its gain, and we now only make one of the above searches. This basically halves the
 56 | number of searches. This step is still very time consuming for larger number of genomes. Note that the 
 57 | protein files are sorted by the genome_id (part of filename) inside this function. This is to ensure a 
 58 | consistent ordering irrespective of how they are enterred.
 59 | 
 60 | For every pair of genomes a result file is produced. If two genomes have genome_id's \samp{GID111},
 61 | and \samp{GID222} then the result file \samp{GID222_vs_GID111.txt} will
 62 | be found in \samp{out.folder} after the completion of this search. The last of the two genome_id is always
 63 | the first in alphabetical order of the two.
 64 | 
 65 | The \samp{out.folder} is scanned for already existing result files, and \code{\link{blastpAllAll}} never
 66 | overwrites an existing result file. If a file with the name \samp{GID111_vs_GID222.txt} already exists in
 67 | the \samp{out.folder}, this particular search is skipped. This makes it possible to run multiple jobs in
 68 | parallell, writing to the same \samp{out.folder}. It also makes it possible to add new genomes, and only
 69 | BLAST the new combinations without repeating previous comparisons. 
 70 | 
 71 | This search can be slow if the genomes contain many proteins and it scales quadratically in the number of
 72 | input files. It is best suited for the study of a smaller number of genomes. By
 73 | starting multiple R sessions, you can speed up the search by running \code{\link{blastpAllAll}} from each R
 74 | session, using the same \samp{out.folder} but different integers for the \code{job} option. At the same
 75 | time you may also want to start the BLASTing at different places in the file-list, by giving larger values
 76 | to the argument \code{start.at}. This is 1 by default, i.e. the BLASTing starts at the first protein file.
 77 | If you are using a multicore computer you can also increase the number of CPUs by increasing \code{threads}.
 78 | 
 79 | The result files are tab-separated text files, and can be read into R, but more
 80 | commonly they are used as input to \code{\link{bDist}} to compute distances between sequences for subsequent
 81 | clustering.
 82 | }
 83 | \note{
 84 | The \samp{blast+} software must be installed on the system for this function to work, i.e. the command
 85 | \samp{system("makeblastdb -help")} must be recognized as valid commands if you
 86 | run them in the Console window.
 87 | }
 88 | \examples{
 89 | \dontrun{
 90 | # This example requires the external BLAST+ software
 91 | # Using protein files in this package
 92 | pf <- file.path(path.package("micropan"), "extdata",
 93 |                 str_c("xmpl_GID", 1:3, ".faa.xz"))
 94 | 
 95 | # We need to uncompress them first...
 96 | prot.files <- tempfile(fileext = c("_GID1.faa.xz","_GID2.faa.xz","_GID3.faa.xz"))
 97 | ok <- file.copy(from = pf, to = prot.files)
 98 | prot.files <- unlist(lapply(prot.files, xzuncompress))
 99 | 
100 | # Blasting all versus all
101 | out.dir <- "."
102 | blastpAllAll(prot.files, out.folder = out.dir)
103 | 
104 | # Reading results, and computing blast.distances
105 | blast.files <- list.files(out.dir, pattern = "GID[0-9]+_vs_GID[0-9]+.txt")
106 | blast.distances <- bDist(file.path(out.dir, blast.files))
107 | 
108 | # ...and cleaning...
109 | ok <- file.remove(prot.files)
110 | ok <- file.remove(file.path(out.dir, blast.files))
111 | }
112 | 
113 | }
114 | \references{
115 | Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L.
116 | (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10:421.
117 | }
118 | \seealso{
119 | \code{\link{panPrep}}, \code{\link{bDist}}.
120 | }
121 | \author{
122 | Lars Snipen and Kristian Hovde Liland.
123 | }
124 | 


--------------------------------------------------------------------------------
/man/chao.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/powerlaw.R
 3 | \name{chao}
 4 | \alias{chao}
 5 | \title{The Chao lower bound estimate of pan-genome size}
 6 | \usage{
 7 | chao(pan.matrix)
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | }
12 | \value{
13 | The function returns an integer, the estimated pan-genome size. This includes both the number
14 | of gene clusters observed so far, as well as the estimated number not yet seen.
15 | }
16 | \description{
17 | Computes the Chao lower bound estimated number of gene clusters in a pan-genome.
18 | }
19 | \details{
20 | The size of a pan-genome is the number of gene clusters in it, both those observed and those
21 | not yet observed.
22 | 
23 | The input \samp{pan.matrix} is a a matrix with one row for each
24 | genome and one column for each observed gene cluster in the pan-genome. See \code{\link{panMatrix}}
25 | for how to construct this.
26 | 
27 | The number of observed gene clusters is simply the number of columns in \samp{pan.matrix}. The
28 | number of gene clusters not yet observed is estimated by the Chao lower bound estimator (Chao, 1987).
29 | This is based solely on the number of clusters observed in 1 and 2 genomes. It is a very simple and
30 | conservative estimator, i.e. it is more likely to be too small than too large.
31 | }
32 | \examples{
33 | # Loading a pan-matrix in this package
34 | data(xmpl.panmat)
35 | 
36 | # Estimating the pan-genome size using the Chao estimator
37 | chao.pansize <- chao(xmpl.panmat)
38 | 
39 | }
40 | \references{
41 | Chao, A. (1987). Estimating the population size for capture-recapture data with unequal
42 | catchability. Biometrics, 43:783-791.
43 | }
44 | \seealso{
45 | \code{\link{panMatrix}}, \code{\link{binomixEstimate}}.
46 | }
47 | \author{
48 | Lars Snipen and Kristian Hovde Liland.
49 | }
50 | 


--------------------------------------------------------------------------------
/man/dClust.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/domainclust.R
 3 | \name{dClust}
 4 | \alias{dClust}
 5 | \title{Clustering sequences based on domain sequence}
 6 | \usage{
 7 | dClust(hmmer.tbl)
 8 | }
 9 | \arguments{
10 | \item{hmmer.tbl}{A \code{tibble} of results from a \code{\link{hmmerScan}} against a domain database.}
11 | }
12 | \value{
13 | The output is a numeric vector with one element for each unique sequence in the \samp{Query}
14 | column of the input \samp{hmmer.tbl}. Sequences with identical number belong to the same cluster. The
15 | name of each element identifies the sequence.
16 | 
17 | This vector also has an attribute called \samp{cluster.info} which is a character vector containing the
18 | domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc.
19 | In this way you can, in addition to clustering the sequences, also see which domains the sequences of a
20 | particular cluster share.
21 | }
22 | \description{
23 | Proteins are clustered by their sequence of protein domains. A domain sequence is the
24 | ordered sequence of domains in the protein. All proteins having identical domain sequence are assigned
25 | to the same cluster.
26 | }
27 | \details{
28 | A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins
29 | contain known domains, but those who do will have from one to several domains, and these can be ordered
30 | forming a sequence. Since domains can be more or less conserved, two proteins can be quite different in
31 | their amino acid sequence, and still share the same domains. Describing, and grouping, proteins by their
32 | domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise
33 | alignments, see \code{\link{bClust}}. Domain sequence clusters are less influenced by gene prediction errors.
34 | 
35 | The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}. Typically, it is the
36 | result of scanning proteins (using \code{\link{hmmerScan}}) against Pfam-A or any other HMMER3 database
37 | of protein domains. It is highly reccomended that you remove overlapping hits in \samp{hmmer.tbl} before
38 | you pass it as input to \code{\link{dClust}}. Use the function \code{\link{hmmerCleanOverlap}} for this.
39 | Overlapping hits are in some cases real hits, but often the poorest of them are artifacts.
40 | }
41 | \examples{
42 | # HMMER3 result files in this package
43 | hf <- file.path(path.package("micropan"), "extdata", 
44 |                 str_c("GID", 1:3, "_vs_microfam.hmm.txt.xz"))
45 | 
46 | # We need to uncompress them first...
47 | hmm.files <- tempfile(fileext = rep(".xz", length(hf)))
48 | ok <- file.copy(from = hf, to = hmm.files)
49 | hmm.files <- unlist(lapply(hmm.files, xzuncompress))
50 | 
51 | # Reading the HMMER3 results, cleaning overlaps...
52 | hmmer.tbl <- NULL
53 | for(i in 1:3){
54 |   readHmmer(hmm.files[i]) \%>\% 
55 |     hmmerCleanOverlap() \%>\% 
56 |     bind_rows(hmmer.tbl) -> hmmer.tbl
57 | }
58 | 
59 | # The clustering
60 | clst <- dClust(hmmer.tbl)
61 | 
62 | # ...and cleaning...
63 | ok <- file.remove(hmm.files)
64 | 
65 | }
66 | \references{
67 | Snipen, L. Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications
68 | to Escherichia coli. F1000 Research, 1:19.
69 | }
70 | \seealso{
71 | \code{\link{panPrep}}, \code{\link{hmmerScan}}, \code{\link{readHmmer}},
72 | \code{\link{hmmerCleanOverlap}}, \code{\link{bClust}}.
73 | }
74 | \author{
75 | Lars Snipen and Kristian Hovde Liland.
76 | }
77 | 


--------------------------------------------------------------------------------
/man/distJaccard.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/genomedistances.R
 3 | \name{distJaccard}
 4 | \alias{distJaccard}
 5 | \title{Computing Jaccard distances between genomes}
 6 | \usage{
 7 | distJaccard(pan.matrix)
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | }
12 | \value{
13 | A \code{dist} object (see \code{\link{dist}}) containing all pairwise Jaccard distances
14 | between genomes.
15 | }
16 | \description{
17 | Computes the Jaccard distances between all pairs of genomes.
18 | }
19 | \details{
20 | The Jaccard index between two sets is defined as the size of the intersection of
21 | the sets divided by the size of the union. The Jaccard distance is simply 1 minus the Jaccard index.
22 | 
23 | The Jaccard distance between two genomes describes their degree of overlap with respect to gene
24 | cluster content. If the Jaccard distance is 0.0, the two genomes contain identical gene clusters.
25 | If it is 1.0 the two genomes are non-overlapping. The difference between a genomic fluidity (see
26 | \code{\link{fluidity}}) and a Jaccard distance is small, they both measure overlap between genomes,
27 | but fluidity is computed for the population by averaging over many pairs, while Jaccard distances are
28 | computed for every pair. Note that only presence/absence of gene clusters are considered, not multiple
29 | occurrences.
30 | 
31 | The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}.
32 | }
33 | \examples{
34 | # Loading a pan-matrix in this package
35 | data(xmpl.panmat)
36 | 
37 | # Jaccard distances
38 | Jdist <- distJaccard(xmpl.panmat)
39 | 
40 | # Making a dendrogram based on the distances,
41 | # see example for distManhattan
42 | 
43 | }
44 | \seealso{
45 | \code{\link{panMatrix}}, \code{\link{fluidity}}, \code{\link{dist}}.
46 | }
47 | \author{
48 | Lars Snipen and Kristian Hovde Liland.
49 | }
50 | 


--------------------------------------------------------------------------------
/man/distManhattan.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/genomedistances.R
 3 | \name{distManhattan}
 4 | \alias{distManhattan}
 5 | \title{Computing Manhattan distances between genomes}
 6 | \usage{
 7 | distManhattan(pan.matrix, scale = 0, weights = rep(1, ncol(pan.matrix)))
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | 
12 | \item{scale}{An optional scale to control how copy numbers should affect the distances.}
13 | 
14 | \item{weights}{Vector of optional weights of gene clusters.}
15 | }
16 | \value{
17 | A \code{dist} object (see \code{\link{dist}}) containing all pairwise Manhattan distances
18 | between genomes.
19 | }
20 | \description{
21 | Computes the (weighted) Manhattan distances beween all pairs of genomes.
22 | }
23 | \details{
24 | The Manhattan distance is defined as the sum of absolute elementwise differences between
25 | two vectors. Each genome is represented as a vector (row) of integers in \samp{pan.matrix}. The
26 | Manhattan distance between two genomes is the sum of absolute difference between these rows. If
27 | two rows (genomes) of the \samp{pan.matrix} are identical, the corresponding Manhattan distance
28 | is \samp{0.0}.
29 | 
30 | The \samp{scale} can be used to control how copy number differences play a role in the distances
31 | computed. Usually we assume that going from 0 to 1 copy of a gene is the big change of the genome,
32 | and going from 1 to 2 (or more) copies is less. Prior to computing the Manhattan distance, the
33 | \samp{pan.matrix} is transformed according to the following affine mapping: If the original value in
34 | \samp{pan.matrix} is \samp{x}, and \samp{x} is not 0, then the transformed value is \samp{1 + (x-1)*scale}.
35 | Note that with \samp{scale=0.0} (default) this will result in 1 regardless of how large \samp{x} was.
36 | In this case the Manhattan distance only distinguish between presence and absence of gene clusters.
37 | If \samp{scale=1.0} the value \samp{x} is left untransformed. In this case the difference between 1
38 | copy and 2 copies is just as big as between 1 copy and 0 copies. For any \samp{scale} between 0.0 and
39 | 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still
40 | present. In this way you can decide if the distances between genomes should be affected, and to what
41 | degree, by differences in copy numbers beyond 1. Notice that as long as \samp{scale=0.0} (and no
42 | weighting) the Manhattan distance has a nice interpretation, namely the number of gene clusters that
43 | differ in present/absent status between two genomes.
44 | 
45 | When summing the difference across gene clusters we can also up- or downweight some clusters compared
46 | to others. The vector \samp{weights} must contain one value for each column in \samp{pan.matrix}. The
47 | default is to use flat weights, i.e. all clusters count equal. See \code{\link{geneWeights}} for
48 | alternative weighting strategies.
49 | }
50 | \examples{
51 | # Loading a pan-matrix in this package
52 | data(xmpl.panmat)
53 | 
54 | # Manhattan distances between genomes
55 | Mdist <- distManhattan(xmpl.panmat)
56 | 
57 | \dontrun{
58 | # Making a dendrogram based on shell-weighted distances
59 | library(ggdendro)
60 | weights <- geneWeights(xmpl.panmat, type = "shell")
61 | Mdist <- distManhattan(xmpl.panmat, weights = weights)
62 | ggdendrogram(dendro_data(hclust(Mdist, method = "average")),
63 |   rotate = TRUE, theme_dendro = FALSE) +
64 |   labs(x = "Genomes", y = "Shell-weighted Manhattan distance", title = "Pan-genome dendrogram")
65 | }
66 | 
67 | }
68 | \seealso{
69 | \code{\link{panMatrix}}, \code{\link{distJaccard}}, \code{\link{geneWeights}}.
70 | }
71 | \author{
72 | Lars Snipen and Kristian Hovde Liland.
73 | }
74 | 


--------------------------------------------------------------------------------
/man/entrezDownload.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/entrez.R
 3 | \name{entrezDownload}
 4 | \alias{entrezDownload}
 5 | \title{Downloading genome data}
 6 | \usage{
 7 | entrezDownload(accession, out.file, verbose = TRUE)
 8 | }
 9 | \arguments{
10 | \item{accession}{A character vector containing a set of valid accession numbers at the NCBI
11 | Nucleotide database.}
12 | 
13 | \item{out.file}{Name of the file where downloaded sequences should be written in FASTA format.}
14 | 
15 | \item{verbose}{Logical indicating if textual output should be given during execution, to monitor
16 | the download progress.}
17 | }
18 | \value{
19 | The name of the resulting FASTA file is returned (same as \code{out.file}), but the real result of
20 | this function is the creation of the file itself.
21 | }
22 | \description{
23 | Retrieving genomes from NCBI using the Entrez programming utilities.
24 | }
25 | \details{
26 | The Entrez programming utilities is a toolset for automatic download of data from the
27 | NCBI databases, see \href{https://www.ncbi.nlm.nih.gov/books/NBK25500/}{E-utilities Quick Start}
28 | for details. \code{\link{entrezDownload}} can be used to download genomes from the NCBI Nucleotide
29 | database through these utilities.
30 | 
31 | The argument \samp{accession} must be a set of valid accession numbers at NCBI Nucleotide, typically
32 | all accession numbers related to a genome (chromosomes, plasmids, contigs, etc). For completed genomes,
33 | where the number of sequences is low, \samp{accession} is typically a single text listing all accession
34 | numbers separated by commas. In the case of some draft genomes having a large number of contigs, the
35 | accession numbers must be split into several comma-separated texts. The reason for this is that Entrez
36 | will not accept too many queries in one chunk. 
37 | 
38 | The downloaded sequences are saved in \samp{out.file} on your system. This will be a FASTA formatted file.
39 | Note that all downloaded sequences end up in this file. If you want to download multiple genomes,
40 | you call \code{\link{entrezDownload}} multiple times and store in multiple files.
41 | }
42 | \examples{
43 | \dontrun{
44 | # Accession numbers for the chromosome and plasmid of Buchnera aphidicola, strain APS
45 | acc <- "BA000003.2,AP001071.1"
46 | genome.file <- tempfile(pattern = "Buchnera_aphidicola", fileext = ".fna")
47 | txt <- entrezDownload(acc, out.file = genome.file)
48 | 
49 | # ...cleaning...
50 | ok <- file.remove(genome.file)
51 | }
52 | 
53 | }
54 | \seealso{
55 | \code{\link{getAccessions}}, \code{\link[microseq]{readFasta}}.
56 | }
57 | \author{
58 | Lars Snipen and Kristian Liland.
59 | }
60 | 


--------------------------------------------------------------------------------
/man/extractPanGenes.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/extractPanGenes.R
 3 | \name{extractPanGenes}
 4 | \alias{extractPanGenes}
 5 | \title{Extracting genes of same prevalence}
 6 | \usage{
 7 | extractPanGenes(clustering, N.genomes = 1:2)
 8 | }
 9 | \arguments{
10 | \item{clustering}{Named vector of clustering}
11 | 
12 | \item{N.genomes}{Vector specifying the number of genomes the genes should be in}
13 | }
14 | \value{
15 | A table with columns
16 | \itemize{
17 |   \item cluster. The gene family (integer)
18 |   \item seq_tag. The sequence tag identifying each sequence (text)
19 |   \item N_genomes. The number of genomes in which it is found (integer)
20 | }
21 | }
22 | \description{
23 | Based on a clustering of genes, this function extracts the genes
24 | occurring in the same number of genomes.
25 | }
26 | \details{
27 | Pan-genome studies focus on the gene families obtained by some clustering,
28 | see \code{\link{bClust}} or \code{\link{dClust}}. This function will extract the individual genes from
29 | each genome belonging to gene families found in \code{N.genomes} genomes specified by the user.
30 | Only the sequence tag for each gene is extracted, but the sequences can be added easily, see examples
31 | below.
32 | }
33 | \examples{
34 | # Loading clustering data in this package
35 | data(xmpl.bclst)
36 | 
37 | # Finding genes in 5 genomes
38 | core.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 5)
39 | #...or in a single genome
40 | orfan.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1)
41 | 
42 | \dontrun{
43 | # To add the sequences, assume all protein fasta files are in a folder named faa:
44 | lapply(list.files("faa", full.names = T), readFasta) \%>\% 
45 |   bind_rows() \%>\% 
46 |   mutate(seq_tag = word(Header, 1, 1)) \%>\% 
47 |   right_join(orfan.tbl, by = "seq_tag") -> orfan.tbl
48 |   # The resulting table can be written to fasta file directly using writeFasta()
49 |   # See also geneFamilies2fasta()
50 | }
51 | 
52 | }
53 | \seealso{
54 | \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{geneFamilies2fasta}}.
55 | }
56 | \author{
57 | Lars Snipen.
58 | }
59 | 


--------------------------------------------------------------------------------
/man/fluidity.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/genomedistances.R
 3 | \name{fluidity}
 4 | \alias{fluidity}
 5 | \title{Computing genomic fluidity for a pan-genome}
 6 | \usage{
 7 | fluidity(pan.matrix, n.sim = 10)
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | 
12 | \item{n.sim}{An integer specifying the number of random samples to use in the computations.}
13 | }
14 | \value{
15 | A vector with two elements, the mean fluidity and its sample standard deviation over
16 | the \samp{n.sim} computed values.
17 | }
18 | \description{
19 | Computes the genomic fluidity, which is a measure of population diversity.
20 | }
21 | \details{
22 | The genomic fluidity between two genomes is defined as the number of unique gene
23 | families divided by the total number of gene families (Kislyuk et al, 2011). This is averaged
24 | over \samp{n.sim} random pairs of genomes to obtain a population estimate.
25 | 
26 | The genomic fluidity between two genomes describes their degree of overlap with respect to gene
27 | cluster content. If the fluidity is 0.0, the two genomes contain identical gene clusters. If it
28 | is 1.0 the two genomes are non-overlapping. The difference between a Jaccard distance (see
29 | \code{\link{distJaccard}}) and genomic fluidity is small, they both measure overlap between
30 | genomes, but fluidity is computed for the population by averaging over many pairs, while Jaccard
31 | distances are computed for every pair. Note that only presence/absence of gene clusters are
32 | considered, not multiple occurrences.
33 | 
34 | The input \samp{pan.matrix} is typically constructed by \code{\link{panMatrix}}.
35 | }
36 | \examples{
37 | # Loading a pan-matrix in this package
38 | data(xmpl.panmat)
39 | 
40 | # Fluidity based on this pan-matrix
41 | fluid <- fluidity(xmpl.panmat)
42 | 
43 | }
44 | \references{
45 | Kislyuk, A.O., Haegeman, B., Bergman, N.H., Weitz, J.S. (2011). Genomic fluidity:
46 | an integrative view of gene diversity within microbial populations. BMC Genomics, 12:32.
47 | }
48 | \seealso{
49 | \code{\link{panMatrix}}, \code{\link{distJaccard}}.
50 | }
51 | \author{
52 | Lars Snipen and Kristian Hovde Liland.
53 | }
54 | 


--------------------------------------------------------------------------------
/man/geneFamilies2fasta.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/extractPanGenes.R
 3 | \name{geneFamilies2fasta}
 4 | \alias{geneFamilies2fasta}
 5 | \title{Write gene families to files}
 6 | \usage{
 7 | geneFamilies2fasta(
 8 |   pangene.tbl,
 9 |   fasta.folder,
10 |   out.folder,
11 |   file.ext = "fasta$|faa$|fna$|fa$",
12 |   verbose = TRUE
13 | )
14 | }
15 | \arguments{
16 | \item{pangene.tbl}{A table listing gene families (clusters).}
17 | 
18 | \item{fasta.folder}{The folder containing the fasta files with all sequences.}
19 | 
20 | \item{out.folder}{The folder to write to.}
21 | 
22 | \item{file.ext}{The file extension to recognize the fasta files in \code{fasta.folder}.}
23 | 
24 | \item{verbose}{Logical to allow text ouput during processing}
25 | }
26 | \description{
27 | Writes specified gene families to separate fasta files.
28 | }
29 | \details{
30 | The argument \code{pangene.tbl} should be produced by \code{\link{extractPanGenes}} in order to
31 | contain the columns \code{cluster}, \code{seq_tag} and \code{N_genomes} required by this function. The
32 | files in \code{fasta.folder} must have been prepared by \code{\link{panPrep}} in order to have the proper
33 | sequence tag information. They may contain protein sequences or DNA sequences.
34 | 
35 | If you already added the \code{Header} and \code{Sequence} information to \code{pangene.tbl} these will be
36 | used instead of reading the files in \code{fasta.folder}, but a warning is issued.
37 | }
38 | \examples{
39 | # Loading clustering data in this package
40 | data(xmpl.bclst)
41 | 
42 | # Finding genes in 1,..,5 genomes (all genes)
43 | all.tbl <- extractPanGenes(xmpl.bclst, N.genomes = 1:5)
44 | 
45 | \dontrun{
46 | # All protein fasta files are in a folder named faa, and we write to the current folder:
47 | clusters2fasta(all.tbl, fasta.folder = "faa", out.folder = ".")
48 | 
49 | # use pipe, write to folder "orfans"
50 | extractPanGenes(xmpl.bclst, N.genomes = 1)) \%>\% 
51 |   geneFamilies2fasta(fasta.folder = "faa", out.folder = "orfans")
52 | }
53 | 
54 | }
55 | \seealso{
56 | \code{\link{extractPanGenes}}, \code{\link{writeFasta}}.
57 | }
58 | \author{
59 | Lars Snipen.
60 | }
61 | 


--------------------------------------------------------------------------------
/man/geneWeights.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/genomedistances.R
 3 | \name{geneWeights}
 4 | \alias{geneWeights}
 5 | \title{Gene cluster weighting}
 6 | \usage{
 7 | geneWeights(pan.matrix, type = c("shell", "cloud"))
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | 
12 | \item{type}{A text indicating the weighting strategy.}
13 | }
14 | \value{
15 | A vector of weights, one for each column in \code{pan.matrix}.
16 | }
17 | \description{
18 | This function computes weights for gene cluster according to their distribution in a pan-genome.
19 | }
20 | \details{
21 | When computing distances between genomes or a PCA, it is possible to give weights to the
22 | different gene clusters, emphasizing certain aspects.
23 | 
24 | As proposed by Snipen & Ussery (2010), we have implemented two types of weighting: The default
25 | \samp{"shell"} type means gene families occuring frequently in the genomes, denoted shell-genes, are
26 | given large weight (close to 1) while those occurring rarely are given small weight (close to 0).
27 | The opposite is the \samp{"cloud"} type of weighting. Genes observed in a minority of the genomes are
28 | referred to as cloud-genes. Presumeably, the \samp{"shell"} weighting will give distances/PCA reflecting
29 | a more long-term evolution, since emphasis is put on genes who have just barely diverged away from the
30 | core. The \samp{"cloud"} weighting emphasizes those gene clusters seen rarely. Genomes with similar
31 | patterns among these genes may have common recent history. A \samp{"cloud"} weighting typically gives
32 | a more erratic or \sQuote{noisy} picture than the \samp{"shell"} weighting.
33 | }
34 | \examples{
35 | # See examples for distManhattan
36 | 
37 | }
38 | \references{
39 | Snipen, L., Ussery, D.W. (2010). Standard operating procedure for computing pangenome
40 | trees. Standards in Genomic Sciences, 2:135-141.
41 | }
42 | \seealso{
43 | \code{\link{panMatrix}}, \code{\link{distManhattan}}.
44 | }
45 | \author{
46 | Lars Snipen and Kristian Hovde Liland.
47 | }
48 | 


--------------------------------------------------------------------------------
/man/getAccessions.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/entrez.R
 3 | \name{getAccessions}
 4 | \alias{getAccessions}
 5 | \title{Collecting contig accession numbers}
 6 | \usage{
 7 | getAccessions(master.record.accession, chunk.size = 99)
 8 | }
 9 | \arguments{
10 | \item{master.record.accession}{The accession number (single text) to a master record GenBank file having
11 | the WGS entry specifying the accession numbers to all contigs of the WGS genome.}
12 | 
13 | \item{chunk.size}{The maximum number of accession numbers returned in one text.}
14 | }
15 | \value{
16 | A character vector where each element is a text listing the accession numbers separated by comma.
17 | Each vector element will contain no more than \code{chunk.size} accession numbers, see 
18 | \code{\link{entrezDownload}} for details on this. The vector returned by \code{\link{getAccessions}}
19 | is typically used as input to \code{\link{entrezDownload}}.
20 | }
21 | \description{
22 | Retrieving the accession numbers for all contigs from a master record GenBank file.
23 | }
24 | \details{
25 | In order to download a WGS genome (draft genome) using \code{\link{entrezDownload}} you will
26 | need the accession number of every contig. This is found in the master record GenBank file, which is
27 | available for every WGS genome. \code{\link{getAccessions}} will extract these from the GenBank file and
28 | return them in the apropriate way to be used by \code{\link{entrezDownload}}.
29 | 
30 | The download API at NCBI will not tolerate too many accessions per query, and for this reason you need
31 | to split the accessions for many contigs into several texts using \code{chunk.size}.
32 | }
33 | \examples{
34 | \dontrun{
35 | # The master record accession for the WGS genome Mycoplasma genitalium, strain G37
36 | acc <- getAccessions("AAGX00000000")
37 | # Then we use this to download all contigs and save them
38 | genome.file <- tempfile(fileext = ".fna")
39 | txt <- entrezDownload(acc, out.file = genome.file)
40 | 
41 | # ...cleaning...
42 | ok <- file.remove(genome.file)
43 | }
44 | 
45 | }
46 | \seealso{
47 | \code{\link{entrezDownload}}.
48 | }
49 | \author{
50 | Lars Snipen and Kristian Liland.
51 | }
52 | 


--------------------------------------------------------------------------------
/man/heaps.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/powerlaw.R
 3 | \name{heaps}
 4 | \alias{heaps}
 5 | \title{Heaps law estimate}
 6 | \usage{
 7 | heaps(pan.matrix, n.perm = 100)
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | 
12 | \item{n.perm}{The number of random permutations of genome ordering.}
13 | }
14 | \value{
15 | A vector of two estimated parameters: The \samp{Intercept} and the decay parameter \samp{alpha}.
16 | If \samp{alpha<1.0} the pan-genome is open, if \samp{alpha>1.0} it is closed.
17 | }
18 | \description{
19 | Estimating if a pan-genome is open or closed based on a Heaps law model.
20 | }
21 | \details{
22 | An open pan-genome means there will always be new gene clusters observed as long as new genomes
23 | are being sequenced. This may sound controversial, but in a pragmatic view, an open pan-genome indicates
24 | that the number of new gene clusters to be observed in future genomes is \sQuote{large} (but not literally
25 | infinite). Opposite, a closed pan-genome indicates we are approaching the end of new gene clusters. 
26 | 
27 | This function is based on a Heaps law approach suggested by Tettelin et al (2008). The Heaps law model
28 | is fitted to the number of new gene clusters observed when genomes are ordered in a random way. The model
29 | has two parameters, an intercept and a decay parameter called \samp{alpha}. If \samp{alpha>1.0} the
30 | pan-genome is closed, if \samp{alpha<1.0} it is open.
31 | 
32 | The number of permutations, \samp{n.perm}, should be as large as possible, limited by computation time.
33 | The default value of 100 is certainly a minimum.
34 | 
35 | Word of caution: The Heaps law assumes independent sampling. If some of the genomes in the data set
36 | form distinct sub-groups in the population, this may affect the results of this analysis severely.
37 | }
38 | \examples{
39 | # Loading a pan-matrix in this package 
40 | data(xmpl.panmat)
41 | 
42 | # Estimating population openness
43 | h.est <- heaps(xmpl.panmat, n.perm = 500)
44 | print(h.est)
45 | # If alpha < 1 it indicates an open pan-genome
46 | 
47 | }
48 | \references{
49 | Tettelin, H., Riley, D., Cattuto, C., Medini, D. (2008). Comparative genomics: the
50 | bacterial pan-genome. Current Opinions in Microbiology, 12:472-477.
51 | }
52 | \seealso{
53 | \code{\link{binomixEstimate}}, \code{\link{chao}}, \code{\link{rarefaction}}.
54 | }
55 | \author{
56 | Lars Snipen and Kristian Hovde Liland.
57 | }
58 | 


--------------------------------------------------------------------------------
/man/hmmerCleanOverlap.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/domainclust.R
 3 | \name{hmmerCleanOverlap}
 4 | \alias{hmmerCleanOverlap}
 5 | \title{Removing overlapping hits from HMMER3 scans}
 6 | \usage{
 7 | hmmerCleanOverlap(hmmer.tbl)
 8 | }
 9 | \arguments{
10 | \item{hmmer.tbl}{A table (\code{tibble}) with \code{\link{hmmerScan}} results, see \code{\link{readHmmer}}.}
11 | }
12 | \value{
13 | A \code{tibble} which is a subset of the input, where some rows may have been deleted to
14 | avoid overlapping hits.
15 | }
16 | \description{
17 | Removing hits to avoid overlapping HMMs on the same protein sequence.
18 | }
19 | \details{
20 | When scanning sequences against a profile HMM database using \code{\link{hmmerScan}}, we
21 | often find that several patterns (HMMs) match in the same region of the query sequence, i.e. we have
22 | overlapping hits. The function \code{\link{hmmerCleanOverlap}} will remove the poorest overlapping hit
23 | in a recursive way such that all overlaps are eliminated.
24 | 
25 | The input is a \code{tibble} of the type produced by \code{\link{readHmmer}}.
26 | }
27 | \examples{
28 | # See the example in the Help-file for dClust.
29 | 
30 | }
31 | \seealso{
32 | \code{\link{hmmerScan}}, \code{\link{readHmmer}}, \code{\link{dClust}}.
33 | }
34 | \author{
35 | Lars Snipen and Kristian Hovde Liland.
36 | }
37 | 


--------------------------------------------------------------------------------
/man/hmmerScan.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/hmmer3.R
 3 | \name{hmmerScan}
 4 | \alias{hmmerScan}
 5 | \title{Scanning a profile Hidden Markov Model database}
 6 | \usage{
 7 | hmmerScan(in.files, dbase, out.folder, threads = 0, verbose = TRUE)
 8 | }
 9 | \arguments{
10 | \item{in.files}{A character vector of file names.}
11 | 
12 | \item{dbase}{The full path-name of the database to scan (text).}
13 | 
14 | \item{out.folder}{The name of the folder to put the result files.}
15 | 
16 | \item{threads}{Number of CPU's to use.}
17 | 
18 | \item{verbose}{Logical indicating if textual output should be given to monitor the progress.}
19 | }
20 | \value{
21 | This function produces files in the folder specified by \samp{out.folder}. Existing files are
22 | never overwritten by \code{\link{hmmerScan}}, if you want to re-compute something, delete the
23 | corresponding result files first.
24 | }
25 | \description{
26 | Scanning FASTA formatted protein files against a database of pHMMs using the HMMER3
27 | software.
28 | }
29 | \details{
30 | The HMMER3 software is purpose-made for handling profile Hidden Markov Models (pHMM)
31 | describing patterns in biological sequences (Eddy, 2008). This function will make calls to the
32 | HMMER3 software to scan FASTA files of proteins against a pHMM database. 
33 | 
34 | The files named in \samp{in.files} must contain FASTA formatted protein sequences. These files
35 | should be prepared by \code{\link{panPrep}} to make certain each sequence, as well as the file name,
36 | has a GID-tag identifying their genome. The database named in \samp{db} must be a HMMER3 formatted
37 | database. It is typically the Pfam-A database, but you can also make your own HMMER3 databases, see
38 | the HMMER3 documentation for help.
39 | 
40 | \code{\link{hmmerScan}} will query every input file against the named database. The database contains
41 | profile Hidden Markov Models describing position specific sequence patterns. Each sequence in every
42 | input file is scanned to see if some of the patterns can be matched to some degree. Each input file
43 | results in an output file with the same GID-tag in the name. The result files give tabular output, and
44 | are plain text files. See \code{\link{readHmmer}} for how to read the results into R.
45 | 
46 | Scanning large databases like Pfam-A takes time, usually several minutes per genome. The scan is set
47 | up to use only 1 cpu per scan by default. By increasing \code{threads} you can utilize multiple CPUs, typically
48 | on a computing cluster.
49 | Our experience is that from a multi-core laptop it is better to start this function in default mode
50 | from mutliple R-sessions. This function will not overwrite an existing result file, and multiple parallel
51 | sessions can write results to the same folder.
52 | }
53 | \note{
54 | The HMMER3 software must be installed on the system for this function to work, i.e. the command
55 | \samp{system("hmmscan -h")} must be recognized as a valid command if you run it in the Console window.
56 | }
57 | \examples{
58 | \dontrun{
59 | # This example require the external software HMMER
60 | # Using example files in this package
61 | pf <- file.path(path.package("micropan"), "extdata", "xmpl_GID1.faa.xz")
62 | dbf <- file.path(path.package("micropan"), "extdata",
63 |                  str_c("microfam.hmm", c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz")))
64 | 
65 | # We need to uncompress them first...
66 | prot.file <- tempfile(pattern = "GID1.faa", fileext=".xz")
67 | ok <- file.copy(from = pf, to = prot.file)
68 | prot.file <- xzuncompress(prot.file)
69 | db.files <- str_c(tempfile(), c(".h3f.xz",".h3i.xz",".h3m.xz",".h3p.xz"))
70 | ok <- file.copy(from = dbf, to = db.files)
71 | db.files <- unlist(lapply(db.files, xzuncompress))
72 | db.name <- str_remove(db.files[1], "\\\\.[a-z0-9]+$")
73 | 
74 | # Scanning the FASTA file against microfam.hmm...
75 | hmmerScan(in.files = prot.file, dbase = db.name, out.folder = ".")
76 | 
77 | # Reading results
78 | hmm.file <- file.path(".", str_c("GID1_vs_", basename(db.name), ".txt"))
79 | hmm.tbl <- readHmmer(hmm.file)
80 | 
81 | # ...and cleaning...
82 | ok <- file.remove(prot.file)
83 | ok <- file.remove(str_remove(db.files, ".xz"))
84 | }
85 | 
86 | }
87 | \references{
88 | Eddy, S.R. (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies
89 | Statistical Significance Estimation. PLoS Computational Biology, 4(5).
90 | }
91 | \seealso{
92 | \code{\link{panPrep}}, \code{\link{readHmmer}}.
93 | }
94 | \author{
95 | Lars Snipen and Kristian Hovde Liland.
96 | }
97 | 


--------------------------------------------------------------------------------
/man/isOrtholog.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/bclust.R
 3 | \name{isOrtholog}
 4 | \alias{isOrtholog}
 5 | \title{Identifies orthologs in gene clusters}
 6 | \usage{
 7 | isOrtholog(clustering, dist.tbl)
 8 | }
 9 | \arguments{
10 | \item{clustering}{A vector of integers indicating the cluster for every sequence. Sequences with
11 | the same number belong to the same cluster. The name of each element is the tag identifying the sequence.}
12 | 
13 | \item{dist.tbl}{A \code{tibble} with pairwise distances. The columns \samp{Query} and
14 | \samp{Hit} contain tags identifying pairs of sequences. The column \samp{Distance} contains the
15 | distances, always a number from 0.0 to 1.0.}
16 | }
17 | \value{
18 | A vector of logicals with the same number of elements as the input \samp{clustering}, indicating
19 | if the corresponding sequence is an ortholog (\code{TRUE}) or not (\code{FALSE}). The name of each
20 | element is copied from \samp{clustering}.
21 | }
22 | \description{
23 | Finds the ortholog sequences in every cluster based on pairwise distances.
24 | }
25 | \details{
26 | The input \code{clustering} is typically produced by \code{\link{bClust}}. The input
27 | \code{dist.tbl} is typically produced by \code{\link{bDist}}.
28 | 
29 | The concept of orthologs is difficult for prokaryotes, and this function finds orthologs in a
30 | simplistic way. For a given cluster, with members from many genomes, there is one ortholog from every
31 | genome. In cases where a genome has two or more members in the same cluster, only one of these is an
32 | ortholog, the rest are paralogs.
33 | 
34 | Consider all sequences from the same genome belonging to the same cluster. The ortholog is defined as
35 | the one having the smallest sum of distances to all other members of the same cluster, i.e. the one
36 | closest to the \sQuote{center} of the cluster.
37 | 
38 | Note that the status as ortholog or paralog depends greatly on how clusters are defined in the first
39 | place. If you allow large and diverse (and few) clusters, many sequences will be paralogs. If you define
40 | tight and homogenous (and many) clusters, almost all sequences will be orthologs.
41 | }
42 | \examples{
43 | \dontrun{
44 | # Loading distance data and their clustering results
45 | data(list = c("xmpl.bdist","xmpl.bclst"))
46 | 
47 | # Finding orthologs
48 | is.ortholog <- isOrtholog(xmpl.bclst, xmpl.bdist)
49 | # The orthologs are
50 | which(is.ortholog)
51 | }
52 | 
53 | }
54 | \seealso{
55 | \code{\link{bDist}}, \code{\link{bClust}}.
56 | }
57 | \author{
58 | Lars Snipen and Kristian Hovde Liland.
59 | }
60 | 


--------------------------------------------------------------------------------
/man/micropan.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/micropan.R
 3 | \name{micropan}
 4 | \alias{micropan}
 5 | \alias{micropan-package}
 6 | \title{Microbial Pan-Genome Analysis}
 7 | \description{
 8 | A collection of functions for computations and visualizations of microbial pan-genomes.
 9 | Some of the functions make use of external software that needs to be installed on the system, see the
10 | package vignette for more details on this.
11 | }
12 | \references{
13 | Snipen, L., Liland, KH. (2015). micropan: an R-package for microbial pan-genomics. BMC Bioinformatics, 16:79.
14 | }
15 | \author{
16 | Lars Snipen and Kristian Hovde Liland.
17 | 
18 | Maintainer: Lars Snipen <lars.snipen@nmbu.no>
19 | }
20 | 


--------------------------------------------------------------------------------
/man/panMatrix.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/panmat.R
 3 | \name{panMatrix}
 4 | \alias{panMatrix}
 5 | \title{Computing the pan-matrix for a set of gene clusters}
 6 | \usage{
 7 | panMatrix(clustering)
 8 | }
 9 | \arguments{
10 | \item{clustering}{A named vector of integers.}
11 | }
12 | \value{
13 | An integer matrix with a row for each genome and a column for each sequence cluster.
14 | The input vector \samp{clustering} is attached as the attribute \samp{clustering}.
15 | }
16 | \description{
17 | A pan-matrix has one row for each genome and one column for each gene cluster, and
18 | cell \samp{[i,j]} indicates how many members genome \samp{i} has in gene family \samp{j}.
19 | }
20 | \details{
21 | The pan-matrix is a central data structure for pan-genomic analysis. It is a matrix with
22 | one row for each genome in the study, and one column for each gene cluster. Cell \samp{[i,j]}
23 | contains an integer indicating how many members genome \samp{i} has in cluster \samp{j}.
24 | 
25 | The input \code{clustering} must be a named integer vector with one element for each sequence in the study,
26 | typically produced by either \code{\link{bClust}} or \code{\link{dClust}}. The name of each element
27 | is a text identifying every sequence. The value of each element indicates the cluster, i.e. those
28 | sequences with identical values are in the same cluster. IMPORTANT: The name of each sequence must
29 | contain the \samp{genome_id} for each genome, i.e. they must of the form \samp{GID111_seq1}, \samp{GID111_seq2},...
30 | where the \samp{GIDxxx} part indicates which genome the sequence belongs to. See \code{\link{panPrep}}
31 | for details.
32 | 
33 | The rows of the pan-matrix is named by the \samp{genome_id} for every genome. The columns are just named
34 | \samp{Cluster_x} where \samp{x} is an integer copied from \samp{clustering}.
35 | }
36 | \examples{
37 | # Loading clustering data in this package
38 | data(xmpl.bclst)
39 | 
40 | # Pan-matrix based on the clustering
41 | panmat <- panMatrix(xmpl.bclst)
42 | 
43 | \dontrun{
44 | # Plotting cluster distribution
45 | library(ggplot2)
46 | tibble(Clusters = as.integer(table(factor(colSums(panmat > 0), levels = 1:nrow(panmat)))),
47 |        Genomes = 1:nrow(panmat)) \%>\% 
48 | ggplot(aes(x = Genomes, y = Clusters)) +
49 | geom_col()
50 | }
51 | 
52 | }
53 | \seealso{
54 | \code{\link{bClust}}, \code{\link{dClust}}, \code{\link{distManhattan}},
55 | \code{\link{distJaccard}}, \code{\link{fluidity}}, \code{\link{chao}},
56 | \code{\link{binomixEstimate}}, \code{\link{heaps}}, \code{\link{rarefaction}}.
57 | }
58 | \author{
59 | Lars Snipen and Kristian Hovde Liland.
60 | }
61 | 


--------------------------------------------------------------------------------
/man/panPca.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/panpca.R
 3 | \name{panPca}
 4 | \alias{panPca}
 5 | \title{Principal component analysis of a pan-matrix}
 6 | \usage{
 7 | panPca(pan.matrix, scale = 0, weights = rep(1, ncol(pan.matrix)))
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | 
12 | \item{scale}{An optional scale to control how copy numbers should affect the distances.}
13 | 
14 | \item{weights}{Vector of optional weights of gene clusters.}
15 | }
16 | \value{
17 | A \code{list} with three tables:
18 | 
19 | \samp{Evar.tbl} has two columns, one listing the component number and one listing the relative 
20 | explained variance for each component. The relative explained variance always sums to 1.0 over
21 | all components. This value indicates the importance of each component, and it is always in
22 | descending order, the first component being the most important.
23 | This is typically the first result you look at after a PCA has been computed, as it indicates
24 | how many components (directions) you need to capture the bulk of the total variation in the data.
25 | 
26 | \samp{Scores.tbl} has a column listing the \samp{GID.tag} for each genome, and then one column for each
27 | principal component. The columns are ordered corresponding to the elements in \samp{Evar}. The
28 | scores are the coordinates of each genome in the principal component space.
29 | 
30 | \samp{Loadings.tbl} is similar to \samp{Scores.tbl} but contain values for each gene cluster
31 | instead of each genome. The columns are ordered corresponding to the elements in \samp{Evar}.
32 | The loadings are the contributions from each gene cluster to the principal component directions.
33 | NOTE: Only gene clusters having a non-zero variance is used in a PCA. Gene clusters with the
34 | same value for every genome have no impact and are discarded from the \samp{Loadings}.
35 | }
36 | \description{
37 | Computes a principal component decomposition of a pan-matrix, with possible
38 | scaling and weightings.
39 | }
40 | \details{
41 | A principal component analysis (PCA) can be computed for any matrix, also a pan-matrix.
42 | The principal components will in this case be linear combinations of the gene clusters. One major
43 | idea behind PCA is to truncate the space, e.g. instead of considering the genomes as points in a
44 | high-dimensional space spanned by all gene clusters, we look for a few \sQuote{smart} combinations
45 | of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions.
46 | 
47 | The \samp{scale} can be used to control how copy number differences play a role in the PCA. Usually
48 | we assume that going from 0 to 1 copy of a gene is the big change of the genome, and going from 1 to
49 | 2 (or more) copies is less. Prior to computing the PCA, the \samp{pan.matrix} is transformed according
50 | to the following affine mapping: If the original value in \samp{pan.matrix} is \samp{x}, and \samp{x}
51 | is not 0, then the transformed value is \samp{1 + (x-1)*scale}. Note that with \samp{scale=0.0}
52 | (default) this will result in 1 regardless of how large \samp{x} was. In this case the PCA only
53 | distinguish between presence and absence of gene clusters. If \samp{scale=1.0} the value \samp{x} is
54 | left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between
55 | 1 copy and 0 copies. For any \samp{scale} between 0.0 and 1.0 the transformed value is shrunk towards
56 | 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA
57 | should be affected, and to what degree, by differences in copy numbers beyond 1.
58 | 
59 | The PCA may also up- or downweight some clusters compared to others. The vector \samp{weights} must
60 | contain one value for each column in \samp{pan.matrix}. The default is to use flat weights, i.e. all
61 | clusters count equal. See \code{\link{geneWeights}} for alternative weighting strategies.
62 | }
63 | \examples{
64 | # Loading a pan-matrix in this package
65 | data(xmpl.panmat)
66 | 
67 | # Computing panPca
68 | ppca <- panPca(xmpl.panmat)
69 | 
70 | \dontrun{
71 | # Plotting explained variance
72 | library(ggplot2)
73 | ggplot(ppca$Evar.tbl) +
74 |   geom_col(aes(x = Component, y = Explained.variance))
75 | # Plotting scores
76 | ggplot(ppca$Scores.tbl) +
77 |   geom_text(aes(x = PC1, y = PC2, label = GID.tag))
78 | # Plotting loadings
79 | ggplot(ppca$Loadings.tbl) +
80 |   geom_text(aes(x = PC1, y = PC2, label = Cluster))
81 | }
82 | 
83 | }
84 | \seealso{
85 | \code{\link{distManhattan}}, \code{\link{geneWeights}}.
86 | }
87 | \author{
88 | Lars Snipen and Kristian Hovde Liland.
89 | }
90 | 


--------------------------------------------------------------------------------
/man/panPrep.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/panprep.R
 3 | \name{panPrep}
 4 | \alias{panPrep}
 5 | \title{Preparing FASTA files for pan-genomics}
 6 | \usage{
 7 | panPrep(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = "")
 8 | }
 9 | \arguments{
10 | \item{in.file}{The name of a FASTA formatted file with protein or nucleotide sequences for coding
11 | genes in a genome.}
12 | 
13 | \item{genome_id}{The Genome Identifier, see below.}
14 | 
15 | \item{out.file}{Name of file where the prepared sequences will be written.}
16 | 
17 | \item{protein}{Logical, indicating if the \samp{in.file} contains protein (\code{TRUE}) or
18 | nucleotide (\code{FALSE}) sequences.}
19 | 
20 | \item{min.length}{Minimum sequence length}
21 | 
22 | \item{discard}{A text, a regular expression, and sequences having a match against this in their
23 | headerline will be discarded.}
24 | }
25 | \value{
26 | This function produces a FASTA formatted sequence file, and returns the name of this file.
27 | }
28 | \description{
29 | Preparing a FASTA file before starting comparisons of sequences.
30 | }
31 | \details{
32 | This function will read the \code{in.file} and produce another, slightly modified, FASTA file
33 | which is prepared for the comparisons using \code{\link{blastpAllAll}}, \code{\link{hmmerScan}}
34 | or any other method.
35 | 
36 | The main purpose of \code{\link{panPrep}} is to make certain every sequence is labeled with a tag
37 | called a \samp{genome_id} identifying the genome from which it comes. This text contains the text
38 | \dQuote{GID} followed by an integer. This integer can be any integer as long as it is unique to every
39 | genome in the study. If a genome has the text \dQuote{GID12345} as identifier, then the
40 | sequences in the file produced by \code{\link{panPrep}} will have headerlines starting with
41 | \dQuote{GID12345_seq1}, \dQuote{GID12345_seq2}, \dQuote{GID12345_seq3}...etc. This makes it possible
42 | to quickly identify which genome every sequence belongs to.
43 | 
44 | The \samp{genome_id} is also added to the file name specified in \samp{out.file}. For this reason the
45 | \samp{out.file} must have a file extension containing letters only. By convention, we expect FASTA
46 | files to have one of the extensions \samp{.fsa}, \samp{.faa}, \samp{.fa} or \samp{.fasta}.
47 | 
48 | \code{\link{panPrep}} will also remove sequences shorter than \code{min.length}, removing stop codon
49 | symbols (\samp{*}), replacing alien characters with \samp{X} and converting all sequences to upper-case.
50 | If the input \samp{discard} contains a regular expression, any sequences having a match to this in their
51 | headerline are also removed. Example: If we use the \code{prodigal} software (see \code{\link[microseq]{findGenes}})
52 | to find proteins in a genome, partially predicted genes will have the text \samp{partial=10} or
53 | \samp{partial=01} in their headerline. Using \samp{discard= "partial=01|partial=10"} will remove
54 | these from the data set.
55 | }
56 | \examples{
57 | # Using a protein file in this package
58 | # We need to uncompress it first...
59 | pf <- file.path(path.package("micropan"),"extdata","xmpl.faa.xz")
60 | prot.file <- tempfile(fileext = ".xz")
61 | ok <- file.copy(from = pf, to = prot.file)
62 | prot.file <- xzuncompress(prot.file)
63 | 
64 | # Prepping it, using the genome_id "GID123"
65 | prepped.file <- panPrep(prot.file, genome_id = "GID123", out.file = tempfile(fileext = ".faa"))
66 | 
67 | # Reading the prepped file
68 | prepped <- readFasta(prepped.file)
69 | head(prepped)
70 | 
71 | # ...and cleaning...
72 | ok <- file.remove(prot.file, prepped.file)
73 | 
74 | }
75 | \seealso{
76 | \code{\link{hmmerScan}}, \code{\link{blastpAllAll}}.
77 | }
78 | \author{
79 | Lars Snipen and Kristian Liland.
80 | }
81 | 


--------------------------------------------------------------------------------
/man/rarefaction.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/rarefaction.R
 3 | \name{rarefaction}
 4 | \alias{rarefaction}
 5 | \title{Rarefaction curves for a pan-genome}
 6 | \usage{
 7 | rarefaction(pan.matrix, n.perm = 1)
 8 | }
 9 | \arguments{
10 | \item{pan.matrix}{A pan-matrix, see \code{\link{panMatrix}} for details.}
11 | 
12 | \item{n.perm}{The number of random genome orderings to use. If \samp{n.perm=1} the fixed order of
13 | the genomes in \samp{pan.matrix} is used.}
14 | }
15 | \value{
16 | A table with the curves in the columns. The first column is the number of genomes, while 
17 | all other columns are the cumulative number of clusters, one column for each permutation.
18 | }
19 | \description{
20 | Computes rarefaction curves for a number of random permutations of genomes.
21 | }
22 | \details{
23 | A rarefaction curve is simply the cumulative number of unique gene clusters we observe as
24 | more and more genomes are being considered. The shape of this curve will depend on the order of the
25 | genomes. This function will typically compute rarefaction curves for a number of (\samp{n.perm})
26 | orderings. By using a large number of permutations, and then averaging over the results, the effect
27 | of any particular ordering is smoothed.
28 | 
29 | The averaged curve illustrates how many new gene clusters we observe for each new genome. If this
30 | levels out and becomes flat, it means we expect few, if any, new gene clusters by sequencing more
31 | genomes. The function \code{\link{heaps}} can be used to estimate population openness based on this
32 | principle.
33 | }
34 | \examples{
35 | # Loading a pan-matrix in this package 
36 | data(xmpl.panmat)
37 | 
38 | # Rarefaction
39 | rar.tbl <- rarefaction(xmpl.panmat, n.perm = 1000)
40 | 
41 | \dontrun{
42 | # Plotting
43 | library(ggplot2)
44 | library(tidyr)
45 | rar.tbl \%>\% 
46 |   gather(key = "Permutation", value = "Clusters", -Genome) \%>\% 
47 |   ggplot(aes(x = Genome, y = Clusters, group = Permutation)) +
48 |     geom_line()
49 | }
50 | 
51 | }
52 | \seealso{
53 | \code{\link{heaps}}, \code{\link{panMatrix}}.
54 | }
55 | \author{
56 | Lars Snipen and Kristian Hovde Liland.
57 | }
58 | 


--------------------------------------------------------------------------------
/man/readBlastSelf.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/bdist.R
 3 | \name{readBlastSelf}
 4 | \alias{readBlastSelf}
 5 | \alias{readBlastPair}
 6 | \title{Reads BLAST result files}
 7 | \usage{
 8 | readBlastSelf(blast.files, e.value = 1, verbose = TRUE)
 9 | }
10 | \arguments{
11 | \item{blast.files}{A text vector of filenames.}
12 | 
13 | \item{e.value}{A threshold E-value to immediately discard (very) poor BLAST alignments.}
14 | 
15 | \item{verbose}{Logical, indicating if textual output should be given to monitor the progress.}
16 | }
17 | \value{
18 | The functions returns a table with columns \samp{Dbase}, \samp{Query}, \samp{Bitscore}
19 | and \samp{Distance}. Each row corresponds to a pair of sequences (a Dbase and a Query sequence) having at least
20 | one BLAST hit between
21 | them. All pairs \emph{not} listed have distance 1.0 between them. You should normally bind the output from 
22 | \code{readBlastSelf} to the ouptut from \code{readBlastPair} and use the result as input to \code{\link{bDist}}.
23 | }
24 | \description{
25 | Reads files from a search with blastpAllAll
26 | }
27 | \details{
28 | The filenames given as input must refer to BLAST result files produced by \code{\link{blastpAllAll}}.
29 | 
30 | With \code{readBlastSelf} you only read the self-alignment results, i.e. blasting a genome against itself. With
31 | \code{readBlastPair} you read all the other files, i.e. different genomes compared. You may use all blast file
32 | names as input to both, they will select the proper files based on their names, e.g. GID1_vs_GID1.txt is read
33 | by \code{readBlastSelf} while GID2_vs_GID1.txt is read by \code{readBlastPair}.
34 | 
35 | Setting a small \samp{e.value} threshold will filter the alignment, and may speed up this and later processing,
36 | but you may also loose some important alignments for short sequences.
37 | 
38 | Both these functions are used by \code{\link{bDist}}. The reason we provide them separately is to allow the user
39 | to complete this file reading before calling \code{\link{bDist}}. If you have a huge number of files, a
40 | skilled user may utilize parallell processing to speed up the reading. For normal size data sets (e.g. less than 100 genomes)
41 | you should probably use \code{\link{bDist}} directly.
42 | }
43 | \examples{
44 | # Using BLAST result files in this package...
45 | prefix <- c("GID1_vs_GID1_",
46 |             "GID2_vs_GID1_",
47 |             "GID3_vs_GID1_",
48 |             "GID2_vs_GID2_",
49 |             "GID3_vs_GID2_",
50 |             "GID3_vs_GID3_")
51 | bf <- file.path(path.package("micropan"), "extdata", str_c(prefix, ".txt.xz"))
52 | 
53 | # We need to uncompress them first...
54 | blast.files <- tempfile(pattern = prefix, fileext = ".txt.xz")
55 | ok <- file.copy(from = bf, to = blast.files)
56 | blast.files <- unlist(lapply(blast.files, xzuncompress))
57 | 
58 | # Reading self-alignment files, then the other files
59 | self.tbl <- readBlastSelf(blast.files)
60 | pair.tbl <- readBlastPair(blast.files)
61 | 
62 | # ...and cleaning...
63 | ok <- file.remove(blast.files)
64 | 
65 | # See also examples for bDist
66 | 
67 | }
68 | \seealso{
69 | \code{\link{bDist}}, \code{\link{blastpAllAll}}.
70 | }
71 | \author{
72 | Lars Snipen.
73 | }
74 | 


--------------------------------------------------------------------------------
/man/readHmmer.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/hmmer3.R
 3 | \name{readHmmer}
 4 | \alias{readHmmer}
 5 | \title{Reading results from a HMMER3 scan}
 6 | \usage{
 7 | readHmmer(hmmer.file, e.value = 1, use.acc = TRUE)
 8 | }
 9 | \arguments{
10 | \item{hmmer.file}{The name of a \code{\link{hmmerScan}} result file.}
11 | 
12 | \item{e.value}{Numeric threshold, hits with E-value above this are ignored (default is 1.0).}
13 | 
14 | \item{use.acc}{Logical indicating if accession numbers should be used to identify the hits.}
15 | }
16 | \value{
17 | The results are returned in a \samp{tibble} with columns \samp{Query}, \samp{Hit},
18 | \samp{Evalue}, \samp{Score}, \samp{Start}, \samp{Stop} and \samp{Description}. \samp{Query} is the tag
19 | identifying each query sequence. \samp{Hit} is the name or accession number for a pHMM in the database
20 | describing patterns. The \samp{Evalue} is the \samp{ievalue} in the HMMER3 terminology. The \samp{Score}
21 | is the HMMER3 score for the match between \samp{Query} and \samp{Hit}. The \samp{Start} and \samp{Stop}
22 | are the positions within the \samp{Query} where the \samp{Hit} (pattern) starts and stops.
23 | \samp{Description} is the description of the \samp{Hit}. There is one line for each hit.
24 | }
25 | \description{
26 | Reading a text file produced by \code{\link{hmmerScan}}.
27 | }
28 | \details{
29 | The function reads a text file produced by \code{\link{hmmerScan}}. By specifying a smaller
30 | \samp{e.value} you filter out poorer hits, and fewer results are returned. The option \samp{use.acc}
31 | should be turned off (FALSE) if you scan against your own database where accession numbers are lacking.
32 | }
33 | \examples{
34 | # See the examples in the Help-files for dClust and hmmerScan.
35 | 
36 | }
37 | \seealso{
38 | \code{\link{hmmerScan}}, \code{\link{hmmerCleanOverlap}}, \code{\link{dClust}}.
39 | }
40 | \author{
41 | Lars Snipen and Kristian Hovde Liland.
42 | }
43 | 


--------------------------------------------------------------------------------
/man/xmpl.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/xmpl.R
 3 | \docType{data}
 4 | \name{xmpl}
 5 | \alias{xmpl}
 6 | \alias{xmpl.bdist}
 7 | \alias{xmpl.bclst}
 8 | \alias{xmpl.panmat}
 9 | \title{Data sets for use in examples}
10 | \usage{
11 | data(xmpl.bdist)
12 | data(xmpl.bclst)
13 | data(xmpl.panmat)
14 | }
15 | \description{
16 | This data set contains several files with various objects used in examples
17 | in some of the functions in the \code{micropan} package.
18 | }
19 | \details{
20 | \samp{xmpl.bdist} is a \code{tibble} with 4 columns holding all
21 | BLAST distances between pairs of proteins in an example with 10 small genomes.
22 | 
23 | \samp{xmpl.bclst} is a clustering vector of all proteins in the
24 | genomes from \samp{xmpl.bdist}.
25 | 
26 | \samp{xmpl.panmat} is a pan-matrix with 10 rows and 1210 columns
27 | computed from \samp{xmpl.bclst}.
28 | }
29 | \examples{
30 | 
31 | # BLAST distances, only the first 20 are displayed
32 | data(xmpl.bdist)
33 | head(xmpl.bdist)
34 | 
35 | # Clustering vector
36 | data(xmpl.bclst)
37 | print(xmpl.bclst[1:30])
38 | 
39 | # Pan-matrix
40 | data(xmpl.panmat)
41 | head(xmpl.panmat)
42 | 
43 | }
44 | \author{
45 | Lars Snipen and Kristian Hovde Liland.
46 | }
47 | 


--------------------------------------------------------------------------------
/man/xz.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/xz.R
 3 | \name{xzcompress}
 4 | \alias{xzcompress}
 5 | \alias{xzuncompress}
 6 | \title{Compressing and uncompressing text files}
 7 | \usage{
 8 | xzcompress(
 9 |   filename,
10 |   destname = sprintf("\%s.xz", filename),
11 |   temporary = FALSE,
12 |   skip = FALSE,
13 |   overwrite = FALSE,
14 |   remove = TRUE,
15 |   BFR.SIZE = 1e+07,
16 |   compression = 6,
17 |   ...
18 | )
19 | 
20 | xzuncompress(
21 |   filename,
22 |   destname = gsub("[.]xz$", "", filename, ignore.case = TRUE),
23 |   temporary = FALSE,
24 |   skip = FALSE,
25 |   overwrite = FALSE,
26 |   remove = TRUE,
27 |   BFR.SIZE = 1e+07,
28 |   ...
29 | )
30 | }
31 | \arguments{
32 | \item{filename}{Path name of input file.}
33 | 
34 | \item{destname}{Pathname of output file.}
35 | 
36 | \item{temporary}{If TRUE, the output file is created in a temporary directory.}
37 | 
38 | \item{skip}{If TRUE and the output file already exists, the output file is returned as is.}
39 | 
40 | \item{overwrite}{If TRUE and the output file already exists, the file is silently overwritting,
41 | otherwise an exception is thrown (unless skip is TRUE).}
42 | 
43 | \item{remove}{If TRUE, the input file is removed afterward, otherwise not.}
44 | 
45 | \item{BFR.SIZE}{The number of bytes read in each chunk.}
46 | 
47 | \item{compression}{The compression level used (1-9).}
48 | 
49 | \item{...}{Not used.}
50 | }
51 | \value{
52 | Returns the pathname of the output file. The number of bytes processed is returned as an attribute.
53 | }
54 | \description{
55 | These functions are adapted from the \code{R.utils} package from gzip to xz. Internally
56 | \code{xzfile()} (see connections) is used to read (write) chunks to (from) the xz file. If the
57 | process is interrupted before completed, the partially written output file is automatically removed.
58 | }
59 | \examples{
60 | # Creating small file
61 | tf <- tempfile()
62 | cat(file=tf, "Hello world!")
63 | 
64 | # Compressing
65 | tf.xz <- xzcompress(tf)
66 | print(file.info(tf.xz))
67 | 
68 | # Uncompressing
69 | tf <- xzuncompress(tf.xz)
70 | print(file.info(tf))
71 | file.remove(tf)
72 | 
73 | }
74 | \author{
75 | Kristian Hovde Liland.
76 | }
77 | 


--------------------------------------------------------------------------------
/vignettes/vignette-concordance.tex:
--------------------------------------------------------------------------------
1 | \Sconcordance{concordance:vignette.tex:C:/projects/git/micropan/vignettes/vignette.Rnw:%
2 | 1 31 1 1 2 4 0 1 2 4 1 1 2 4 0 1 2 4 1}
3 | 


--------------------------------------------------------------------------------
/vignettes/vignette.Rnw:
--------------------------------------------------------------------------------
 1 | \documentclass{article}
 2 | \usepackage{url,Sweave}
 3 | %\VignetteIndexEntry{The micropan package vignette}
 4 | 
 5 | \title{The \texttt{micropan} package vignette}
 6 | \author{Lars Snipen and Kristian Hovde Liland}
 7 | \date{}
 8 | 
 9 | \begin{document}
10 | \SweaveOpts{concordance=TRUE}
11 | %\SweaveOpts{concordance=TRUE}
12 | 
13 | \maketitle
14 | 
15 | 
16 | \section{Using \texttt{dplyr} and \texttt{stringr}}
17 | A major change in the 2.0 version is the use of generic data structures and functions in R instead of creating package specific ones. This makes it possible to use the power of standard data manipulation tools and visualization that R-users are familiar with.
18 | 
19 | Compared to previous versions some functions have been moved to the \texttt{microseq} package.
20 | 
21 | You will also find no casestudy document or plotting functions. However, if you locate the GitHub site for this package, you will find a tutorial with code making similar plots using \texttt{ggplot} or \texttt{ggdendro}. This is an example of using generic R tools instead of making functions for each special case.
22 | 
23 | \subsection{Faster reading of BLAST results}
24 | A major change in the 2.1 version is faster reading of the BLAST result files, see `?bDist` or the tutorial at GitHub mentioned above for more details. 
25 | 
26 | 
27 | \section{External software}
28 | Some functions in this package calls upons external software that must be available on the system. Some of these are 'installed' by simply downloading a binary executable that you put somewhere proper on your computer. To make such programs visible to R, you typically need to update your \texttt{PATH} environment variable, to specify where these executables are located. Try it out, and use google for help!
29 | 
30 | 
31 | \subsection{Software \texttt{blast+}}
32 | The function \emph{blastpAllAll} uses the free software \texttt{blast+} (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Source code and installers makes it straightforward to install. In the R console the command
33 | <<eval=FALSE>>=
34 | system("blastp -h")
35 | @
36 | should produce some sensible output.
37 | 
38 | 
39 | \subsection{Software \texttt{hmmer}}
40 | The functions \emph{hmmerScan()} uses the free software \texttt{hmmer} (http://hmmer.org/). This software is developed for UNIX systems (e.g. Mac or Linux), and Windows users may find it a little difficult to install and run from R. In the R console the command
41 | <<eval=FALSE>>=
42 | system("hmmscan -h")
43 | @
44 | should produce some sensible output.
45 | 
46 | 
47 | 
48 | \end{document}


--------------------------------------------------------------------------------
/vignettes/vignette.aux:
--------------------------------------------------------------------------------
1 | \relax 
2 | \@writefile{toc}{\contentsline {section}{\numberline {1}Using \texttt  {dplyr} and \texttt  {stringr}}{1}{}\protected@file@percent }
3 | \@writefile{toc}{\contentsline {subsection}{\numberline {1.1}Faster reading of BLAST results}{1}{}\protected@file@percent }
4 | \@writefile{toc}{\contentsline {section}{\numberline {2}External software}{1}{}\protected@file@percent }
5 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.1}Software \texttt  {blast+}}{1}{}\protected@file@percent }
6 | \@writefile{toc}{\contentsline {subsection}{\numberline {2.2}Software \texttt  {hmmer}}{2}{}\protected@file@percent }
7 | \gdef \@abspage@last{2}
8 | 


--------------------------------------------------------------------------------
/vignettes/vignette.log:
--------------------------------------------------------------------------------
  1 | This is pdfTeX, Version 3.14159265-2.6-1.40.21 (MiKTeX 20.10) (preloaded format=pdflatex 2020.10.12)  9 NOV 2020 16:22
  2 | entering extended mode
  3 | **C:/projects/git/micropan/vignettes/vignette.tex
  4 | (C:/projects/git/micropan/vignettes/vignette.tex
  5 | LaTeX2e <2020-10-01> patch level 1
  6 | L3 programming layer <2020-10-05> xparse <2020-03-03> ("C:\Program Files\MiKTeX
  7 | \tex/latex/base\article.cls"
  8 | Document Class: article 2020/04/10 v1.4m Standard LaTeX document class
  9 | ("C:\Program Files\MiKTeX\tex/latex/base\size10.clo"
 10 | File: size10.clo 2020/04/10 v1.4m Standard LaTeX file (size option)
 11 | )
 12 | \c@part=\count175
 13 | \c@section=\count176
 14 | \c@subsection=\count177
 15 | \c@subsubsection=\count178
 16 | \c@paragraph=\count179
 17 | \c@subparagraph=\count180
 18 | \c@figure=\count181
 19 | \c@table=\count182
 20 | \abovecaptionskip=\skip47
 21 | \belowcaptionskip=\skip48
 22 | \bibindent=\dimen138
 23 | ) ("C:\Program Files\MiKTeX\tex/latex/url\url.sty"
 24 | \Urlmuskip=\muskip16
 25 | Package: url 2013/09/16  ver 3.4  Verb mode for urls, etc.
 26 | ) (C:/PROGRA~1/R/R-40~1.3/share/texmf/tex/latex\Sweave.sty
 27 | Package: Sweave 
 28 | ("C:\Program Files\MiKTeX\tex/latex/base\ifthen.sty"
 29 | Package: ifthen 2014/09/29 v1.1c Standard LaTeX ifthen package (DPC)
 30 | ) ("C:\Program Files\MiKTeX\tex/latex/graphics\graphicx.sty"
 31 | Package: graphicx 2020/09/09 v1.2b Enhanced LaTeX Graphics (DPC,SPQR)
 32 | ("C:\Program Files\MiKTeX\tex/latex/graphics\keyval.sty"
 33 | Package: keyval 2014/10/28 v1.15 key=value parser (DPC)
 34 | \KV@toks@=\toks15
 35 | ) ("C:\Program Files\MiKTeX\tex/latex/graphics\graphics.sty"
 36 | Package: graphics 2020/08/30 v1.4c Standard LaTeX Graphics (DPC,SPQR)
 37 | ("C:\Program Files\MiKTeX\tex/latex/graphics\trig.sty"
 38 | Package: trig 2016/01/03 v1.10 sin cos tan (DPC)
 39 | ) ("C:\Program Files\MiKTeX\tex/latex/graphics-cfg\graphics.cfg"
 40 | File: graphics.cfg 2016/06/04 v1.11 sample graphics configuration
 41 | )
 42 | Package graphics Info: Driver file: pdftex.def on input line 105.
 43 | ("C:\Program Files\MiKTeX\tex/latex/graphics-def\pdftex.def"
 44 | File: pdftex.def 2020/10/05 v1.2a Graphics/color driver for pdftex
 45 | ))
 46 | \Gin@req@height=\dimen139
 47 | \Gin@req@width=\dimen140
 48 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/fancyvrb\fancyvrb.sty
 49 | Package: fancyvrb 2020/05/03 v3.6 verbatim text (tvz,hv)
 50 | \FV@CodeLineNo=\count183
 51 | \FV@InFile=\read2
 52 | \FV@TabBox=\box47
 53 | \c@FancyVerbLine=\count184
 54 | \FV@StepNumber=\count185
 55 | \FV@OutFile=\write3
 56 | ) ("C:\Program Files\MiKTeX\tex/latex/base\textcomp.sty"
 57 | Package: textcomp 2020/02/02 v2.0n Standard LaTeX package
 58 | ) ("C:\Program Files\MiKTeX\tex/latex/base\fontenc.sty"
 59 | Package: fontenc 2020/08/10 v2.0s Standard LaTeX package
 60 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/ae\ae.sty
 61 | Package: ae 2001/02/12 1.3 Almost European Computer Modern
 62 | ("C:\Program Files\MiKTeX\tex/latex/base\fontenc.sty"
 63 | Package: fontenc 2020/08/10 v2.0s Standard LaTeX package
 64 | )))
 65 | LaTeX Font Info:    Trying to load font information for T1+aer on input line 9.
 66 | 
 67 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/ae\t1aer.fd
 68 | File: t1aer.fd 1997/11/16 Font definitions for T1/aer.
 69 | ) ("C:\Program Files\MiKTeX\tex/latex/l3backend\l3backend-pdftex.def"
 70 | File: l3backend-pdftex.def 2020-09-24 L3 backend support: PDF output (pdfTeX)
 71 | \l__kernel_color_stack_int=\count186
 72 | \l__pdf_internal_box=\box48
 73 | ) (vignette.aux)
 74 | \openout1 = `vignette.aux'.
 75 | 
 76 | LaTeX Font Info:    Checking defaults for OML/cmm/m/it on input line 9.
 77 | LaTeX Font Info:    ... okay on input line 9.
 78 | LaTeX Font Info:    Checking defaults for OMS/cmsy/m/n on input line 9.
 79 | LaTeX Font Info:    ... okay on input line 9.
 80 | LaTeX Font Info:    Checking defaults for OT1/cmr/m/n on input line 9.
 81 | LaTeX Font Info:    ... okay on input line 9.
 82 | LaTeX Font Info:    Checking defaults for T1/cmr/m/n on input line 9.
 83 | LaTeX Font Info:    ... okay on input line 9.
 84 | LaTeX Font Info:    Checking defaults for TS1/cmr/m/n on input line 9.
 85 | LaTeX Font Info:    ... okay on input line 9.
 86 | LaTeX Font Info:    Checking defaults for OMX/cmex/m/n on input line 9.
 87 | LaTeX Font Info:    ... okay on input line 9.
 88 | LaTeX Font Info:    Checking defaults for U/cmr/m/n on input line 9.
 89 | LaTeX Font Info:    ... okay on input line 9.
 90 | ("C:\Program Files\MiKTeX\tex/context/base/mkii\supp-pdf.mkii"
 91 | [Loading MPS to PDF converter (version 2006.09.02).]
 92 | \scratchcounter=\count187
 93 | \scratchdimen=\dimen141
 94 | \scratchbox=\box49
 95 | \nofMPsegments=\count188
 96 | \nofMParguments=\count189
 97 | \everyMPshowfont=\toks16
 98 | \MPscratchCnt=\count190
 99 | \MPscratchDim=\dimen142
100 | \MPnumerator=\count191
101 | \makeMPintoPDFobject=\count192
102 | \everyMPtoPDFconversion=\toks17
103 | ) ("C:\Program Files\MiKTeX\tex/latex/epstopdf-pkg\epstopdf-base.sty"
104 | Package: epstopdf-base 2020-01-24 v2.11 Base part for package epstopdf
105 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/infwarerr\infwarerr.sty
106 | Package: infwarerr 2019/12/03 v1.5 Providing info/warning/error messages (HO)
107 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/grfext\grfext.sty
108 | Package: grfext 2019/12/03 v1.3 Manage graphics extensions (HO)
109 | 
110 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/kvdefinekeys\kvdefinekeys.s
111 | ty
112 | Package: kvdefinekeys 2019-12-19 v1.6 Define keys (HO)
113 | )) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/kvoptions\kvoptions.sty
114 | Package: kvoptions 2019/11/29 v3.13 Key value format for package options (HO)
115 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/ltxcmds\ltxcmds.sty
116 | Package: ltxcmds 2020-05-10 v1.25 LaTeX kernel commands for general use (HO)
117 | ) (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/generic/kvsetkeys\kvsetkeys.sty
118 | Package: kvsetkeys 2019/12/15 v1.18 Key value parser (HO)
119 | )) ("C:\Program Files\MiKTeX\tex/generic/pdftexcmds\pdftexcmds.sty"
120 | Package: pdftexcmds 2020-06-27 v0.33 Utility functions of pdfTeX for LuaTeX (HO
121 | )
122 | ("C:\Program Files\MiKTeX\tex/generic/iftex\iftex.sty"
123 | Package: iftex 2020/03/06 v1.0d TeX engine tests
124 | )
125 | Package pdftexcmds Info: \pdf@primitive is available.
126 | Package pdftexcmds Info: \pdf@ifprimitive is available.
127 | Package pdftexcmds Info: \pdfdraftmode found.
128 | )
129 | Package epstopdf-base Info: Redefining graphics rule for `.eps' on input line 4
130 | 85.
131 | Package grfext Info: Graphics extension search list:
132 | (grfext)             [.pdf,.png,.jpg,.mps,.jpeg,.jbig2,.jb2,.PDF,.PNG,.JPG,.JPE
133 | G,.JBIG2,.JB2,.eps]
134 | (grfext)             \AppendGraphicsExtensions on input line 504.
135 | ) (vignette-concordance.tex)
136 | LaTeX Font Info:    Trying to load font information for T1+aett on input line 1
137 | 3.
138 | (C:\Users\larssn\AppData\Roaming\MiKTeX\tex/latex/ae\t1aett.fd
139 | File: t1aett.fd 1997/11/16 Font definitions for T1/aett.
140 | )
141 | LaTeX Font Info:    External font `cmex10' loaded for size
142 | (Font)              <12> on input line 13.
143 | LaTeX Font Info:    External font `cmex10' loaded for size
144 | (Font)              <8> on input line 13.
145 | LaTeX Font Info:    External font `cmex10' loaded for size
146 | (Font)              <6> on input line 13.
147 | 
148 | LaTeX Font Warning: Font shape `T1/aett/b/n' undefined
149 | (Font)              using `T1/aett/m/n' instead on input line 16.
150 | 
151 | 
152 | Overfull \hbox (179.69289pt too wide) in paragraph at lines 32--34
153 | \T1/aer/m/n/10 The func-tion \T1/aer/m/it/10 blast-pAl-lAll \T1/aer/m/n/10 uses
154 |  the free soft-ware \T1/aett/m/n/10 blast+ \T1/aer/m/n/10 (ftp://ftp.ncbi.nlm.n
155 | ih.gov/blast/executables/blast+/LATEST/).
156 |  []
157 | 
158 | [1
159 | 
160 | {C:/Users/larssn/AppData/Local/MiKTeX/pdftex/config/pdftex.map}] [2] (vignette.
161 | aux)
162 | 
163 | LaTeX Font Warning: Some font shapes were not available, defaults substituted.
164 | 
165 |  ) 
166 | Here is how much of TeX's memory you used:
167 |  2555 strings out of 480236
168 |  37894 string characters out of 2890387
169 |  307402 words of memory out of 3000000
170 |  19045 multiletter control sequences out of 15000+200000
171 |  559697 words of font info for 80 fonts, out of 3000000 for 9000
172 |  1141 hyphenation exceptions out of 8191
173 |  72i,6n,77p,487b,225s stack positions out of 5000i,500n,10000p,200000b,50000s
174 | <C:/Program Files/MiKTeX/fonts/type1/public/amsfonts/cm/cmb10.pfb><C:/Program
175 |  Files/MiKTeX/fonts/type1/public/amsfonts/cm/cmr10.pfb><C:/Program Files/MiKTeX
176 | /fonts/type1/public/amsfonts/cm/cmr12.pfb><C:/Program Files/MiKTeX/fonts/type1/
177 | public/amsfonts/cm/cmr17.pfb><C:/Program Files/MiKTeX/fonts/type1/public/amsfon
178 | ts/cm/cmsltt10.pfb><C:/Program Files/MiKTeX/fonts/type1/public/amsfonts/cm/cmti
179 | 10.pfb><C:/Program Files/MiKTeX/fonts/type1/public/amsfonts/cm/cmtt10.pfb><C:/P
180 | rogram Files/MiKTeX/fonts/type1/public/amsfonts/cm/cmtt12.pfb>
181 | Output written on vignette.pdf (2 pages, 100326 bytes).
182 | PDF statistics:
183 |  42 PDF objects out of 1000 (max. 8388607)
184 |  0 named destinations out of 1000 (max. 500000)
185 |  5 words of extra memory for PDF output out of 10000 (max. 10000000)
186 | 
187 | 


--------------------------------------------------------------------------------
/vignettes/vignette.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/larssnip/micropan/4bd8f5028af36d06b7e2df729e4d0b8050928914/vignettes/vignette.pdf


--------------------------------------------------------------------------------
/vignettes/vignette.tex:
--------------------------------------------------------------------------------
 1 | \documentclass{article}
 2 | \usepackage{url,Sweave}
 3 | %\VignetteIndexEntry{The micropan package vignette}
 4 | 
 5 | \title{The \texttt{micropan} package vignette}
 6 | \author{Lars Snipen and Kristian Hovde Liland}
 7 | \date{}
 8 | 
 9 | \begin{document}
10 | \input{vignette-concordance}
11 | %\SweaveOpts{concordance=TRUE}
12 | 
13 | \maketitle
14 | 
15 | 
16 | \section{Using \texttt{dplyr} and \texttt{stringr}}
17 | A major change in the 2.0 version is the use of generic data structures and functions in R instead of creating package specific ones. This makes it possible to use the power of standard data manipulation tools and visualization that R-users are familiar with.
18 | 
19 | Compared to previous versions some functions have been moved to the \texttt{microseq} package.
20 | 
21 | You will also find no casestudy document or plotting functions. However, if you locate the GitHub site for this package, you will find a tutorial with code making similar plots using \texttt{ggplot} or \texttt{ggdendro}. This is an example of using generic R tools instead of making functions for each special case.
22 | 
23 | \subsection{Faster reading of BLAST results}
24 | A major change in the 2.1 version is faster reading of the BLAST result files, see `?bDist` or the tutorial at GitHub mentioned above for more details. 
25 | 
26 | 
27 | \section{External software}
28 | Some functions in this package calls upons external software that must be available on the system. Some of these are 'installed' by simply downloading a binary executable that you put somewhere proper on your computer. To make such programs visible to R, you typically need to update your \texttt{PATH} environment variable, to specify where these executables are located. Try it out, and use google for help!
29 | 
30 | 
31 | \subsection{Software \texttt{blast+}}
32 | The function \emph{blastpAllAll} uses the free software \texttt{blast+} (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Source code and installers makes it straightforward to install. In the R console the command
33 | \begin{Schunk}
34 | \begin{Sinput}
35 | > system("blastp -h")
36 | \end{Sinput}
37 | \end{Schunk}
38 | should produce some sensible output.
39 | 
40 | 
41 | \subsection{Software \texttt{hmmer}}
42 | The functions \emph{hmmerScan()} uses the free software \texttt{hmmer} (http://hmmer.org/). This software is developed for UNIX systems (e.g. Mac or Linux), and Windows users may find it a little difficult to install and run from R. In the R console the command
43 | \begin{Schunk}
44 | \begin{Sinput}
45 | > system("hmmscan -h")
46 | \end{Sinput}
47 | \end{Schunk}
48 | should produce some sensible output.
49 | 
50 | 
51 | 
52 | \end{document}
53 | 


--------------------------------------------------------------------------------