├── .github ├── .gitignore └── ISSUE_TEMPLATE │ ├── feature_request.md │ └── bug_report.md ├── src ├── .gitignore ├── lumpyCluster.h ├── treemer.h ├── minEntropy.cpp ├── searchNode.cpp ├── minEntropy.h └── searchTree.cpp ├── data ├── h3n2_tree.rda ├── zikv_tree.rda ├── h3n2_align.rda ├── sars2_align.rda ├── sars2_tree.rda ├── zikv_align.rda ├── h3n2_tree_reduced.rda ├── zikv_tree_reduced.rda ├── h3n2_align_reduced.rda └── zikv_align_reduced.rda ├── tests ├── testthat.R └── testthat │ ├── test-as.data.frame.R │ ├── test-similarityMatrix.R │ ├── test-multiprocessing.R │ ├── test-fixationPath.R │ ├── test-paraFixSites.R │ ├── test-allSitesName.R │ ├── test-addMSA.R │ ├── test-reexports.R │ ├── test-sitesMinEntropy.R │ ├── test-plotFunctions.R │ ├── test-groupTips.R │ ├── test-printFunctions.R │ ├── test-plotSingleSite.R │ ├── test-lineagePath.R │ ├── test-SNPsites.R │ ├── test-parallelSites.R │ ├── test-fixationSites.R │ └── test-siteNumbering.R ├── .Rbuildignore ├── man ├── figures │ ├── example-1.png │ ├── example-2.png │ ├── plot_fixSites-1.png │ ├── plot_fixSites-2.png │ ├── unnamed-chunk-1-1.png │ └── unnamed-chunk-1-2.png ├── sars2_tree.Rd ├── sars2_align.Rd ├── h3n2_tree.Rd ├── sitePath-deprecated.Rd ├── similarityMatrix.Rd ├── reexports.Rd ├── h3n2_align.Rd ├── zikv_tree.Rd ├── zikv_align.Rd ├── fixationIndels.Rd ├── plotParallelSites.Rd ├── SNPsites.Rd ├── extractSite.Rd ├── plotFixationSites.Rd ├── fixationPath.Rd ├── allSitesName.Rd ├── sitesMinEntropy.Rd ├── as.data.frame.Rd ├── plotMutSites.Rd ├── setSiteNumbering.Rd ├── groupTips.Rd ├── extractTips.Rd ├── fixationSites.Rd ├── parallelSites.Rd ├── addMSA.Rd ├── plotSingleSite.Rd ├── plotFunctions.Rd ├── paraFixSites.Rd └── lineagePath.Rd ├── .gitignore ├── LICENSE ├── R ├── sitePath-deprecated.R ├── plotParallelSites.R ├── plotFixationSites.R ├── reexports.R ├── allSitesName.R ├── extractSite.R ├── RcppExports.R ├── SNPsites.R ├── extractTips.R ├── printFunctions.R ├── staticData.R ├── fixationSites.R ├── as.data.frame.R ├── paraFixSites.R ├── fixationIndels.R └── groupTips.R ├── DESCRIPTION ├── README.Rmd ├── README.md ├── NAMESPACE └── vignettes └── sitePath.Rmd /.github/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | -------------------------------------------------------------------------------- /src/.gitignore: -------------------------------------------------------------------------------- 1 | *.o 2 | *.so 3 | *.dll 4 | -------------------------------------------------------------------------------- /data/h3n2_tree.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/h3n2_tree.rda -------------------------------------------------------------------------------- /data/zikv_tree.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/zikv_tree.rda -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(sitePath) 3 | 4 | test_check("sitePath") 5 | -------------------------------------------------------------------------------- /data/h3n2_align.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/h3n2_align.rda -------------------------------------------------------------------------------- /data/sars2_align.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/sars2_align.rda -------------------------------------------------------------------------------- /data/sars2_tree.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/sars2_tree.rda -------------------------------------------------------------------------------- /data/zikv_align.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/zikv_align.rda -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^\.travis\.yml$ 2 | ^.*\.Rproj$ 3 | ^\.Rproj\.user$ 4 | ^\.github$ 5 | ^README\.Rmd$ 6 | -------------------------------------------------------------------------------- /data/h3n2_tree_reduced.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/h3n2_tree_reduced.rda -------------------------------------------------------------------------------- /data/zikv_tree_reduced.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/zikv_tree_reduced.rda -------------------------------------------------------------------------------- /man/figures/example-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/man/figures/example-1.png -------------------------------------------------------------------------------- /man/figures/example-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/man/figures/example-2.png -------------------------------------------------------------------------------- /data/h3n2_align_reduced.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/h3n2_align_reduced.rda -------------------------------------------------------------------------------- /data/zikv_align_reduced.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/data/zikv_align_reduced.rda -------------------------------------------------------------------------------- /man/figures/plot_fixSites-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/man/figures/plot_fixSites-1.png -------------------------------------------------------------------------------- /man/figures/plot_fixSites-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/man/figures/plot_fixSites-2.png -------------------------------------------------------------------------------- /man/figures/unnamed-chunk-1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/man/figures/unnamed-chunk-1-1.png -------------------------------------------------------------------------------- /man/figures/unnamed-chunk-1-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wuaipinglab/sitePath/HEAD/man/figures/unnamed-chunk-1-2.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | *.Rproj 6 | inst/doc 7 | vignettes/*.html 8 | vignettes/*.R 9 | .vscode 10 | -------------------------------------------------------------------------------- /tests/testthat/test-as.data.frame.R: -------------------------------------------------------------------------------- 1 | test_that("The function works", { 2 | data(zikv_align_reduced) 3 | data(zikv_tree_reduced) 4 | tr <- addMSA(tree = zikv_tree_reduced, 5 | alignment = zikv_align_reduced) 6 | snp <- SNPsites(tr) 7 | expect_error(as.data.frame(snp), NA) 8 | p <- lineagePath(tr) 9 | fixedSites <- fixationSites(p) 10 | expect_error(as.data.frame(fixedSites), NA) 11 | }) 12 | -------------------------------------------------------------------------------- /tests/testthat/test-similarityMatrix.R: -------------------------------------------------------------------------------- 1 | test_that("Calculate similarity matrix", { 2 | data(zikv_align_reduced) 3 | data(zikv_tree_reduced) 4 | tree <- addMSA(tree = zikv_tree_reduced, 5 | alignment = zikv_align_reduced) 6 | tipNames <- as.phylo(tree)[["tip.label"]] 7 | simMatrix <- similarityMatrix(tree) 8 | expect_identical(colnames(simMatrix), tipNames) 9 | expect_identical(row.names(simMatrix), tipNames) 10 | }) 11 | -------------------------------------------------------------------------------- /man/sars2_tree.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/staticData.R 3 | \docType{data} 4 | \name{sars2_tree} 5 | \alias{sars2_tree} 6 | \title{Phylogenetic tree of SARS-CoV-2 genome CDS} 7 | \format{ 8 | a \code{phylo} object 9 | } 10 | \usage{ 11 | data(sars2_tree) 12 | } 13 | \description{ 14 | Tree was built from \code{\link{sars2_align}} using RAxML 15 | (\url{http://www.exelixis-lab.org/}) with default settings. The tip 16 | \code{EPI_ISL_402125} was used as the outgroup to root the tree. 17 | } 18 | \keyword{datasets} 19 | -------------------------------------------------------------------------------- /tests/testthat/test-multiprocessing.R: -------------------------------------------------------------------------------- 1 | test_that("Multiprocessing works", { 2 | data(h3n2_tree) 3 | data(h3n2_align) 4 | 5 | tr <- addMSA(tree = h3n2_tree, alignment = h3n2_align) 6 | p <- lineagePath(tr) 7 | minEntropy <- sitesMinEntropy(p) 8 | 9 | options(cl.cores = 2) 10 | tr_mp <- addMSA(tree = h3n2_tree, alignment = h3n2_align) 11 | p_mp <- lineagePath(tr) 12 | minEntropy_mp <- sitesMinEntropy(p_mp) 13 | options(cl.cores = NULL) 14 | 15 | expect_identical(tr, tr_mp) 16 | expect_identical(minEntropy, minEntropy_mp) 17 | }) 18 | -------------------------------------------------------------------------------- /man/sars2_align.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/staticData.R 3 | \docType{data} 4 | \name{sars2_align} 5 | \alias{sars2_align} 6 | \title{Multiple sequence alignment of SARS-CoV-2 genome CDS} 7 | \format{ 8 | an \code{alignment} object 9 | } 10 | \usage{ 11 | data(sars2_align) 12 | } 13 | \description{ 14 | The raw sequences were downloaded from GISAID database 15 | (\url{https://www.gisaid.org/}) and aligned using MAFFT 16 | (\url{https://mafft.cbrc.jp/alignment/software/}) with default settings. 17 | } 18 | \keyword{datasets} 19 | -------------------------------------------------------------------------------- /man/h3n2_tree.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/staticData.R 3 | \docType{data} 4 | \name{h3n2_tree} 5 | \alias{h3n2_tree} 6 | \alias{h3n2_tree_reduced} 7 | \title{Phylogenetic tree of H3N2's HA protein} 8 | \format{ 9 | a \code{phylo} object 10 | 11 | a \code{phylo} object 12 | } 13 | \usage{ 14 | data(h3n2_tree) 15 | 16 | data(h3n2_tree_reduced) 17 | } 18 | \description{ 19 | Tree was built from \code{\link{h3n2_align}} using RAxML 20 | (\url{http://www.exelixis-lab.org/}) with default settings. 21 | 22 | \code{h3n2_tree_reduced} is a truncated version of 23 | \code{h3n2_tree} 24 | } 25 | \keyword{datasets} 26 | -------------------------------------------------------------------------------- /man/sitePath-deprecated.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/sitePath-deprecated.R 3 | \name{sitePath-deprecated} 4 | \alias{sitePath-deprecated} 5 | \alias{multiFixationSites} 6 | \title{Deprecated functions in package \sQuote{sitePath}} 7 | \description{ 8 | These functions are provided for compatibility with older 9 | versions of \sQuote{sitePath} only, and will be defunct at the next 10 | release. 11 | } 12 | \details{ 13 | The following functions are deprecated and will be made defunct; use 14 | the replacement indicated below: \itemize{ \item{multiFixationSites: 15 | \code{\link{fixationSites}}} } 16 | } 17 | -------------------------------------------------------------------------------- /man/similarityMatrix.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/groupTips.R 3 | \name{similarityMatrix} 4 | \alias{similarityMatrix} 5 | \title{Similarity between sequences} 6 | \usage{ 7 | similarityMatrix(tree) 8 | } 9 | \arguments{ 10 | \item{tree}{The return from \code{\link{addMSA}} function.} 11 | } 12 | \value{ 13 | A diagonal matrix of similarity between sequences. 14 | } 15 | \description{ 16 | Get similarity between aligned sequences with gap ignored. 17 | } 18 | \examples{ 19 | data(zikv_tree) 20 | data(zikv_align) 21 | tree <- addMSA(zikv_tree, alignment = zikv_align) 22 | simMatrix <- similarityMatrix(tree) 23 | } 24 | -------------------------------------------------------------------------------- /tests/testthat/test-fixationPath.R: -------------------------------------------------------------------------------- 1 | test_that("The output is valid phylo", { 2 | data(h3n2_tree_reduced) 3 | data(h3n2_align_reduced) 4 | tree <- addMSA(tree = h3n2_tree_reduced, 5 | alignment = h3n2_align_reduced) 6 | nTips <- length(as.phylo(tree)$tip.label) 7 | paths <- lineagePath(tree) 8 | mutations <- fixationSites(paths) 9 | expect_false(any(duplicated(unlist( 10 | attr(mutations, "clustersByPath") 11 | )))) 12 | x <- fixationPath(mutations) 13 | tr <- attr(x, "SNPtracing")@phylo 14 | checkOutput <- capture.output(ape::checkValidPhylo(tr)) 15 | expect_false(any(grepl("FATAL", checkOutput))) 16 | }) 17 | -------------------------------------------------------------------------------- /tests/testthat/test-paraFixSites.R: -------------------------------------------------------------------------------- 1 | test_that("The function works", { 2 | data(zikv_tree) 3 | data(zikv_align) 4 | 5 | zikv_paths <- addMSA(zikv_tree, alignment = zikv_align) 6 | zikv_entropy <- sitesMinEntropy(zikv_paths) 7 | 8 | expect_identical(paraFixSites(zikv_tree, alignment = zikv_align), 9 | paraFixSites(zikv_paths)) 10 | expect_identical(paraFixSites(zikv_paths), paraFixSites(zikv_entropy)) 11 | expect_error(paraFixSites(zikv_entropy, category = "union"), NA) 12 | expect_error(paraFixSites(zikv_entropy, category = "parallelOnly"), NA) 13 | expect_error(paraFixSites(zikv_entropy, category = "fixationOnly"), NA) 14 | }) 15 | -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/reexports.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{read.tree} 7 | \alias{read.alignment} 8 | \alias{as.phylo} 9 | \alias{as.treedata} 10 | \title{Objects exported from other packages} 11 | \keyword{internal} 12 | \description{ 13 | These objects are imported from other packages. Follow the links 14 | below to see their documentation. 15 | 16 | \describe{ 17 | \item{ape}{\code{\link[ape]{as.phylo}}, \code{\link[ape]{read.tree}}} 18 | 19 | \item{seqinr}{\code{\link[seqinr]{read.alignment}}} 20 | 21 | \item{tidytree}{\code{\link[tidytree]{as.treedata}}} 22 | }} 23 | 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /man/h3n2_align.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/staticData.R 3 | \docType{data} 4 | \name{h3n2_align} 5 | \alias{h3n2_align} 6 | \alias{h3n2_align_reduced} 7 | \title{Multiple sequence alignment of H3N2's HA protein} 8 | \format{ 9 | an \code{alignment} object 10 | 11 | an \code{alignment} object 12 | } 13 | \usage{ 14 | data(h3n2_align) 15 | 16 | data(h3n2_align_reduced) 17 | } 18 | \description{ 19 | The raw protein sequences were downloaded from NCBI database and 20 | aligned using MAFFT (\url{https://mafft.cbrc.jp/alignment/software/}). 21 | 22 | \code{h3n2_align_reduced} is a truncated version of 23 | \code{h3n2_align} 24 | } 25 | \keyword{datasets} 26 | -------------------------------------------------------------------------------- /man/zikv_tree.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/staticData.R 3 | \docType{data} 4 | \name{zikv_tree} 5 | \alias{zikv_tree} 6 | \alias{zikv_tree_reduced} 7 | \title{Phylogenetic tree of Zika virus polyprotein} 8 | \format{ 9 | a \code{phylo} object 10 | 11 | a \code{phylo} object 12 | } 13 | \usage{ 14 | data(zikv_tree) 15 | 16 | data(zikv_tree_reduced) 17 | } 18 | \description{ 19 | Tree was built from \code{\link{zikv_align}} using RAxML 20 | (\url{http://www.exelixis-lab.org/}) with default settings. The tip 21 | ANK57896 was used as outgroup to root the tree. 22 | 23 | \code{zikv_tree_reduced} is a truncated version of 24 | \code{zikv_tree} 25 | } 26 | \keyword{datasets} 27 | -------------------------------------------------------------------------------- /man/zikv_align.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/staticData.R 3 | \docType{data} 4 | \name{zikv_align} 5 | \alias{zikv_align} 6 | \alias{zikv_align_reduced} 7 | \title{Multiple sequence alignment of Zika virus polyprotein} 8 | \format{ 9 | an \code{alignment} object 10 | 11 | an \code{alignment} object 12 | } 13 | \usage{ 14 | data(zikv_align) 15 | 16 | data(zikv_align_reduced) 17 | } 18 | \description{ 19 | The raw protein sequences were downloaded from ViPR database 20 | (\url{https://www.viprbrc.org/}) and aligned using MAFFT 21 | (\url{https://mafft.cbrc.jp/alignment/software/}). with default settings. 22 | 23 | \code{zikv_align_reduced} is a truncated version of 24 | \code{zikv_align} 25 | } 26 | \keyword{datasets} 27 | -------------------------------------------------------------------------------- /man/fixationIndels.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/fixationIndels.R 3 | \name{fixationIndels} 4 | \alias{fixationIndels} 5 | \alias{fixationIndels.sitesMinEntropy} 6 | \title{Fixation indels prediction} 7 | \usage{ 8 | fixationIndels(x, ...) 9 | 10 | \method{fixationIndels}{sitesMinEntropy}(x, ...) 11 | } 12 | \arguments{ 13 | \item{x}{The return from \code{\link{sitesMinEntropy}} function.} 14 | 15 | \item{...}{Other arguments.} 16 | } 17 | \value{ 18 | A \code{fixationIndels} object. 19 | } 20 | \description{ 21 | The fixation of insertions of deletions. 22 | } 23 | \examples{ 24 | data(zikv_tree_reduced) 25 | data(zikv_align_reduced) 26 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 27 | fixationIndels(sitesMinEntropy(tree)) 28 | } 29 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | 12 | A clear and concise description of what the bug is. 13 | 14 | **To Reproduce** 15 | 16 | The tree and sequence alignment file are needed in most cases to reproduce the bug, you can either: 17 | 18 | 1. Use upload the tree and sequence alignment file. 19 | 2. Or the `saveRDS` function to save the R object causing the bug and upload the RDS file. 20 | 3. If the file(s) exceeds the size limit, email those file(s) to chengyang.ji12@alumni.xjtlu.edu.cn. 21 | 22 | **Screenshots** 23 | 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Version info:** 27 | 28 | Run `sessionInfo()` in the R console and paste the output below. 29 | -------------------------------------------------------------------------------- /tests/testthat/test-allSitesName.R: -------------------------------------------------------------------------------- 1 | test_that("The function works", { 2 | data(zikv_tree) 3 | data(zikv_align) 4 | 5 | zikv_tr <- addMSA(zikv_tree, alignment = zikv_align) 6 | 7 | zikv_snp <- SNPsites(zikv_tr) 8 | 9 | sites <- allSitesName(zikv_snp) 10 | expect_type(sites, "character") 11 | expect_equal(length(sites), length(zikv_snp)) 12 | 13 | zikv_p <- lineagePath(zikv_tr) 14 | 15 | zikv_entropy <- sitesMinEntropy(zikv_p) 16 | 17 | zikv_fixed <- fixationSites(zikv_entropy) 18 | sites <- allSitesName(zikv_fixed) 19 | expect_type(sites, "character") 20 | expect_equal(length(sites), length(zikv_fixed)) 21 | 22 | zikv_para <- parallelSites(zikv_entropy, minSNP = 1) 23 | sites <- allSitesName(zikv_para) 24 | expect_type(sites, "character") 25 | expect_equal(length(sites), length(zikv_para)) 26 | }) 27 | -------------------------------------------------------------------------------- /tests/testthat/test-addMSA.R: -------------------------------------------------------------------------------- 1 | test_that("Read from file works", { 2 | tree_file <- system.file("extdata", 3 | "ZIKV.newick", 4 | package = "sitePath") 5 | alignment_file <- system.file("extdata", 6 | "ZIKV.fasta", 7 | package = "sitePath") 8 | tr <- read.tree(tree_file) 9 | expect_error(addMSA(tr, alignment_file), NA) 10 | expect_error(addMSA(tr, msaPath = "")) 11 | 12 | data(zikv_align) 13 | expect_identical(addMSA(tr, alignment_file), 14 | addMSA(tr, alignment_file, alignment = zikv_align)) 15 | }) 16 | 17 | test_that("Read from object works", { 18 | data(zikv_align_reduced) 19 | data(zikv_tree_reduced) 20 | expect_error(addMSA(tree = zikv_tree_reduced, 21 | alignment = zikv_align_reduced), 22 | NA) 23 | 24 | }) 25 | -------------------------------------------------------------------------------- /tests/testthat/test-reexports.R: -------------------------------------------------------------------------------- 1 | data(zikv_align_reduced) 2 | data(zikv_tree_reduced) 3 | tr <- addMSA(tree = zikv_tree_reduced, 4 | alignment = zikv_align_reduced) 5 | p <- lineagePath(tr) 6 | fixedSites <- fixationSites(p) 7 | 8 | test_that("The reexported 'as.phylo' function works", { 9 | if (ape::is.binary(zikv_tree_reduced)) { 10 | expect_equal(zikv_tree_reduced, as.phylo(tr)) 11 | expect_equal(zikv_tree_reduced, as.phylo(p)) 12 | expect_equal(zikv_tree_reduced, as.phylo(fixedSites)) 13 | } else { 14 | resolved <- ape::multi2di(zikv_tree_reduced, random = FALSE) 15 | expect_equal(resolved, as.phylo(tr)) 16 | expect_equal(resolved, as.phylo(p)) 17 | expect_equal(resolved, as.phylo(fixedSites)) 18 | } 19 | }) 20 | 21 | test_that("The reexported 'as.treedata' function works", { 22 | expect_error(as.treedata(fixedSites), NA) 23 | fp <- fixationPath(fixedSites) 24 | expect_error(as.treedata(fp), NA) 25 | }) 26 | -------------------------------------------------------------------------------- /man/plotParallelSites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotParallelSites.R 3 | \name{plotParallelSites} 4 | \alias{plotParallelSites} 5 | \alias{plotParallelSites.parallelSites} 6 | \alias{plotParallelSites.paraFixSites} 7 | \title{Plot the result of fixation sites} 8 | \usage{ 9 | plotParallelSites(x, ...) 10 | 11 | \method{plotParallelSites}{parallelSites}(x, site = NULL, ...) 12 | 13 | \method{plotParallelSites}{paraFixSites}(x, site = NULL, ...) 14 | } 15 | \arguments{ 16 | \item{x}{return from \code{\link{paraFixSites}}} 17 | 18 | \item{...}{further arguments passed to or from other methods.} 19 | 20 | \item{site}{the number of the site according to 21 | \code{\link{setSiteNumbering}}} 22 | } 23 | \value{ 24 | A \code{ggplot} object. 25 | } 26 | \description{ 27 | Visualize the results of \code{\link{paraFixSites}} 28 | } 29 | \examples{ 30 | data(zikv_tree) 31 | data(zikv_align) 32 | paraFix <- paraFixSites(zikv_tree, alignment = zikv_align) 33 | plotParallelSites(paraFix) 34 | } 35 | -------------------------------------------------------------------------------- /man/SNPsites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SNPsites.R 3 | \name{SNPsites} 4 | \alias{SNPsites} 5 | \alias{SNPsites.phyMSAmatched} 6 | \title{Finding sites with variation} 7 | \usage{ 8 | SNPsites(tree, ...) 9 | 10 | \method{SNPsites}{phyMSAmatched}(tree, minSNP = NULL, ...) 11 | } 12 | \arguments{ 13 | \item{tree}{A \code{\link{phyMSAmatched}} object.} 14 | 15 | \item{...}{Other arguments} 16 | 17 | \item{minSNP}{Minimum number of a mutation to be a SNP. The default is 10th 18 | of the total tree tips.} 19 | } 20 | \value{ 21 | A \code{SNPsites} object. 22 | } 23 | \description{ 24 | Single nucleotide polymorphism (SNP) in the whole package refers 25 | to variation of amino acid. \code{SNPsite} will try to find SNP in the 26 | multiple sequence alignment. A reference sequence and gap character may be 27 | specified to number the site. 28 | } 29 | \examples{ 30 | data(zikv_tree_reduced) 31 | data(zikv_align_reduced) 32 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 33 | SNPsites(tree) 34 | } 35 | -------------------------------------------------------------------------------- /tests/testthat/test-sitesMinEntropy.R: -------------------------------------------------------------------------------- 1 | test_that("The return value contains correct extra info", { 2 | data(h3n2_tree_reduced) 3 | data(h3n2_align_reduced) 4 | tr <- addMSA(tree = h3n2_tree_reduced, 5 | alignment = h3n2_align_reduced) 6 | p <- lineagePath(tr) 7 | minEntropy <- sitesMinEntropy(p) 8 | # The result is grouped by path 9 | expect_true(length(p) == length(minEntropy)) 10 | for (segs in minEntropy) { 11 | for (seg in segs) { 12 | for (node in names(seg)) { 13 | tips <- seg[[node]] 14 | # The major amino acid/nucleotide should be same as the fixed 15 | # and the node names should be the same 16 | dominantAA <- 17 | names(which.max(attr(tips, "aaSummary"))) 18 | if (is.null(attr(tips, "toMerge"))) { 19 | expect_true(dominantAA == attr(tips, "AA")) 20 | } 21 | expect_true(node == attr(tips, "node")) 22 | } 23 | } 24 | } 25 | }) 26 | -------------------------------------------------------------------------------- /man/extractSite.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/extractSite.R 3 | \name{extractSite} 4 | \alias{extractSite} 5 | \alias{extractSite.fixationSites} 6 | \title{Extract tips for a single site} 7 | \usage{ 8 | extractSite(x, site, ...) 9 | 10 | \method{extractSite}{fixationSites}(x, site, ...) 11 | } 12 | \arguments{ 13 | \item{x}{A \code{fixationSites} or a \code{parallelSites} object. More type 14 | will be supported in the later version.} 15 | 16 | \item{site}{A site included in the result.} 17 | 18 | \item{...}{Other arguments} 19 | } 20 | \value{ 21 | The predicted result of a single site 22 | } 23 | \description{ 24 | The functions in \code{sitePath} usually include the results on 25 | more than one site. The function \code{extractSite} can be used to extract 26 | the predicted result on a single site. 27 | } 28 | \examples{ 29 | data(zikv_tree_reduced) 30 | data(zikv_align_reduced) 31 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 32 | mutations <- fixationSites(lineagePath(tree)) 33 | extractSite(mutations, 139) 34 | } 35 | -------------------------------------------------------------------------------- /man/plotFixationSites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotFixationSites.R 3 | \name{plotFixationSites} 4 | \alias{plotFixationSites} 5 | \alias{plotFixationSites.fixationSites} 6 | \alias{plotFixationSites.paraFixSites} 7 | \title{Plot the result of fixation sites} 8 | \usage{ 9 | plotFixationSites(x, ...) 10 | 11 | \method{plotFixationSites}{fixationSites}(x, site = NULL, ...) 12 | 13 | \method{plotFixationSites}{paraFixSites}(x, site = NULL, ...) 14 | } 15 | \arguments{ 16 | \item{x}{return from \code{\link{paraFixSites}}} 17 | 18 | \item{...}{further arguments passed to or from other methods.} 19 | 20 | \item{site}{the number of the site according to 21 | \code{\link{setSiteNumbering}}. If not provided, all sites will be plotted 22 | as labels on the tree} 23 | } 24 | \value{ 25 | A \code{ggplot} object. 26 | } 27 | \description{ 28 | Visualize the results of \code{\link{paraFixSites}} 29 | } 30 | \examples{ 31 | data(zikv_tree_reduced) 32 | data(zikv_align_reduced) 33 | paraFix <- paraFixSites(zikv_tree_reduced, alignment = zikv_align_reduced) 34 | plotFixationSites(paraFix) 35 | } 36 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Chengyang Ji 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /tests/testthat/test-plotFunctions.R: -------------------------------------------------------------------------------- 1 | test_that("The plot functions work", { 2 | data(h3n2_align_reduced) 3 | data(h3n2_tree_reduced) 4 | 5 | paths <- addMSA(h3n2_tree_reduced, 6 | alignment = h3n2_align_reduced) 7 | expect_error(plot(paths), NA) 8 | expect_error(plotMutSites(paths), NA) 9 | 10 | minEntropy <- sitesMinEntropy(paths) 11 | 12 | fixedSites <- fixationSites(minEntropy) 13 | expect_error(plot(fixedSites), NA) 14 | expect_error(plotMutSites(fixedSites), NA) 15 | 16 | # Test the constrain for 'sitePath' 17 | for (site in allSitesName(fixedSites)) { 18 | sp <- extractSite(fixedSites, site) 19 | expect_error(plot(sp, select = length(sp) + 1)) 20 | } 21 | 22 | paraSites <- parallelSites(minEntropy) 23 | expect_error(plot(paraSites), NA) 24 | expect_error(plotMutSites(paraSites), NA) 25 | 26 | categories <- c("intersect", "union", "parallelOnly", "fixationOnly") 27 | for (mutMode in categories) { 28 | pf <- paraFixSites(minEntropy) 29 | expect_error(plotMutSites(pf), NA) 30 | } 31 | 32 | fp <- fixationPath(fixedSites) 33 | expect_error(plot(fp), NA) 34 | }) 35 | -------------------------------------------------------------------------------- /tests/testthat/test-groupTips.R: -------------------------------------------------------------------------------- 1 | test_that("The grouped tips include all tree tips", { 2 | data(zikv_align_reduced) 3 | data(zikv_tree_reduced) 4 | tree <- addMSA(tree = zikv_tree_reduced, 5 | alignment = zikv_align_reduced) 6 | tipNames <- as.phylo(tree)[["tip.label"]] 7 | for (s in seq(0.1, 0.05, length.out = 5)) { 8 | # The grouping from the divergent nodes 9 | paths <- lineagePath(tree, similarity = s) 10 | grouped <- groupTips(paths) 11 | allTips <- unlist(grouped) 12 | names(allTips) <- NULL 13 | expect_identical(sort(allTips), sort(tipNames)) 14 | # The grouping from the fixation mutation 15 | minEntropy <- sitesMinEntropy(paths) 16 | grouped <- groupTips(minEntropy) 17 | allTips <- unlist(grouped) 18 | names(allTips) <- NULL 19 | expect_identical(sort(allTips), sort(tipNames)) 20 | # The grouping from the SNP tracing 21 | snpTracing <- fixationPath(minEntropy, minEffectiveSize = 0) 22 | grouped <- groupTips(snpTracing) 23 | allTips <- unlist(grouped) 24 | names(allTips) <- NULL 25 | expect_identical(sort(allTips), sort(tipNames)) 26 | } 27 | }) 28 | -------------------------------------------------------------------------------- /tests/testthat/test-printFunctions.R: -------------------------------------------------------------------------------- 1 | test_that("The print functions work", { 2 | data(zikv_align_reduced) 3 | data(zikv_tree_reduced) 4 | 5 | expect_error(capture.output(print(zikv_align_reduced)), NA) 6 | 7 | tr <- addMSA(zikv_tree_reduced, 8 | alignment = zikv_align_reduced) 9 | expect_error(capture.output(print(tr)), NA) 10 | 11 | snp <- SNPsites(tr) 12 | expect_error(capture.output(print(snp)), NA) 13 | 14 | p <- lineagePath(tr) 15 | expect_error(capture.output(print(p)), NA) 16 | 17 | snp2 <- SNPsites(p) 18 | expect_error(capture.output(print(snp2)), NA) 19 | 20 | minEntropy <- sitesMinEntropy(p) 21 | expect_error(capture.output(print(minEntropy)), NA) 22 | 23 | fixedSites <- fixationSites(minEntropy) 24 | expect_error(capture.output(print(fixedSites)), NA) 25 | 26 | sp <- extractSite(fixedSites, 139) 27 | expect_error(capture.output(print(sp)), NA) 28 | 29 | fp <- fixationPath(fixedSites) 30 | expect_error(capture.output(print(fp)), NA) 31 | 32 | paraSites <- parallelSites(minEntropy, minSNP = 1) 33 | expect_error(capture.output(print(paraSites)), NA) 34 | 35 | para <- extractSite(paraSites, 106) 36 | expect_error(capture.output(print(para)), NA) 37 | }) 38 | -------------------------------------------------------------------------------- /man/fixationPath.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/fixationPath.R 3 | \name{fixationPath} 4 | \alias{fixationPath} 5 | \alias{fixationPath.sitesMinEntropy} 6 | \alias{fixationPath.fixationSites} 7 | \title{Accumulation of fixed mutation as a tree} 8 | \usage{ 9 | fixationPath(x, ...) 10 | 11 | \method{fixationPath}{sitesMinEntropy}(x, minEffectiveSize = NULL, ...) 12 | 13 | \method{fixationPath}{fixationSites}(x, minEffectiveSize = NULL, ...) 14 | } 15 | \arguments{ 16 | \item{x}{The return from \code{\link{fixationSites}} function.} 17 | 18 | \item{...}{Further arguments passed to or from other methods.} 19 | 20 | \item{minEffectiveSize}{The minimum size for a tip cluster.} 21 | } 22 | \value{ 23 | An \code{fixationPath} object 24 | } 25 | \description{ 26 | The tips are clustered according to the fixation sites. The 27 | transition of fixation sites will be plotted as a phylogenetic tree. The 28 | length of each branch represents the number of fixation mutation between 29 | two clusters. The name of the tree tips indicate the number of sequences in 30 | the cluster. 31 | } 32 | \examples{ 33 | data(zikv_tree_reduced) 34 | data(zikv_align_reduced) 35 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 36 | paths <- lineagePath(tree) 37 | mutations <- fixationSites(paths) 38 | fixationPath(mutations) 39 | } 40 | -------------------------------------------------------------------------------- /R/sitePath-deprecated.R: -------------------------------------------------------------------------------- 1 | #' @rdname sitePath-deprecated 2 | #' @name sitePath-deprecated 3 | #' @aliases multiFixationSites 4 | #' @title Deprecated functions in package \sQuote{sitePath} 5 | #' @description These functions are provided for compatibility with older 6 | #' versions of \sQuote{sitePath} only, and will be defunct at the next 7 | #' release. 8 | #' @details The following functions are deprecated and will be made defunct; use 9 | #' the replacement indicated below: \itemize{ \item{multiFixationSites: 10 | #' \code{\link{fixationSites}}} } 11 | NULL 12 | 13 | #' @export 14 | multiFixationSites <- function(paths, ...) { 15 | UseMethod("multiFixationSites") 16 | } 17 | 18 | #' @export 19 | multiFixationSites.lineagePath <- function(paths, 20 | samplingSize = NULL, 21 | samplingTimes = 100, 22 | minEffectiveSize = 0, 23 | searchDepth = 1, 24 | method = c("compare", "insert", "delete"), 25 | ...) { 26 | .Deprecated("fixationSites") 27 | res <- fixationSites.lineagePath( 28 | paths = paths, 29 | minEffectiveSize = minEffectiveSize, 30 | searchDepth = searchDepth, 31 | method = method, 32 | ... 33 | ) 34 | return(res) 35 | } 36 | -------------------------------------------------------------------------------- /tests/testthat/test-plotSingleSite.R: -------------------------------------------------------------------------------- 1 | test_that("The function plotSingleSite works for AA", { 2 | data(zikv_align) 3 | data(zikv_tree) 4 | tr <- addMSA(tree = zikv_tree, 5 | alignment = zikv_align, 6 | seqType = "AA") 7 | p <- lineagePath(tr) 8 | expect_error(plotSingleSite(p, 2, showPath = TRUE), NA) 9 | expect_error(plotSingleSite(p, 3, showPath = FALSE), NA) 10 | muts <- fixationSites(p) 11 | expect_error(plotSingleSite(muts, 139), NA) 12 | }) 13 | 14 | test_that("The function plotSingleSite works for DNA", { 15 | data(sars2_align) 16 | data(sars2_tree) 17 | tr <- addMSA(tree = sars2_tree, 18 | alignment = sars2_align, 19 | seqType = "DNA") 20 | p <- lineagePath(tr) 21 | expect_error(plotSingleSite(p, 2, showPath = TRUE), NA) 22 | expect_error(plotSingleSite(p, 3, showPath = FALSE), NA) 23 | muts <- fixationSites(p) 24 | expect_error(plotSingleSite(muts, 8517), NA) 25 | }) 26 | 27 | test_that("Warnings in plotSingleSite works", { 28 | data(zikv_align) 29 | data(zikv_tree) 30 | tr <- addMSA(zikv_tree, alignment = zikv_align) 31 | p <- lineagePath(tr) 32 | expect_error(plotSingleSite(p, 0)) 33 | muts <- fixationSites(p) 34 | expect_error(extractSite(muts, 0)) 35 | expect_error(plotSingleSite(muts, 0)) 36 | sp <- extractSite(muts, 139) 37 | nSP <- length(sp) 38 | expect_error(extractTips(sp, nSP + 1)) 39 | }) 40 | -------------------------------------------------------------------------------- /man/allSitesName.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/allSitesName.R 3 | \name{allSitesName} 4 | \alias{allSitesName} 5 | \alias{allSitesName.SNPsites} 6 | \alias{allSitesName.sitesMinEntropy} 7 | \alias{allSitesName.fixationSites} 8 | \alias{allSitesName.parallelSites} 9 | \alias{allSitesName.paraFixSites} 10 | \title{Retrieve position of all the sites} 11 | \usage{ 12 | allSitesName(x, ...) 13 | 14 | \method{allSitesName}{SNPsites}(x, ...) 15 | 16 | \method{allSitesName}{sitesMinEntropy}(x, ...) 17 | 18 | \method{allSitesName}{fixationSites}(x, ...) 19 | 20 | \method{allSitesName}{parallelSites}(x, ...) 21 | 22 | \method{allSitesName}{paraFixSites}(x, type = c("paraFix", "fixation", "parallel"), ...) 23 | } 24 | \arguments{ 25 | \item{x}{The object containing the sites from analysis} 26 | 27 | \item{...}{Other arguments} 28 | 29 | \item{type}{Return fixation or parallel sites} 30 | } 31 | \value{ 32 | An integer vector for sites position 33 | } 34 | \description{ 35 | The function is a way to get position of the resulting sites 36 | from \code{\link{SNPsites}}, \code{\link{fixationSites}} and 37 | \code{\link{parallelSites}}. The numbering is consistent with what's being 38 | set by \code{\link{setSiteNumbering}} 39 | } 40 | \examples{ 41 | data(zikv_tree) 42 | msaPath <- system.file('extdata', 'ZIKV.fasta', package = 'sitePath') 43 | tree <- addMSA(zikv_tree, msaPath = msaPath, msaFormat = 'fasta') 44 | snp <- SNPsites(tree) 45 | allSitesName(snp) 46 | } 47 | -------------------------------------------------------------------------------- /R/plotParallelSites.R: -------------------------------------------------------------------------------- 1 | #' @rdname plotParallelSites 2 | #' @title Plot the result of fixation sites 3 | #' @description Visualize the results of \code{\link{paraFixSites}} 4 | #' @param x return from \code{\link{paraFixSites}} 5 | #' @param site the number of the site according to 6 | #' \code{\link{setSiteNumbering}} 7 | #' @param ... further arguments passed to or from other methods. 8 | #' @return A \code{ggplot} object. 9 | #' @export 10 | #' @examples 11 | #' data(zikv_tree) 12 | #' data(zikv_align) 13 | #' paraFix <- paraFixSites(zikv_tree, alignment = zikv_align) 14 | #' plotParallelSites(paraFix) 15 | plotParallelSites <- function(x, ...) { 16 | UseMethod("plotParallelSites") 17 | } 18 | 19 | #' @rdname plotParallelSites 20 | #' @export 21 | plotParallelSites.parallelSites <- function(x, site = NULL, ...) { 22 | if (length(x)) { 23 | if (is.null(site)) { 24 | p <- plot.parallelSites(x) 25 | } else { 26 | p <- plotSingleSite.parallelSites(x, site = site) 27 | } 28 | } else { 29 | message("There is no parallel sites detected") 30 | paths <- attr(x, "paths") 31 | p <- plot.lineagePath(paths) 32 | } 33 | return(p) 34 | } 35 | 36 | #' @rdname plotParallelSites 37 | #' @export 38 | plotParallelSites.paraFixSites <- function(x, site = NULL, ...) { 39 | paraSites <- attr(x, "allParaSites") 40 | p <- plotParallelSites.parallelSites(x = paraSites, 41 | site = site, 42 | ...) 43 | return(p) 44 | } 45 | -------------------------------------------------------------------------------- /man/sitesMinEntropy.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/sitesMinEntropy.R 3 | \name{sitesMinEntropy} 4 | \alias{sitesMinEntropy} 5 | \alias{sitesMinEntropy.lineagePath} 6 | \title{Fixation sites prediction} 7 | \usage{ 8 | sitesMinEntropy(x, ...) 9 | 10 | \method{sitesMinEntropy}{lineagePath}( 11 | x, 12 | minEffectiveSize = NULL, 13 | searchDepth = 1, 14 | method = c("compare", "insert", "delete"), 15 | ... 16 | ) 17 | } 18 | \arguments{ 19 | \item{x}{A \code{lineagePath} object returned from \code{\link{lineagePath}} 20 | function.} 21 | 22 | \item{...}{further arguments passed to or from other methods.} 23 | 24 | \item{minEffectiveSize}{The minimum number of tips in a group.} 25 | 26 | \item{searchDepth}{The function uses heuristic search but the termination of 27 | the search cannot be intrinsically decided. \code{searchDepth} is needed to 28 | tell the search when to stop.} 29 | 30 | \item{method}{The strategy for predicting the fixation. The basic approach is 31 | entropy minimization and can be achieved by adding or removing fixation 32 | point, or by comparing the two.} 33 | } 34 | \value{ 35 | A \code{sitesMinEntropy} object. 36 | } 37 | \description{ 38 | After finding the \code{\link{lineagePath}} of a phylogenetic 39 | tree, \code{sitesMinEntropy} perform entropy minimization on every site of 40 | the sequence to group the tips according to amino acid/nucleotide. 41 | } 42 | \examples{ 43 | data(zikv_tree_reduced) 44 | data(zikv_align_reduced) 45 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 46 | sitesMinEntropy(lineagePath(tree)) 47 | } 48 | -------------------------------------------------------------------------------- /man/as.data.frame.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/as.data.frame.R 3 | \name{as.data.frame.fixationSites} 4 | \alias{as.data.frame.fixationSites} 5 | \alias{as.data.frame.SNPsites} 6 | \alias{as.data.frame.parallelSites} 7 | \title{Convert results to Data Frame} 8 | \usage{ 9 | \method{as.data.frame}{fixationSites}(x, row.names = NULL, optional = FALSE, ...) 10 | 11 | \method{as.data.frame}{SNPsites}(x, row.names = NULL, optional = FALSE, ...) 12 | 13 | \method{as.data.frame}{parallelSites}(x, row.names = NULL, optional = FALSE, ...) 14 | } 15 | \arguments{ 16 | \item{x}{The object to be converted to \code{data.frame}.} 17 | 18 | \item{row.names}{Unimplemented.} 19 | 20 | \item{optional}{Unimplemented.} 21 | 22 | \item{...}{Other arguments.} 23 | } 24 | \value{ 25 | A \code{\link{data.frame}} object. 26 | } 27 | \description{ 28 | Convert return of functions in \code{sitePath} package to a 29 | \code{\link{data.frame}} so can be better worked with. The group name for 30 | each tip is the same as \code{\link{groupTips}}. 31 | 32 | A \code{\link{fixationSites}} object will output the mutation 33 | name of the fixation and the cluster name before and after the mutation. 34 | 35 | An \code{\link{SNPsites}} object will output the tip name with 36 | the SNP and its position. 37 | 38 | An \code{\link{parallelSites}} object will output the tip name 39 | with the group name and mutation info. 40 | } 41 | \examples{ 42 | data(zikv_tree_reduced) 43 | data(zikv_align_reduced) 44 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 45 | fixations <- fixationSites(lineagePath(tree)) 46 | as.data.frame(fixations) 47 | } 48 | -------------------------------------------------------------------------------- /man/plotMutSites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotMutSites.R 3 | \name{plotMutSites} 4 | \alias{plotMutSites} 5 | \alias{plotMutSites.SNPsites} 6 | \alias{plotMutSites.lineagePath} 7 | \alias{plotMutSites.parallelSites} 8 | \alias{plotMutSites.fixationSites} 9 | \alias{plotMutSites.paraFixSites} 10 | \title{Plot tree and mutation sites} 11 | \usage{ 12 | plotMutSites(x, ...) 13 | 14 | \method{plotMutSites}{SNPsites}(x, showTips = FALSE, ...) 15 | 16 | \method{plotMutSites}{lineagePath}(x, ...) 17 | 18 | \method{plotMutSites}{parallelSites}(x, ...) 19 | 20 | \method{plotMutSites}{fixationSites}(x, ...) 21 | 22 | \method{plotMutSites}{paraFixSites}( 23 | x, 24 | widthRatio = 0.75, 25 | fontSize = 3.88, 26 | dotSize = 1, 27 | lineSize = 0.5, 28 | ... 29 | ) 30 | } 31 | \arguments{ 32 | \item{x}{An \code{\link{SNPsites}} object.} 33 | 34 | \item{...}{Other arguments} 35 | 36 | \item{showTips}{Whether to plot the tip labels. The default is \code{FALSE}.} 37 | 38 | \item{widthRatio}{The width ratio between tree plot and SNP plot} 39 | 40 | \item{fontSize}{The font size of the mutation label in tree plot} 41 | 42 | \item{dotSize}{The dot size of SNP in SNP plot} 43 | 44 | \item{lineSize}{The background line size in SNP plot} 45 | } 46 | \value{ 47 | A tree plot with SNP as dots for each tip. 48 | } 49 | \description{ 50 | The mutated sites for each tip in a phylogenetic tree will be 51 | represented as colored dots positioned by their site number. 52 | } 53 | \examples{ 54 | data(zikv_tree_reduced) 55 | data(zikv_align_reduced) 56 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 57 | plotMutSites(SNPsites(tree)) 58 | } 59 | -------------------------------------------------------------------------------- /R/plotFixationSites.R: -------------------------------------------------------------------------------- 1 | #' @rdname plotFixationSites 2 | #' @title Plot the result of fixation sites 3 | #' @description Visualize the results of \code{\link{paraFixSites}} 4 | #' @param x return from \code{\link{paraFixSites}} 5 | #' @param site the number of the site according to 6 | #' \code{\link{setSiteNumbering}} 7 | #' @param ... further arguments passed to or from other methods. 8 | #' @return A \code{ggplot} object. 9 | #' @export 10 | #' @examples 11 | #' data(zikv_tree_reduced) 12 | #' data(zikv_align_reduced) 13 | #' paraFix <- paraFixSites(zikv_tree_reduced, alignment = zikv_align_reduced) 14 | #' plotFixationSites(paraFix) 15 | plotFixationSites <- function(x, ...) { 16 | UseMethod("plotFixationSites") 17 | } 18 | 19 | #' @rdname plotFixationSites 20 | #' @param site the number of the site according to 21 | #' \code{\link{setSiteNumbering}}. If not provided, all sites will be plotted 22 | #' as labels on the tree 23 | #' @export 24 | plotFixationSites.fixationSites <- function(x, site = NULL, ...) { 25 | if (length(x)) { 26 | if (is.null(site)) { 27 | p <- plot.fixationSites(x) 28 | } else { 29 | p <- plotSingleSite.fixationSites(x, site = site) 30 | } 31 | } else { 32 | message("There is no fixation sites detected") 33 | paths <- attr(x, "paths") 34 | p <- plot.lineagePath(paths) 35 | } 36 | return(p) 37 | } 38 | 39 | #' @rdname plotFixationSites 40 | #' @export 41 | plotFixationSites.paraFixSites <- function(x, site = NULL, ...) { 42 | fixations <- attr(x, "allFixSites") 43 | p <- plotFixationSites.fixationSites(x = fixations, 44 | site = site, 45 | ...) 46 | return(p) 47 | } 48 | -------------------------------------------------------------------------------- /man/setSiteNumbering.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/siteNumbering.R 3 | \name{setSiteNumbering} 4 | \alias{setSiteNumbering} 5 | \alias{setSiteNumbering.phyMSAmatched} 6 | \title{Set site numbering to the reference sequence} 7 | \usage{ 8 | setSiteNumbering(x, reference, gapChar, ...) 9 | 10 | \method{setSiteNumbering}{phyMSAmatched}(x, reference = NULL, gapChar = "-", minSkipSize = NULL, ...) 11 | } 12 | \arguments{ 13 | \item{x}{The object to set site numbering. It could be a 14 | \code{\link{phyMSAmatched}} or a \code{\link{lineagePath}} object.} 15 | 16 | \item{reference}{Name of reference for site numbering. The name has to be one 17 | of the sequences' name. The default uses the intrinsic alignment numbering} 18 | 19 | \item{gapChar}{The character to indicate gap. The numbering will skip the 20 | \code{gapChar} for the reference sequence.} 21 | 22 | \item{...}{Further arguments passed to or from other methods.} 23 | 24 | \item{minSkipSize}{The minimum number of tips to have gap or ambiguous amino 25 | acid/nucleotide for a site to be ignored in other analysis. This will not 26 | affect the numbering. The default is 0.8.} 27 | } 28 | \value{ 29 | The input \code{x} with numbering mapped to \code{reference}. 30 | } 31 | \description{ 32 | A reference sequence can be used to define a global site 33 | numbering scheme for multiple sequence alignment. The gap in the reference 34 | sequence will be skipped for the numbering. Also, the site that is gap or 35 | amino acid/nucleotide for too many tips will be ignored but won't affect 36 | numbering. 37 | } 38 | \examples{ 39 | data(zikv_tree) 40 | msaPath <- system.file('extdata', 'ZIKV.fasta', package = 'sitePath') 41 | tree <- addMSA(zikv_tree, msaPath = msaPath, msaFormat = 'fasta') 42 | setSiteNumbering(tree) 43 | } 44 | -------------------------------------------------------------------------------- /R/reexports.R: -------------------------------------------------------------------------------- 1 | #' @importFrom ape as.phylo read.tree 2 | #' @importFrom tidytree as.treedata 3 | 4 | #' @export 5 | ape::read.tree 6 | 7 | #' @export 8 | seqinr::read.alignment 9 | 10 | #' @export 11 | ape::as.phylo 12 | 13 | #' @export 14 | as.phylo.phyMSAmatched <- function(x, ...) { 15 | res <- attr(x, "tree") 16 | return(res) 17 | } 18 | 19 | #' @export 20 | as.phylo.sitePath <- function(x, ...) { 21 | res <- attr(x, "tree") 22 | return(res) 23 | } 24 | 25 | #' @export 26 | as.phylo.sitesMinEntropy <- function(x, ...) { 27 | paths <- attr(x, "paths") 28 | res <- attr(paths, "tree") 29 | return(res) 30 | } 31 | 32 | #' @export 33 | as.phylo.fixationSites <- function(x, ...) { 34 | paths <- attr(x, "paths") 35 | res <- attr(paths, "tree") 36 | return(res) 37 | } 38 | 39 | as.phylo.fixationIndels <- function(x, ...) { 40 | paths <- attr(x, "paths") 41 | res <- as.phylo(paths) 42 | return(res) 43 | } 44 | 45 | #' @export 46 | tidytree::as.treedata 47 | 48 | #' @export 49 | as.treedata.fixationSites <- function(tree, ...) { 50 | extraArgs <- list(...) 51 | mutTable <- .mutationTable(tree) 52 | transMut <- lapply(X = split(mutTable, mutTable[, "node"]), 53 | FUN = "[[", 54 | i = "mutation") 55 | .node <- extraArgs[[".node"]] 56 | if (is.null(.node)) { 57 | .node <- groupTips.fixationSites(tree) 58 | } 59 | tree <- groupOTU(as.phylo.fixationSites(tree), .node) 60 | tree <- .annotateSNPonTree(tree, transMut) 61 | return(tree) 62 | } 63 | 64 | as.treedata.fixationIndels <- function(tree, ...) { 65 | for (sites in names(tree)) { 66 | indelPath <- tree[[sites]] 67 | } 68 | } 69 | 70 | #' @export 71 | as.treedata.fixationPath <- function(tree, ...) { 72 | res <- attr(tree, "SNPtracing") 73 | return(res) 74 | } 75 | -------------------------------------------------------------------------------- /R/allSitesName.R: -------------------------------------------------------------------------------- 1 | #' @rdname allSitesName 2 | #' @title Retrieve position of all the sites 3 | #' @description The function is a way to get position of the resulting sites 4 | #' from \code{\link{SNPsites}}, \code{\link{fixationSites}} and 5 | #' \code{\link{parallelSites}}. The numbering is consistent with what's being 6 | #' set by \code{\link{setSiteNumbering}} 7 | #' @param x The object containing the sites from analysis 8 | #' @param ... Other arguments 9 | #' @return An integer vector for sites position 10 | #' @export 11 | #' @examples 12 | #' data(zikv_tree) 13 | #' msaPath <- system.file('extdata', 'ZIKV.fasta', package = 'sitePath') 14 | #' tree <- addMSA(zikv_tree, msaPath = msaPath, msaFormat = 'fasta') 15 | #' snp <- SNPsites(tree) 16 | #' allSitesName(snp) 17 | allSitesName <- function(x, ...) { 18 | UseMethod("allSitesName") 19 | } 20 | 21 | #' @rdname allSitesName 22 | #' @export 23 | allSitesName.SNPsites <- function(x, ...) { 24 | as.character(as.integer(x)) 25 | } 26 | 27 | #' @rdname allSitesName 28 | #' @export 29 | allSitesName.sitesMinEntropy <- function(x, ...) { 30 | names(x[[1]]) 31 | } 32 | 33 | #' @rdname allSitesName 34 | #' @export 35 | allSitesName.fixationSites <- function(x, ...) { 36 | names(x) 37 | } 38 | 39 | #' @rdname allSitesName 40 | #' @export 41 | allSitesName.parallelSites <- function(x, ...) { 42 | names(x) 43 | } 44 | 45 | #' @rdname allSitesName 46 | #' @param type Return fixation or parallel sites 47 | #' @export 48 | allSitesName.paraFixSites <- function(x, 49 | type = c("paraFix", "fixation", "parallel"), 50 | ...) { 51 | res <- switch( 52 | match.arg(type), 53 | "paraFix" = as.character(as.integer(x)), 54 | "fixation" = allSitesName(attr(x, "allFixSites")), 55 | "parallel" = allSitesName(attr(x, "allParaSites")) 56 | ) 57 | return(res) 58 | } 59 | -------------------------------------------------------------------------------- /R/extractSite.R: -------------------------------------------------------------------------------- 1 | #' @rdname extractSite 2 | #' @title Extract tips for a single site 3 | #' @description The functions in \code{sitePath} usually include the results on 4 | #' more than one site. The function \code{extractSite} can be used to extract 5 | #' the predicted result on a single site. 6 | #' @param x A \code{fixationSites} or a \code{parallelSites} object. More type 7 | #' will be supported in the later version. 8 | #' @param site A site included in the result. 9 | #' @param ... Other arguments 10 | #' @return The predicted result of a single site 11 | #' @export 12 | #' @examples 13 | #' data(zikv_tree_reduced) 14 | #' data(zikv_align_reduced) 15 | #' tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 16 | #' mutations <- fixationSites(lineagePath(tree)) 17 | #' extractSite(mutations, 139) 18 | extractSite <- function(x, site, ...) { 19 | UseMethod("extractSite") 20 | } 21 | 22 | #' @rdname extractSite 23 | #' @export 24 | extractSite.fixationSites <- function(x, site, ...) { 25 | return(.actualExtractSite(x, site)) 26 | } 27 | 28 | .actualExtractSite <- function(x, site) { 29 | site <- .checkSite(site) 30 | sp <- x[[as.character(site)]] 31 | if (is.null(sp)) { 32 | stop("\"site\": ", site, " is not found in \"x\".") 33 | } 34 | return(sp) 35 | } 36 | 37 | .checkSite <- function(site) { 38 | site <- as.integer(site) 39 | if (!is.numeric(site) || any(site <= 0)) { 40 | stop("Please enter positive integer value for \"site\"") 41 | } 42 | if (length(site) != 1) { 43 | site <- site[1] 44 | warning( 45 | "\"site\" has more than one element, ", 46 | "only the first element (", 47 | site, 48 | ") will be used." 49 | ) 50 | } 51 | return(site) 52 | } 53 | 54 | #' @export 55 | extractSite.parallelSites <- function(x, site, ...) { 56 | return(.actualExtractSite(x, site)) 57 | } 58 | -------------------------------------------------------------------------------- /man/groupTips.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/groupTips.R 3 | \name{groupTips} 4 | \alias{groupTips} 5 | \alias{groupTips.phyMSAmatched} 6 | \alias{groupTips.lineagePath} 7 | \alias{groupTips.sitesMinEntropy} 8 | \alias{groupTips.fixationSites} 9 | \alias{groupTips.fixationPath} 10 | \title{The grouping of tree tips} 11 | \usage{ 12 | groupTips(tree, ...) 13 | 14 | \method{groupTips}{phyMSAmatched}( 15 | tree, 16 | similarity = NULL, 17 | simMatrix = NULL, 18 | forbidTrivial = TRUE, 19 | tipnames = TRUE, 20 | ... 21 | ) 22 | 23 | \method{groupTips}{lineagePath}(tree, tipnames = TRUE, ...) 24 | 25 | \method{groupTips}{sitesMinEntropy}(tree, tipnames = TRUE, ...) 26 | 27 | \method{groupTips}{fixationSites}(tree, tipnames = TRUE, ...) 28 | 29 | \method{groupTips}{fixationPath}(tree, tipnames = TRUE, ...) 30 | } 31 | \arguments{ 32 | \item{tree}{The return from \code{\link{addMSA}}, \code{\link{lineagePath}}, 33 | \code{\link{sitesMinEntropy}} or other functions.} 34 | 35 | \item{...}{Other arguments.} 36 | 37 | \item{similarity}{This decides how minor SNPs are to remove. If provided as 38 | fraction between 0 and 1, then the minimum number of SNP will be total tips 39 | times \code{similariy}. If provided as integer greater than 1, the minimum 40 | number will be \code{similariy}. The default \code{similariy} is 0.05 for 41 | \code{lineagePath}.} 42 | 43 | \item{simMatrix}{Deprecated and will not have effect.} 44 | 45 | \item{forbidTrivial}{Does not allow trivial trimming.} 46 | 47 | \item{tipnames}{If return tips as integer or tip names.} 48 | } 49 | \value{ 50 | \code{groupTips} returns grouping of tips. 51 | } 52 | \description{ 53 | The tips between divergent nodes or fixation mutations on the 54 | lineages are each gathered as group. 55 | } 56 | \examples{ 57 | data(zikv_tree) 58 | data(zikv_align) 59 | tree <- addMSA(zikv_tree, alignment = zikv_align) 60 | groupTips(tree) 61 | } 62 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: sitePath 2 | Type: Package 3 | Title: Phylogeny-based sequence clustering with site polymorphism 4 | Version: 1.25.1 5 | Authors@R: c(person("Chengyang", "Ji", 6 | email = "chengyang.ji12@alumni.xjtlu.edu.cn", 7 | role = c("aut", "cre", "cph"), 8 | comment = c(ORCID = "0000-0001-9258-5453")), 9 | person("Hangyu", "Zhou", 10 | email = "zhy@ism.cams.cn", 11 | role = c("ths")), 12 | person("Aiping", "Wu", 13 | email = "wap@ism.cams.cn", 14 | role = c("ths"))) 15 | Description: Using site polymorphism is one of the ways to cluster DNA/protein 16 | sequences but it is possible for the sequences with the same polymorphism on 17 | a single site to be genetically distant. This package is aimed at clustering 18 | sequences using site polymorphism and their corresponding phylogenetic 19 | trees. By considering their location on the tree, only the structurally 20 | adjacent sequences will be clustered. However, the adjacent sequences may 21 | not necessarily have the same polymorphism. So a branch-and-bound like 22 | algorithm is used to minimize the entropy representing the purity of site 23 | polymorphism of each cluster. 24 | License: MIT + file LICENSE 25 | Depends: 26 | R (>= 4.2) 27 | Imports: 28 | RColorBrewer, 29 | Rcpp, 30 | ape, 31 | aplot, 32 | ggplot2, 33 | ggrepel, 34 | ggtree, 35 | graphics, 36 | grDevices, 37 | gridExtra, 38 | methods, 39 | parallel, 40 | seqinr, 41 | stats, 42 | tidytree, 43 | utils 44 | Suggests: 45 | BiocStyle, 46 | devtools, 47 | knitr, 48 | magick, 49 | rmarkdown, 50 | testthat 51 | LinkingTo: Rcpp 52 | RoxygenNote: 7.2.1 53 | Encoding: UTF-8 54 | VignetteBuilder: knitr 55 | URL: https://wuaipinglab.github.io/sitePath/ 56 | BugReports: https://github.com/wuaipinglab/sitePath/issues 57 | biocViews: Alignment, MultipleSequenceAlignment, Phylogenetics, SNP, Software 58 | -------------------------------------------------------------------------------- /man/extractTips.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/extractTips.R 3 | \name{extractTips} 4 | \alias{extractTips} 5 | \alias{extractTips.lineagePath} 6 | \alias{extractTips.sitesMinEntropy} 7 | \alias{extractTips.fixationSites} 8 | \alias{extractTips.sitePath} 9 | \alias{extractTips.parallelSites} 10 | \alias{extractTips.sitePara} 11 | \title{Extract grouped tips for a single site} 12 | \usage{ 13 | extractTips(x, ...) 14 | 15 | \method{extractTips}{lineagePath}(x, site, ...) 16 | 17 | \method{extractTips}{sitesMinEntropy}(x, site, ...) 18 | 19 | \method{extractTips}{fixationSites}(x, site, select = 1, ...) 20 | 21 | \method{extractTips}{sitePath}(x, select = 1, ...) 22 | 23 | \method{extractTips}{parallelSites}(x, site, ...) 24 | 25 | \method{extractTips}{sitePara}(x, ...) 26 | } 27 | \arguments{ 28 | \item{x}{A \code{fixationSites} or a \code{sitePath} object.} 29 | 30 | \item{...}{Other arguments} 31 | 32 | \item{site}{A site predicted to experience fixation.} 33 | 34 | \item{select}{For a site, there theoretically might be more than one fixation 35 | on different lineages. You may use this argument to extract for a specific 36 | fixation of a site. The default is the first fixation of the site.} 37 | } 38 | \value{ 39 | Tree tips grouped as \code{\link{list}} 40 | } 41 | \description{ 42 | The result of \code{\link{fixationSites}} and \code{sitePath} 43 | contains all the possible sites with fixation mutation. The function 44 | \code{extractTips} retrieves the name of the tips involved in the fixation. 45 | 46 | For \code{\link{lineagePath}}, the function \code{extractTips} 47 | groups all the tree tips according to the amino acid/nucleotide of the 48 | \code{site}. 49 | 50 | For \code{\link{parallelSites}} and \code{sitePara} object, the 51 | function \code{extractTips} retrieve all the tips with parallel mutation. 52 | } 53 | \examples{ 54 | data(zikv_tree_reduced) 55 | data(zikv_align_reduced) 56 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 57 | mutations <- fixationSites(lineagePath(tree)) 58 | extractTips(mutations, 139) 59 | } 60 | -------------------------------------------------------------------------------- /man/fixationSites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/fixationSites.R 3 | \name{fixationSites} 4 | \alias{fixationSites} 5 | \alias{fixationSites.lineagePath} 6 | \alias{fixationSites.sitesMinEntropy} 7 | \alias{fixationSites.paraFixSites} 8 | \title{Fixation sites prediction} 9 | \usage{ 10 | fixationSites(paths, ...) 11 | 12 | \method{fixationSites}{lineagePath}( 13 | paths, 14 | minEffectiveSize = NULL, 15 | searchDepth = 1, 16 | method = c("compare", "insert", "delete"), 17 | ... 18 | ) 19 | 20 | \method{fixationSites}{sitesMinEntropy}(paths, ...) 21 | 22 | \method{fixationSites}{paraFixSites}(paths, ...) 23 | } 24 | \arguments{ 25 | \item{paths}{A \code{lineagePath} object returned from 26 | \code{\link{lineagePath}} function.} 27 | 28 | \item{...}{further arguments passed to or from other methods.} 29 | 30 | \item{minEffectiveSize}{The minimum number of tips in a group.} 31 | 32 | \item{searchDepth}{The function uses heuristic search but the termination of 33 | the search cannot be intrinsically decided. \code{searchDepth} is needed to 34 | tell the search when to stop.} 35 | 36 | \item{method}{The strategy for predicting the fixation. The basic approach is 37 | entropy minimization and can be achieved by adding or removing fixation 38 | point, or by comparing the two.} 39 | } 40 | \value{ 41 | A \code{fixationSites} object. 42 | } 43 | \description{ 44 | After finding the \code{\link{lineagePath}} of a phylogenetic 45 | tree, \code{fixationSites} uses the result to find those sites that show 46 | fixation on some, if not all, of the lineages. The number of tips before 47 | and after the fixation mutation is expected to be more than 48 | \code{minEffectiveSize}. Also, the fixation will be skipped if the amino 49 | acid/nucleotide is gap or ambiguous character. A lineage has to have at 50 | least one fixation mutation to be reported. 51 | } 52 | \examples{ 53 | data(zikv_tree_reduced) 54 | data(zikv_align_reduced) 55 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 56 | fixationSites(lineagePath(tree)) 57 | } 58 | \seealso{ 59 | \code{\link{as.data.frame.fixationSites}} 60 | } 61 | -------------------------------------------------------------------------------- /man/parallelSites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/parallelSites.R 3 | \name{parallelSites} 4 | \alias{parallelSites} 5 | \alias{parallelSites.lineagePath} 6 | \alias{parallelSites.sitesMinEntropy} 7 | \alias{parallelSites.paraFixSites} 8 | \title{Mutation across multiple phylogenetic lineages} 9 | \usage{ 10 | parallelSites(x, ...) 11 | 12 | \method{parallelSites}{lineagePath}( 13 | x, 14 | minSNP = NULL, 15 | mutMode = c("all", "exact", "pre", "post"), 16 | ... 17 | ) 18 | 19 | \method{parallelSites}{sitesMinEntropy}( 20 | x, 21 | minSNP = NULL, 22 | mutMode = c("all", "exact", "pre", "post"), 23 | ... 24 | ) 25 | 26 | \method{parallelSites}{paraFixSites}(x, ...) 27 | } 28 | \arguments{ 29 | \item{x}{A \code{\link{lineagePath}} or a \code{\link{sitesMinEntropy}} 30 | object.} 31 | 32 | \item{...}{The arguments in \code{\link{sitesMinEntropy}}.} 33 | 34 | \item{minSNP}{The minimum number of mutations to be qualified as parallel on 35 | at least two lineages. The default is 1.} 36 | 37 | \item{mutMode}{The strategy for finding parallel site. The default \code{all} 38 | is to consider any mutation regardless of the amino acid/nucleotide before 39 | and after mutation; Or \code{exact} to force mutation to be the same; Or 40 | \code{pre}/\code{post} to select the site having amino acid/nucleotide 41 | before/after mutation.} 42 | } 43 | \value{ 44 | A \code{parallelSites} object 45 | } 46 | \description{ 47 | A site may have mutated on parallel lineages. Mutation can occur 48 | on the same site across the phylogenetic lineages solved by 49 | \code{\link{lineagePath}}. The site will be considered mutated in parallel 50 | if the mutation occurs on the non-overlap part of more than two lineages. 51 | The amino acid/nucleotide before and after the mutation can be allowed 52 | different on different lineages or only the exact same mutations are 53 | considered. 54 | } 55 | \examples{ 56 | data(zikv_tree_reduced) 57 | data(zikv_align_reduced) 58 | tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 59 | paths <- lineagePath(tree) 60 | x <- sitesMinEntropy(paths) 61 | parallelSites(x) 62 | } 63 | -------------------------------------------------------------------------------- /R/RcppExports.R: -------------------------------------------------------------------------------- 1 | # Generated by using Rcpp::compileAttributes() -> do not edit by hand 2 | # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | pairSimilarity <- function(seqIndex, alignedSeqs) { 5 | .Call('_sitePath_pairSimilarity', PACKAGE = 'sitePath', seqIndex, alignedSeqs) 6 | } 7 | 8 | getSimilarityMatrix <- function(alignedSeqs) { 9 | .Call('_sitePath_getSimilarityMatrix', PACKAGE = 'sitePath', alignedSeqs) 10 | } 11 | 12 | majorSNPtips <- function(alignedSeqs, siteIndices, minSNPnum) { 13 | .Call('_sitePath_majorSNPtips', PACKAGE = 'sitePath', alignedSeqs, siteIndices, minSNPnum) 14 | } 15 | 16 | terminalTipsBySim <- function(siteIndices, tipPaths, alignedSeqs, metricMatrix) { 17 | .Call('_sitePath_terminalTipsBySim', PACKAGE = 'sitePath', siteIndices, tipPaths, alignedSeqs, metricMatrix) 18 | } 19 | 20 | terminalTipsByDist <- function(siteIndices, tipPaths, alignedSeqs, metricMatrix) { 21 | .Call('_sitePath_terminalTipsByDist', PACKAGE = 'sitePath', siteIndices, tipPaths, alignedSeqs, metricMatrix) 22 | } 23 | 24 | mergePaths <- function(paths) { 25 | .Call('_sitePath_mergePaths', PACKAGE = 'sitePath', paths) 26 | } 27 | 28 | divergentNode <- function(paths) { 29 | .Call('_sitePath_divergentNode', PACKAGE = 'sitePath', paths) 30 | } 31 | 32 | getReference <- function(refSeq, gapChar) { 33 | .Call('_sitePath_getReference', PACKAGE = 'sitePath', refSeq, gapChar) 34 | } 35 | 36 | tableAA <- function(seqs, siteIndex) { 37 | .Call('_sitePath_tableAA', PACKAGE = 'sitePath', seqs, siteIndex) 38 | } 39 | 40 | minEntropyByInserting <- function(nodeSummaries, minEffectiveSize, searchDepth) { 41 | .Call('_sitePath_minEntropyByInserting', PACKAGE = 'sitePath', nodeSummaries, minEffectiveSize, searchDepth) 42 | } 43 | 44 | minEntropyByDeleting <- function(nodeSummaries, minEffectiveSize, searchDepth) { 45 | .Call('_sitePath_minEntropyByDeleting', PACKAGE = 'sitePath', nodeSummaries, minEffectiveSize, searchDepth) 46 | } 47 | 48 | minEntropyByComparing <- function(nodeSummaries, minEffectiveSize, searchDepth) { 49 | .Call('_sitePath_minEntropyByComparing', PACKAGE = 'sitePath', nodeSummaries, minEffectiveSize, searchDepth) 50 | } 51 | 52 | -------------------------------------------------------------------------------- /man/addMSA.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/addMSA.R 3 | \name{phyMSAmatched} 4 | \alias{phyMSAmatched} 5 | \alias{addMSA} 6 | \alias{addMSA.phylo} 7 | \alias{addMSA.treedata} 8 | \title{Add matching sequence alignment to the tree} 9 | \usage{ 10 | addMSA(tree, ...) 11 | 12 | \method{addMSA}{phylo}( 13 | tree, 14 | msaPath = "", 15 | msaFormat = c("fasta", "clustal", "phylip", "mase", "msf"), 16 | alignment = NULL, 17 | seqType = c("AA", "DNA", "RNA"), 18 | ... 19 | ) 20 | 21 | \method{addMSA}{treedata}(tree, ...) 22 | } 23 | \arguments{ 24 | \item{tree}{A \code{\link{phylo}} object. This commonly can be from tree 25 | parsing function in \code{\link{ape}} or \code{\link{ggtree}}. All the 26 | \code{tip.label} should be found in the sequence alignment. The tree is 27 | supposed to be fully resolved (bifurcated) and will be resolved by 28 | \code{\link{multi2di}} if \code{\link{is.binary}} gives \code{FALSE}.} 29 | 30 | \item{...}{Other arguments.} 31 | 32 | \item{msaPath}{The file path to the multiple sequence alignment file.} 33 | 34 | \item{msaFormat}{The format of the multiple sequence alignment file. The 35 | internal uses the \code{\link{read.alignment}} from \code{\link{seqinr}} 36 | package to parse the sequence alignment. The default is "fasta" and it also 37 | accepts "clustal", "phylip", "mase", "msf".} 38 | 39 | \item{alignment}{An \code{alignment} object. This commonly can be from 40 | sequence parsing function in the \code{\link{seqinr}} package. Sequence 41 | names in the alignment should include all \code{tip.label} in the tree} 42 | 43 | \item{seqType}{The type of the sequence in the alignment file. The default is 44 | "AA" for amino acid. The other options are "DNA" and "RNA".} 45 | } 46 | \value{ 47 | Since 1.5.12, the function returns a \code{phyMSAmatched} object to 48 | avoid S3 methods used on \code{phylo} (better encapsulation). 49 | } 50 | \description{ 51 | \code{addMSA} wraps \code{\link{read.alignment}} function in 52 | \code{\link{seqinr}} package and helps match names in tree and sequence 53 | alignment. Either provide the file path to an alignment file and its format 54 | or an alignment object from the return of \code{\link{read.alignment}} 55 | function. If both the file path and alignment object are given, the 56 | function will use the sequence in the alignment file. 57 | } 58 | \examples{ 59 | data(zikv_tree) 60 | msaPath <- system.file('extdata', 'ZIKV.fasta', package = 'sitePath') 61 | addMSA(zikv_tree, msaPath = msaPath, msaFormat = 'fasta') 62 | } 63 | \seealso{ 64 | \code{\link{read.alignment}} 65 | } 66 | -------------------------------------------------------------------------------- /man/plotSingleSite.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotSingleSite.R 3 | \name{plotSingleSite} 4 | \alias{plotSingleSite} 5 | \alias{plotSingleSite.lineagePath} 6 | \alias{plotSingleSite.sitesMinEntropy} 7 | \alias{plotSingleSite.parallelSites} 8 | \alias{plotSingleSite.fixationSites} 9 | \title{Color the tree by a single site} 10 | \usage{ 11 | plotSingleSite(x, site, ...) 12 | 13 | \method{plotSingleSite}{lineagePath}(x, site, showPath = TRUE, showTips = FALSE, ...) 14 | 15 | \method{plotSingleSite}{sitesMinEntropy}(x, site, ...) 16 | 17 | \method{plotSingleSite}{parallelSites}(x, site, showPath = TRUE, ...) 18 | 19 | \method{plotSingleSite}{fixationSites}(x, site, select = NULL, ...) 20 | } 21 | \arguments{ 22 | \item{x}{The object to plot.} 23 | 24 | \item{site}{For \code{lineagePath}, it can be any site within sequence 25 | length. For \code{fixationSites} and \code{parallelSites}, it is restrained 26 | to a predicted fixation site. The numbering is consistent with the 27 | reference defined by \code{\link{setSiteNumbering}}.} 28 | 29 | \item{...}{Other arguments. Since 1.5.4, the function uses 30 | \code{\link{ggtree}} as the base function to make plots so the arguments in 31 | \code{plot.phylo} will no longer work.} 32 | 33 | \item{showPath}{If plot the lineage result from \code{\link{lineagePath}}. 34 | The default is \code{TRUE}.} 35 | 36 | \item{showTips}{Whether to plot the tip labels. The default is \code{FALSE}.} 37 | 38 | \item{select}{Select which fixation path in to plot. The default is NULL 39 | which will plot all the fixations.} 40 | } 41 | \value{ 42 | Since 1.5.4, the function returns a ggplot object so on longer 43 | behaviors like the generic \code{\link{plot}} function. 44 | } 45 | \description{ 46 | Plot and color the tree according to amino acid/nucleotide of 47 | the selected site. The color scheme depends on the \code{seqType} set in 48 | \code{\link{addMSA}} function. 49 | 50 | For \code{\link{lineagePath}}, the tree will be colored 51 | according to the amino acid of the site. The color scheme tries to assign 52 | distinguishable color for each amino acid. 53 | 54 | For \code{\link{parallelSites}}, the tree will be colored 55 | according to the amino acid of the site if the mutation is not fixed. 56 | 57 | For \code{\link{fixationSites}}, it will color the ancestral 58 | tips in red, descendant tips in blue and excluded tips in grey. 59 | } 60 | \examples{ 61 | data(zikv_tree) 62 | data(zikv_align) 63 | tree <- addMSA(zikv_tree, alignment = zikv_align) 64 | paths <- lineagePath(tree) 65 | plotSingleSite(paths, 139) 66 | fixations <- fixationSites(paths) 67 | plotSingleSite(fixations, 139) 68 | } 69 | \seealso{ 70 | \code{\link{plot.sitePath}} 71 | } 72 | -------------------------------------------------------------------------------- /tests/testthat/test-lineagePath.R: -------------------------------------------------------------------------------- 1 | test_that("Using SNP to get phylogenetic lineages", { 2 | data(zikv_align) 3 | data(zikv_tree) 4 | 5 | tr <- addMSA(zikv_tree, alignment = zikv_align) 6 | tr2 <- lineagePath(zikv_tree, alignment = zikv_align) 7 | expect_identical(tr, tr2) 8 | expect_true(is(tr, "phyMSAmatched")) 9 | tipNames <- zikv_tree[["tip.label"]] 10 | nTips <- length(tipNames) 11 | # The tree is bifurcated and should be identical to the original tree 12 | expect_identical(as.phylo(tr), zikv_tree) 13 | align <- attr(tr, "align") 14 | reference <- attr(tr, "msaNumbering") 15 | # Test the input of 'similarity' 16 | expect_error(lineagePath(tree = tr, similarity = "0.96")) 17 | expect_error(lineagePath(tree = tr, similarity = -0.1)) 18 | expect_error(lineagePath( 19 | tree = tr, 20 | similarity = 1, 21 | forbidTrivial = FALSE 22 | )) 23 | # Use 0.1 as the input for 'similarity' 24 | similarity <- nTips * 0.1 25 | paths <- lineagePath(tree = tr, 26 | similarity = similarity, 27 | forbidTrivial = FALSE) 28 | # Exclude the invariant sites 29 | loci <- attr(tr, "loci") 30 | majorSNPsites <- list() 31 | for (site in loci) { 32 | for (p in paths) { 33 | # The descendant tips of the terminal node of the path 34 | seqs <- 35 | align[sitePath:::.childrenTips(as.phylo(tr), p[length(p)])] 36 | siteSummary <- sitePath:::tableAA(seqs, site - 1) 37 | # Get the SNP possibly exclusive to the descendant tips 38 | siteChar <- 39 | names(siteSummary)[which(siteSummary == length(seqs))] 40 | # Add the SNP to 'majorSNPsites' 41 | if (length(siteSummary) != 0) { 42 | s <- as.character(site) 43 | if (s %in% names(majorSNPsites)) { 44 | majorSNPsites[[s]] <- c(majorSNPsites[[s]], siteChar) 45 | } else { 46 | majorSNPsites[[s]] <- siteChar 47 | } 48 | } 49 | } 50 | } 51 | # The SNP should meet the 'similarity' threshold 52 | for (site in names(majorSNPsites)) { 53 | siteSummary <- sitePath:::tableAA(align, as.integer(site) - 1) 54 | expect_true(any(siteSummary > similarity)) 55 | } 56 | }) 57 | 58 | test_that("The sneakPeek function works", { 59 | data(zikv_align_reduced) 60 | data(zikv_tree_reduced) 61 | tr <- addMSA(zikv_tree_reduced, 62 | alignment = zikv_align_reduced) 63 | rangeOfResults <- sneakPeek(tr, makePlot = FALSE) 64 | expect_error(sneakPeek(tr, makePlot = FALSE), NA) 65 | for (i in seq_len(nrow(rangeOfResults))) { 66 | similarity <- rangeOfResults[i, "similarity"] 67 | pathNum <- rangeOfResults[i, "pathNum"] 68 | p <- lineagePath(rangeOfResults, similarity = similarity) 69 | expect_true(is(p, "lineagePath")) 70 | expect_true(length(p) == pathNum, 71 | label = paste("Simialrity:", similarity)) 72 | } 73 | }) 74 | -------------------------------------------------------------------------------- /tests/testthat/test-SNPsites.R: -------------------------------------------------------------------------------- 1 | test_minSNPthreshold <- function(x, gapChar, minT, maxT) { 2 | x <- setSiteNumbering(x, gapChar = gapChar) 3 | if (attr(x, "seqType") == "AA") { 4 | unambiguous <- setdiff(sitePath:::AA_UNAMBIGUOUS, gapChar) 5 | } else { 6 | unambiguous <- setdiff(sitePath:::NT_UNAMBIGUOUS, gapChar) 7 | } 8 | align <- attr(x, "align") 9 | for (minSNP in seq(minT, maxT, 20)) { 10 | snp <- SNPsites(x, minSNP = minSNP) 11 | p <- lineagePath(x) 12 | snp2 <- SNPsites(p, minSNP = minSNP) 13 | # Test the plot function 14 | expect_error(plotMutSites(snp), NA) 15 | expect_error(plotMutSites(snp2), NA) 16 | allSNP <- attr(snp, "allSNP") 17 | allSNP <- split(x = allSNP, 18 | f = allSNP[, c("Pos", "SNP")], 19 | drop = TRUE) 20 | for (siteSNP in allSNP) { 21 | siteName <- unique(siteSNP[["Pos"]]) 22 | snpAA <- unique(siteSNP[["SNP"]]) 23 | # The amino acid should not be gap or ambiguous 24 | expect_true(all(snpAA %in% unambiguous)) 25 | # At least two kinds of amino acid over the 'minSNP' threshold 26 | snpNum <- sapply(snpAA, function(AA) { 27 | sum(siteSNP[["SNP"]] == AA) 28 | }) 29 | expect_true(all(snpNum >= minSNP)) 30 | # Summarize the amino acid of the SNP site and the number should be 31 | # consistent with the result 32 | siteCharSummary <- table(sapply( 33 | X = align, 34 | FUN = substring, 35 | first = siteName, 36 | last = siteName 37 | )) 38 | expect_true(all(siteCharSummary[snpAA] == snpNum)) 39 | } 40 | } 41 | } 42 | 43 | test_that("Works for amino acid", { 44 | data(zikv_align_reduced) 45 | data(zikv_tree_reduced) 46 | tr <- addMSA(tree = zikv_tree_reduced, 47 | alignment = zikv_align_reduced) 48 | tipNames <- zikv_tree_reduced[["tip.label"]] 49 | nTips <- length(tipNames) 50 | # Test the input of 'minSNP' 51 | expect_error(SNPsites(tr, minSNP = nTips)) 52 | expect_error(SNPsites(tr, minSNP = "notRight")) 53 | # Set an arbitrary 'gapChar' 54 | gapChar <- "G" 55 | # Set up a range of 'minSNP' values 56 | minT <- floor(nTips / 10) 57 | maxT <- ceiling(nTips / 2) 58 | test_minSNPthreshold(tr, gapChar, minT, maxT) 59 | }) 60 | 61 | test_that("Works for nucleotide", { 62 | data(sars2_align) 63 | data(sars2_tree) 64 | tr <- addMSA(tree = sars2_tree, 65 | alignment = sars2_align) 66 | tipNames <- sars2_tree[["tip.label"]] 67 | nTips <- length(tipNames) 68 | # Test the input of 'minSNP' 69 | expect_error(SNPsites(tr, minSNP = nTips)) 70 | expect_error(SNPsites(tr, minSNP = "notRight")) 71 | # Set an arbitrary 'gapChar' 72 | gapChar <- "G" 73 | # Set up a range of 'minSNP' values 74 | minT <- floor(nTips / 10) 75 | maxT <- ceiling(nTips / 2) 76 | test_minSNPthreshold(tr, gapChar, minT, maxT) 77 | }) 78 | -------------------------------------------------------------------------------- /tests/testthat/test-parallelSites.R: -------------------------------------------------------------------------------- 1 | test_minSNPandSiteSkipping <- function(minEntropy, unambiguous) { 2 | # Use different 'minSNP' constrains and 'mutMode' 3 | mutModes <- c("pre", "all", "post", "exact") 4 | for (minSNP in c(1, 5, 15)) { 5 | for (i in seq_along(mutModes)) { 6 | mutations <- parallelSites(minEntropy, 7 | minSNP = minSNP, 8 | mutMode = mutModes[i]) 9 | for (site in names(mutations)) { 10 | sp <- mutations[[site]] 11 | mutTips <- extractTips(sp) 12 | # Group the tips according to the 'mutMode' 13 | mutModeTips <- list() 14 | for (tips in mutTips) { 15 | mutName <- attr(tips, "mutName")[i] 16 | mutModeTips[[mutName]] <- 17 | c(mutModeTips[[mutName]], list(tips)) 18 | } 19 | for (tips in mutModeTips) { 20 | mutCharsValid <- unlist(lapply(tips, function(i) { 21 | if (attr(i, "fixed")) { 22 | return(TRUE) 23 | } 24 | attr(i, "mutName")[c(1, 3)] %in% unambiguous 25 | }), use.names = FALSE) 26 | # The gap and ambiguous character should have been ignored 27 | expect_true(all(mutCharsValid), 28 | label = paste(site, mutModes[i], minSNP)) 29 | tipNum <- length(unlist(tips)) 30 | isFixed <- vapply( 31 | X = tips, 32 | FUN = attr, 33 | FUN.VALUE = logical(1), 34 | which = "fixed" 35 | ) 36 | # This check could be improved because the 'minSNP' 37 | # constrain applies on the two lineage separately. Here it 38 | # was tested for all the result of a site 39 | expect_true( 40 | tipNum >= minSNP * 2 || 41 | sum(isFixed) >= 2 || 42 | tipNum >= minSNP && 43 | any(isFixed) 44 | ) 45 | } 46 | } 47 | } 48 | } 49 | } 50 | 51 | test_that("The function works for amino acid", { 52 | data(h3n2_tree_reduced) 53 | data(h3n2_align_reduced) 54 | p <- addMSA(tree = h3n2_tree_reduced, 55 | alignment = h3n2_align_reduced) 56 | gapChar <- "S" 57 | p <- setSiteNumbering(p, gapChar = gapChar) 58 | # Test the input of 'minSNP' and 'mutMode' 59 | expect_error(parallelSites(p, minSNP = "a")) 60 | expect_error(parallelSites(p, mutMode = "asdfa")) 61 | minEntropy <- sitesMinEntropy(p) 62 | unambiguous <- setdiff(sitePath:::AA_UNAMBIGUOUS, gapChar) 63 | test_minSNPandSiteSkipping(minEntropy, unambiguous) 64 | }) 65 | 66 | test_that("The function works for nucleotide", { 67 | data(sars2_tree) 68 | data(sars2_align) 69 | tr <- addMSA(tree = sars2_tree, 70 | alignment = sars2_align, 71 | seqType = "DNA") 72 | gapChar <- "G" 73 | p <- lineagePath(tr) 74 | p <- setSiteNumbering(p, gapChar = gapChar) 75 | # Test the input of 'minSNP' and 'mutMode' 76 | expect_error(parallelSites(p, minSNP = "a")) 77 | expect_error(parallelSites(p, mutMode = "asdfa")) 78 | minEntropy <- sitesMinEntropy(p) 79 | unambiguous <- setdiff(sitePath:::NT_UNAMBIGUOUS, gapChar) 80 | test_minSNPandSiteSkipping(minEntropy, unambiguous) 81 | }) 82 | -------------------------------------------------------------------------------- /man/plotFunctions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotFunctions.R 3 | \name{plot.phyMSAmatched} 4 | \alias{plot.phyMSAmatched} 5 | \alias{plot.lineagePath} 6 | \alias{plot.parallelSites} 7 | \alias{plot.fixationSites} 8 | \alias{plot.sitePath} 9 | \alias{plot.fixationIndels} 10 | \alias{plot.fixationPath} 11 | \title{Visualize the results} 12 | \usage{ 13 | \method{plot}{phyMSAmatched}(x, y = TRUE, ...) 14 | 15 | \method{plot}{lineagePath}(x, y = TRUE, showTips = FALSE, ...) 16 | 17 | \method{plot}{parallelSites}(x, y = TRUE, ...) 18 | 19 | \method{plot}{fixationSites}(x, y = TRUE, tipsGrouping = NULL, ...) 20 | 21 | \method{plot}{sitePath}(x, y = NULL, select = NULL, showTips = FALSE, ...) 22 | 23 | \method{plot}{fixationIndels}(x, y = TRUE, ...) 24 | 25 | \method{plot}{fixationPath}(x, y = TRUE, ...) 26 | } 27 | \arguments{ 28 | \item{x}{The object to plot.} 29 | 30 | \item{y}{Whether to show the fixation mutation between clusters. For 31 | \code{lineagePath} object and \code{sitePath} object, it is deprecated and 32 | no longer have effect since 1.5.4.} 33 | 34 | \item{...}{Other arguments. Since 1.5.4, the function uses 35 | \code{\link{ggtree}} as the base function to make plots so the arguments in 36 | \code{plot.phylo} will no longer work.} 37 | 38 | \item{showTips}{Whether to plot the tip labels. The default is \code{FALSE}.} 39 | 40 | \item{tipsGrouping}{A \code{list} to hold the grouping of tips for how the 41 | tree will be colored.} 42 | 43 | \item{select}{For a \code{sitePath} object, it can have result on more than 44 | one evolution pathway. This is to select which path to plot. The default is 45 | \code{NULL} which will plot all the paths. It is the same as \code{select} 46 | in \code{\link{plotSingleSite}}.} 47 | } 48 | \value{ 49 | A ggplot object to make the plot. 50 | } 51 | \description{ 52 | The plot function to visualize the return of functions in the 53 | package. The underlying function applies \code{\link{ggplot2}}. The 54 | function name \code{plot} is used to keep the compatibility with previous 55 | versions, but they do not behave like the generic \code{\link{plot}} 56 | function since 1.5.4. 57 | 58 | A \code{\link{phyMSAmatched}} object will be plotted as a tree 59 | diagram. 60 | 61 | A \code{\link{lineagePath}} object will be plotted as a tree 62 | diagram and paths are black solid line while the trimmed nodes and tips 63 | will use gray dashed line. 64 | 65 | A \code{\link{parallelSites}} object will be plotted as original 66 | phylogenetic tree marked with parallel mutations attached as dot plot. 67 | 68 | A \code{\link{fixationSites}} object will be plotted as original 69 | phylogenetic tree marked with fixation substitutions. 70 | 71 | A \code{sitePath} object can be extracted by using 72 | \code{\link{extractSite}} on the return of \code{\link{fixationSites}}. 73 | 74 | A \code{\link{fixationIndels}} object will be plotted as 75 | original phylogenetic tree marked with indel fixation. 76 | 77 | A \code{\link{fixationPath}} object will be plotted as a 78 | \code{phylo} object. The tips are clustered according to the fixation 79 | sites. The transition of fixation sites will be plotted as a phylogenetic 80 | tree. The length of each branch represents the number of fixation mutation 81 | between two clusters. 82 | } 83 | \examples{ 84 | data(zikv_tree) 85 | data(zikv_align) 86 | tree <- addMSA(zikv_tree, alignment = zikv_align) 87 | plot(tree) 88 | paths <- lineagePath(tree) 89 | plot(paths) 90 | parallel <- parallelSites(paths) 91 | plot(parallel) 92 | fixations <- fixationSites(paths) 93 | plot(fixations) 94 | sp <- extractSite(fixations, 139) 95 | plot(sp) 96 | x <- fixationPath(fixations) 97 | plot(x) 98 | } 99 | -------------------------------------------------------------------------------- /src/lumpyCluster.h: -------------------------------------------------------------------------------- 1 | /* 2 | * When finding the terminal of each phylogenetic lineages, the major SNP is 3 | * thought to be unique to the lineage but sometimes the same SNP could appear 4 | * in other lineages. In this case, the ancestral node of the tips with the SNP 5 | * cannot locate to the desired terminal of the lineages. A clustering method 6 | * similar to CD-HIT http://weizhongli-lab.org/cd-hit/ is used to detect the 7 | * outlier tip with the SNP or identify the multiple groups having the same SNP. 8 | */ 9 | 10 | #ifndef SITEPATH_LUMPCLUSTER_H 11 | #define SITEPATH_LUMPCLUSTER_H 12 | 13 | #include 14 | #include 15 | #include 16 | 17 | #include "treemer.h" 18 | 19 | namespace LumpyCluster { 20 | 21 | typedef std::vector lumpingTips; 22 | typedef std::vector tipGrouping; 23 | 24 | class Base { 25 | public: 26 | Base( 27 | const Rcpp::NumericMatrix &metricMatrix, 28 | const int maxSNPnum 29 | ); 30 | tipGrouping finalClusters() const; 31 | protected: 32 | void mergeClusters( 33 | const Treemer::clusters &clusters, 34 | const int zValue 35 | ); 36 | virtual void thresholdOffset( 37 | const float stdev, 38 | const int zValue 39 | ) = 0; 40 | float clusterCompare( 41 | const Treemer::tips &query, 42 | const Treemer::tips &subject 43 | ); 44 | virtual bool betterMetric( 45 | const float query, 46 | const float subject 47 | ) const = 0; 48 | virtual bool qualifiedMetric(const float metric) const = 0; 49 | protected: 50 | const Rcpp::NumericMatrix m_metricMatrix; 51 | // The merged clusters output 52 | const int m_maxSNPnum; 53 | // The threshold for clustering and the metric standard deviation of all 54 | // tips pairs 55 | lumpingTips m_merged; 56 | // The maximum number of SNP 57 | float m_metricThreshold; 58 | }; 59 | 60 | /* 61 | * The clusters tend to merge when their metric is larger. 62 | */ 63 | 64 | class BySimMatrix: public Base { 65 | public: 66 | BySimMatrix( 67 | const Rcpp::NumericMatrix &simMatrix, 68 | const Treemer::clusters &clusters, 69 | const int maxSNPnum, 70 | const int zValue 71 | ); 72 | protected: 73 | void thresholdOffset( 74 | const float stdev, 75 | const int zValue 76 | ); 77 | bool betterMetric( 78 | const float query, 79 | const float subject 80 | ) const; 81 | bool qualifiedMetric(const float metric) const; 82 | }; 83 | 84 | /* 85 | * The initiation of clustering by distance is the same as clustering by 86 | * similarity. Only the test of better metric and qualification is reversed. The 87 | * clusters tend to merge when their metric is smaller. 88 | */ 89 | class ByDistMatrix: public Base { 90 | public: 91 | ByDistMatrix( 92 | const Rcpp::NumericMatrix &distMatrix, 93 | const Treemer::clusters &clusters, 94 | const int maxSNPnum, 95 | const int zValue 96 | ); 97 | protected: 98 | void thresholdOffset( 99 | const float stdev, 100 | const int zValue 101 | ); 102 | bool betterMetric( 103 | const float query, 104 | const float subject 105 | ) const; 106 | bool qualifiedMetric(const float metric) const; 107 | }; 108 | 109 | template 110 | std::map terminalTips( 111 | const Rcpp::ListOf &tipPaths, 112 | const Rcpp::ListOf &alignedSeqs, 113 | const Rcpp::NumericMatrix &metricMatrix, 114 | const Rcpp::IntegerVector &siteIndices, 115 | const int zValue 116 | ); 117 | 118 | } 119 | 120 | #endif /* SITEPATH_LUMPCLUSTER_H */ 121 | -------------------------------------------------------------------------------- /man/paraFixSites.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/paraFixSites.R 3 | \name{paraFixSites} 4 | \alias{paraFixSites} 5 | \alias{paraFixSites.phylo} 6 | \alias{paraFixSites.treedata} 7 | \alias{paraFixSites.lineagePath} 8 | \alias{paraFixSites.sitesMinEntropy} 9 | \title{The fixation sites with mutation on parallel lineage} 10 | \usage{ 11 | paraFixSites(x, ...) 12 | 13 | \method{paraFixSites}{phylo}( 14 | x, 15 | alignment = NULL, 16 | seqType = c("AA", "DNA", "RNA"), 17 | Nmin = NULL, 18 | reference = NULL, 19 | gapChar = "-", 20 | minSkipSize = NULL, 21 | ... 22 | ) 23 | 24 | \method{paraFixSites}{treedata}(x, ...) 25 | 26 | \method{paraFixSites}{lineagePath}( 27 | x, 28 | minEffectiveSize = NULL, 29 | searchDepth = 1, 30 | method = c("compare", "insert", "delete"), 31 | ... 32 | ) 33 | 34 | \method{paraFixSites}{sitesMinEntropy}( 35 | x, 36 | category = c("intersect", "union", "parallelOnly", "fixationOnly"), 37 | minSNP = NULL, 38 | mutMode = c("all", "exact", "pre", "post"), 39 | ... 40 | ) 41 | } 42 | \arguments{ 43 | \item{x}{A \code{lineagePath} object returned from \code{\link{lineagePath}} 44 | function.} 45 | 46 | \item{...}{further arguments passed to or from other methods.} 47 | 48 | \item{alignment}{An \code{alignment} object. This commonly can be from 49 | sequence parsing function in the \code{\link{seqinr}} package. Sequence 50 | names in the alignment should include all \code{tip.label} in the tree} 51 | 52 | \item{seqType}{The type of the sequence in the alignment file. The default is 53 | "AA" for amino acid. The other options are "DNA" and "RNA".} 54 | 55 | \item{Nmin}{The parameter for identifying phylogenetic pathway using SNP. If 56 | provided as fraction between 0 and 1, then the minimum number of SNP will 57 | be total tips times \code{Nmin}. If provided as integer greater than 1, the 58 | minimum number will be \code{Nmin}.} 59 | 60 | \item{reference}{Name of reference for site numbering. The name has to be one 61 | of the sequences' name. The default uses the intrinsic alignment numbering} 62 | 63 | \item{gapChar}{The character to indicate gap. The numbering will skip the 64 | \code{gapChar} for the reference sequence.} 65 | 66 | \item{minSkipSize}{The minimum number of tips to have gap or ambiguous amino 67 | acid/nucleotide for a site to be ignored in other analysis. This will not 68 | affect the numbering. The default is 0.8.} 69 | 70 | \item{minEffectiveSize}{The minimum number of tips in a group.} 71 | 72 | \item{searchDepth}{The function uses heuristic search but the termination of 73 | the search cannot be intrinsically decided. \code{searchDepth} is needed to 74 | tell the search when to stop.} 75 | 76 | \item{method}{The strategy for predicting the fixation. The basic approach is 77 | entropy minimization and can be achieved by adding or removing fixation 78 | point, or by comparing the two.} 79 | 80 | \item{category}{Could be \code{parallelOnly}, \code{fixationOnly}, 81 | \code{intersect} or \code{union}.} 82 | 83 | \item{minSNP}{The minimum number of mutations to be qualified as parallel on 84 | at least two lineages. The default is 1.} 85 | 86 | \item{mutMode}{The strategy for finding parallel site. The default \code{all} 87 | is to consider any mutation regardless of the amino acid/nucleotide before 88 | and after mutation; Or \code{exact} to force mutation to be the same; Or 89 | \code{pre}/\code{post} to select the site having amino acid/nucleotide 90 | before/after mutation.} 91 | } 92 | \value{ 93 | A \code{paraFixSites} object. 94 | } 95 | \description{ 96 | The operation between the results of \code{\link{fixationSites}} 97 | and \code{\link{parallelSites}}. 98 | } 99 | \examples{ 100 | data(zikv_tree_reduced) 101 | data(zikv_align_reduced) 102 | paraFixSites(zikv_tree_reduced, alignment = zikv_align_reduced) 103 | } 104 | -------------------------------------------------------------------------------- /R/SNPsites.R: -------------------------------------------------------------------------------- 1 | #' @importFrom stats complete.cases 2 | #' @importFrom ape Ntip 3 | 4 | #' @rdname SNPsites 5 | #' @title Finding sites with variation 6 | #' @description Single nucleotide polymorphism (SNP) in the whole package refers 7 | #' to variation of amino acid. \code{SNPsite} will try to find SNP in the 8 | #' multiple sequence alignment. A reference sequence and gap character may be 9 | #' specified to number the site. 10 | #' @param tree A \code{\link{phyMSAmatched}} object. 11 | #' @param minSNP Minimum number of a mutation to be a SNP. The default is 10th 12 | #' of the total tree tips. 13 | #' @param ... Other arguments 14 | #' @return A \code{SNPsites} object. 15 | #' @export 16 | #' @examples 17 | #' data(zikv_tree_reduced) 18 | #' data(zikv_align_reduced) 19 | #' tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 20 | #' SNPsites(tree) 21 | SNPsites <- function(tree, ...) { 22 | UseMethod("SNPsites") 23 | } 24 | 25 | #' @rdname SNPsites 26 | #' @export 27 | SNPsites.phyMSAmatched <- function(tree, minSNP = NULL, ...) { 28 | extraArgs <- list(...) 29 | x <- .phyMSAmatch(tree) 30 | nTips <- Ntip(attr(x, "tree")) 31 | # Set default 'minSNP' value 32 | if (is.null(minSNP)) { 33 | minSNP <- attr(x, "minSize") 34 | } else if (!is.numeric(minSNP)) { 35 | stop("\"minSNP\" only accepts numeric") 36 | } else if (minSNP >= nTips / 2) { 37 | stop("\"minSNP\": ", 38 | minSNP, 39 | " is greater than half of the total tips: ", 40 | floor(nTips / 2)) 41 | } 42 | align <- attr(x, "align") 43 | msaNumbering <- attr(x, "msaNumbering") 44 | if (!is.null(extraArgs[["useAllSites"]])) { 45 | msaNumbering <- seq_len(nchar(align[[1]])) 46 | } 47 | refSeqName <- attr(x, "reference") 48 | unambiguous <- .unambiguousChars(x) 49 | # Find SNP for each tree tip by comparing with the consensus sequence or the 50 | # reference sequence if specified 51 | if (is.null(refSeqName)) { 52 | # Find the major SNP of each site as the consensus sequence 53 | refSeq <- vapply( 54 | X = msaNumbering, 55 | FUN = function(s) { 56 | aaSummary <- tableAA(align, s - 1) 57 | # The amino acid/nucleotide having the most appearance 58 | names(aaSummary)[which.max(aaSummary)] 59 | }, 60 | FUN.VALUE = character(1) 61 | ) 62 | align <- strsplit(x = align, split = "") 63 | } else { 64 | align <- strsplit(x = align, split = "") 65 | refSeq <- align[[refSeqName]] 66 | } 67 | # Get the amino acid/nucleotide of each locus 68 | allSNP <- lapply(attr(x, "loci"), function(site) { 69 | snp <- vapply( 70 | X = align, 71 | FUN = "[[", 72 | i = msaNumbering[site], 73 | FUN.VALUE = character(1) 74 | ) 75 | # An SNP has to be different from the reference and not gap or ambiguous 76 | # character 77 | snp <- 78 | snp[which(snp != refSeq[[site]] & snp %in% unambiguous)] 79 | res <- data.frame( 80 | "Accession" = names(snp), 81 | "Pos" = rep(site, length(snp)), 82 | "SNP" = snp 83 | ) 84 | return(res) 85 | }) 86 | allSNP <- do.call(rbind, allSNP) 87 | # Calculate the frequency of each mutation/SNP 88 | snpSummary <- as.data.frame(table(allSNP[["Pos"]], 89 | allSNP[["SNP"]])) 90 | allSNP <- merge( 91 | x = allSNP, 92 | y = snpSummary, 93 | by.x = c("Pos", "SNP"), 94 | by.y = c("Var1", "Var2"), 95 | all.x = TRUE 96 | ) 97 | # Filter out low frequency mutation/SNP 98 | allSNP <- allSNP[which(allSNP[, "Freq"] >= minSNP), 99 | c("Accession", "Pos", "SNP")] 100 | rownames(allSNP) <- NULL 101 | # Extract all the qualified sites as 'res' to be compatible with the return 102 | # of previous version 103 | res <- sort(unique(allSNP[["Pos"]])) 104 | attr(res, "allSNP") <- allSNP 105 | # Transfer attributes 106 | attr(res, "phyMSAmatched") <- x 107 | class(res) <- "SNPsites" 108 | return(res) 109 | } 110 | -------------------------------------------------------------------------------- /src/treemer.h: -------------------------------------------------------------------------------- 1 | /* 2 | * Achieve a tip-to-root trimming behavior for a phylogenetic tree. The basic 3 | * idea is to start at the terminal node for each tree tip and move the node 4 | * towards the root node. Only when more than two nodes meet each other along 5 | * the way do the nodes make a real move otherwise the node stays. 6 | * 7 | * The tree is represented by tipPaths (from the R function "nodepath" in "ape" 8 | * package). The algorithm is performed on the tipPaths along with the 9 | * corresponding aligned sequences for each tip. 10 | */ 11 | 12 | #ifndef SITEPATH_TREEMER_H 13 | #define SITEPATH_TREEMER_H 14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | 21 | namespace Treemer { 22 | 23 | float compare(const std::string &query, const std::string &subject); 24 | 25 | /* 26 | * To hold all necessary data for a tree tip, including sequence, nodePath to 27 | * the tree root and the current node during treemer. Provides way to store 28 | * index of the 'kissed' node on the nodePath. 29 | */ 30 | class TipSeqLinker { 31 | public: 32 | TipSeqLinker( 33 | const Rcpp::CharacterVector &sequence, 34 | const Rcpp::IntegerVector &tipPath 35 | ); 36 | void reset(); 37 | void proceed(); 38 | int nextClade() const; 39 | int currentClade() const; 40 | int getTip() const; 41 | int getRoot() const; 42 | int getSeqLen() const; 43 | Rcpp::IntegerVector getPath() const; 44 | std::string getSeq() const; 45 | char siteChar(const int siteIndex) const; 46 | private: 47 | const std::string m_seq; 48 | const Rcpp::IntegerVector m_path; 49 | const int m_tipIndex; 50 | int m_cIndex; 51 | }; 52 | 53 | typedef std::vector tips; 54 | typedef std::map clusters; 55 | 56 | /* 57 | * The Base class provides the structure for performing the trimming algorithm. 58 | * Depending on the extra constrain for cluster of tips, the class can be 59 | * derived to achieve a variety of behaviors. 60 | * 61 | * The extra constrain could be the largest similarity difference within a 62 | * cluster, the amino acid of the site and the average similarity. 63 | */ 64 | class Base { 65 | public: 66 | Base(const tips &tips, const clusters initClusters); 67 | virtual ~Base(); 68 | // Cluster of tree tips after trimming 69 | std::map< int, std::vector > getTips() const; 70 | protected: 71 | // The actual trimming process happens here 72 | void pruneTree(); 73 | // Whether the cluster of tips satisfy the extra constrain to be valid 74 | virtual bool qualified(const clusters::iterator &clusters_it) const = 0; 75 | protected: 76 | // The container to hold all the tips 77 | const tips m_tips; 78 | // The clustering of tips during trimming process 79 | clusters m_clusters; 80 | }; 81 | 82 | typedef std::map siteClusters; 83 | 84 | /* 85 | * Trim the tree by aa/nt of a site. The trimming stops when the non-dominant 86 | * aa/nt in a group is greater than the SNP percentage threshold 87 | */ 88 | class BySite: public Base { 89 | public: 90 | BySite( 91 | const tips &tips, 92 | const clusters initClusters, 93 | const int siteIndex 94 | ); 95 | // Cluster of TipSeqLinker grouped by aa/nt after trimming 96 | siteClusters getSiteClusters() const; 97 | private: 98 | bool qualified(const clusters::iterator &clusters_it) const; 99 | private: 100 | const int m_siteIndex; 101 | }; 102 | 103 | /* 104 | * Trim the tree and stop the process when the largest similarity difference in 105 | * each tip cluster is about to exceed the constrain 106 | */ 107 | class BySimilarity: public Base { 108 | public: 109 | BySimilarity( 110 | const tips &tips, 111 | const clusters initClusters, 112 | const float simThreshold, 113 | std::map, float> &simMatrix 114 | ); 115 | protected: 116 | // The constrain of the largest similarity difference 117 | const float m_simCut; 118 | // Keep a record of pair-wise similarity to avoid repeating computation 119 | std::map, float> *m_compared; 120 | private: 121 | bool qualified(const clusters::iterator &clusters_it) const; 122 | }; 123 | 124 | } 125 | 126 | #endif /* SITEPATH_TREEMER_H */ 127 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "sitePath: phylogeny-based sequence clustering using site polymorphism" 3 | output: github_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set( 8 | fig.path = "man/figures/" 9 | ) 10 | ``` 11 | 12 | The below demonstrates the result of phylogeny-based sequence clustering for a H3N2 virus dataset (included in the package) 13 | 14 | ```{r example} 15 | library(sitePath) 16 | 17 | data(h3n2_align) # load the H3N2 sequences 18 | data(h3n2_tree) # load the corresponding phylogenetic tree 19 | 20 | options(list("cl.cores" = 10)) # Use 10 cores for multiprocessing 21 | 22 | paths <- lineagePath(h3n2_tree, alignment = h3n2_align, Nmin = 0.05) 23 | minEntropy <- sitesMinEntropy(paths) 24 | 25 | p1 <- plotSingleSite(paths, site = 208) # The site polymorphism of site 208 on the tree 26 | p2 <- plotSingleSite(minEntropy, site = 208) # The result of clustering using site 208 27 | gridExtra::grid.arrange(p1, p2, ncol = 2) 28 | ``` 29 | 30 | ```{r extractTips} 31 | grp1 <- extractTips(paths, 208) # Grouping result using site polymorphism only 32 | grp2 <- extractTips(minEntropy, 208) # Phylogeny-based clustering result 33 | ``` 34 | 35 | # Installation 36 | 37 | [R programming language](https://cran.r-project.org/) \>= 4.1.0 is required to use `sitePath`. 38 | 39 | The stable release is available on [Bioconductor](https://bioconductor.org/packages/sitePath/). 40 | 41 | ``` r 42 | if (!requireNamespace("BiocManager", quietly = TRUE)) 43 | install.packages("BiocManager") 44 | 45 | BiocManager::install("sitePath") 46 | ``` 47 | 48 | The installation from [GitHub](https://github.com/wuaipinglab/sitePath/) is in experimental stage but gives the newest feature: 49 | 50 | ``` r 51 | if (!requireNamespace("remotes", quietly = TRUE)) 52 | install.packages("remotes") 53 | 54 | remotes::install_github("wuaipinglab/sitePath") 55 | ``` 56 | 57 | # QuickStart 58 | 59 | The following is a quick tutorial on how to use `sitePath` to find fixation and parallel sites including how to import data, run analysis and visualization of the results. 60 | 61 | ## 1. Data preparation 62 | 63 | You need a *tree* and a *MSA* (multiple sequence alignment) file and the sequence names have to be matched! 64 | 65 | ```{r data_prep} 66 | library(sitePath) # Load the sitePath package 67 | 68 | # The path to your tree and MSA files 69 | tree_file <- system.file("extdata", "ZIKV.newick", package = "sitePath") 70 | alignment_file <- system.file("extdata", "ZIKV.fasta", package = "sitePath") 71 | 72 | 73 | tree <- read.tree(tree_file) # Read the tree file into R 74 | align <- read.alignment(alignment_file, format = "fasta") # Read the MSA file into R 75 | 76 | ``` 77 | 78 | ## 2. Run analysis 79 | 80 | `Nmin` and `minSNP` are the respective parameters for finding fixation and parallel sites (18 and 1 are used as an example for this dataset). The default values will be used if you don't specify them. 81 | 82 | ```{r run_analysis} 83 | options(list("cl.cores" = 1)) # Set this bigger than 1 to use multiprocessing 84 | 85 | paraFix <- paraFixSites(tree, alignment = align, Nmin = 18, minSNP = 1) # Find paraFix sites 86 | paraFix 87 | ``` 88 | 89 | ## 3. Fixation sites 90 | 91 | Use `allSitesName` and set `type` as "fixation" to retrieve fixation sites name 92 | 93 | ```{r fixSites_name} 94 | allSitesName(paraFix, type = "fixation") 95 | ``` 96 | 97 | Use `plotFixationSites` to view fixation sites 98 | 99 | ```{r plot_fixSites} 100 | plotFixationSites(paraFix) # View all fixation sites on the tree 101 | plotFixationSites(paraFix, site = 139) # View a single site 102 | 103 | ``` 104 | 105 | ## 4. Parallel sites 106 | 107 | Use `allSitesName` and set `type` as "parallel" to retrieve parallel sites name 108 | 109 | ```{r paraSites_name} 110 | allSitesName(paraFix, type = "parallel") 111 | ``` 112 | 113 | Use `plotParallelSites` to view parallel sites 114 | 115 | ```{r} 116 | plotParallelSites(paraFix) # View all parallel sites on the tree 117 | plotParallelSites(paraFix, site = 105) # View a single site 118 | ``` 119 | 120 | # Read more 121 | 122 | The above uses wrapper functions but the analysis can be dissembled into step functions (so you can view the result of each step and modify parameters). Click [here](https://wuaipinglab.github.io/sitePath/articles/sitePath.html) for a more detailed tutorial. 123 | 124 | # Getting help 125 | 126 | Post on Bioconductor [support site](https://support.bioconductor.org/) if having trouble using `sitePath`. Or open an [issue](https://github.com/wuaipinglab/sitePath/issues/new?assignees=&labels=&template=bug_report.md&title=) if a bug is found. 127 | -------------------------------------------------------------------------------- /man/lineagePath.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/lineagePath.R 3 | \name{lineagePath} 4 | \alias{lineagePath} 5 | \alias{lineagePath.phylo} 6 | \alias{lineagePath.treedata} 7 | \alias{lineagePath.phyMSAmatched} 8 | \alias{sneakPeek} 9 | \alias{lineagePath.sneakPeekedPaths} 10 | \alias{lineagePath.paraFixSites} 11 | \title{Resolving lineage paths using SNP} 12 | \usage{ 13 | lineagePath(tree, similarity, ...) 14 | 15 | \method{lineagePath}{phylo}( 16 | tree, 17 | similarity = NULL, 18 | alignment = NULL, 19 | seqType = c("AA", "DNA", "RNA"), 20 | reference = NULL, 21 | gapChar = "-", 22 | minSkipSize = NULL, 23 | ... 24 | ) 25 | 26 | \method{lineagePath}{treedata}(tree, ...) 27 | 28 | \method{lineagePath}{phyMSAmatched}( 29 | tree, 30 | similarity = NULL, 31 | simMatrix = NULL, 32 | forbidTrivial = TRUE, 33 | ... 34 | ) 35 | 36 | sneakPeek(tree, step = 9, maxPath = NULL, minPath = 0, makePlot = TRUE) 37 | 38 | \method{lineagePath}{sneakPeekedPaths}(tree, similarity, ...) 39 | 40 | \method{lineagePath}{paraFixSites}(tree, similarity = NULL, ...) 41 | } 42 | \arguments{ 43 | \item{tree}{The return from \code{\link{addMSA}} or \code{sneakPeek} 44 | function.} 45 | 46 | \item{similarity}{The parameter for identifying phylogenetic pathway using SNP. If 47 | provided as fraction between 0 and 1, then the minimum number of SNP will 48 | be total tips times \code{Nmin}. If provided as integer greater than 1, the 49 | minimum number will be \code{Nmin}.} 50 | 51 | \item{...}{Other arguments.} 52 | 53 | \item{alignment}{An \code{alignment} object. This commonly can be from 54 | sequence parsing function in the \code{\link{seqinr}} package. Sequence 55 | names in the alignment should include all \code{tip.label} in the tree} 56 | 57 | \item{seqType}{The type of the sequence in the alignment file. The default is 58 | "AA" for amino acid. The other options are "DNA" and "RNA".} 59 | 60 | \item{reference}{Name of reference for site numbering. The name has to be one 61 | of the sequences' name. The default uses the intrinsic alignment numbering} 62 | 63 | \item{gapChar}{The character to indicate gap. The numbering will skip the 64 | \code{gapChar} for the reference sequence.} 65 | 66 | \item{minSkipSize}{The minimum number of tips to have gap or ambiguous amino 67 | acid/nucleotide for a site to be ignored in other analysis. This will not 68 | affect the numbering. The default is 0.8.} 69 | 70 | \item{simMatrix}{Deprecated and will not have effect.} 71 | 72 | \item{forbidTrivial}{Does not allow trivial trimming.} 73 | 74 | \item{step}{the 'similarity' window for calculating and plotting. To better 75 | see the impact of threshold on path number. The default is 10.} 76 | 77 | \item{maxPath}{maximum number of path to return show in the plot. The number 78 | of path in the raw tree can be far greater than trimmed tree. To better see 79 | the impact of threshold on path number. This is preferably specified. The 80 | default is one 20th of tree tip number.} 81 | 82 | \item{minPath}{minimum number of path to return show in the plot. To better 83 | see the impact of threshold on path number. The default is 1.} 84 | 85 | \item{makePlot}{Whether make a plot when return.} 86 | } 87 | \value{ 88 | Lineage path represent by node number. 89 | 90 | \code{sneakPeek} return the similarity threhold against number of 91 | lineagePath. There will be a simple dot plot between threshold and path 92 | number if \code{makePlot} is TRUE. 93 | } 94 | \description{ 95 | \code{lineagePath} finds the lineages of a phylogenetic tree 96 | providing the corresponding sequence alignment. This is done by finding 97 | 'major SNPs' which usually accumulate along the evolutionary pathways. 98 | 99 | \code{sneakPeek} is intended to plot 'similarity' (actually the 100 | least percentage of 'major SNP') as a threshold against number of output 101 | lineagePath. This plot is intended to give user a rough view about how many 102 | lineages they could expect from the 'similarity' threshold in the function 103 | \code{\link{lineagePath}}. The number of lineagePath is preferably not be 104 | too many or too few. The result excludes where the number of lineagePath is 105 | greater than number of tips divided by 20 or user-defined maxPath. The zero 106 | lineagePath result will also be excluded. 107 | 108 | When used on the return of \code{sneakPeek}, a 109 | \code{lineagePath} with the closest \code{similarity} will be retrieved 110 | from the returned value. 111 | 112 | \code{similarity} has no effect when using on 113 | \code{\link{paraFixSites}} object 114 | } 115 | \examples{ 116 | data('zikv_tree') 117 | data('zikv_align') 118 | tree <- addMSA(zikv_tree, alignment = zikv_align) 119 | lineagePath(tree) 120 | sneakPeek(tree, step = 3) 121 | x <- sneakPeek(tree, step = 3) 122 | lineagePath(x, similarity = 0.05) 123 | } 124 | -------------------------------------------------------------------------------- /src/minEntropy.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | #include "minEntropy.h" 7 | 8 | float MinEntropy::shannonEntropy( 9 | const aaSummary &values, 10 | const unsigned int tipNum 11 | ) { 12 | float res = 0.0; 13 | for ( 14 | aaSummary::const_iterator it = values.begin(); 15 | it != values.end(); ++it 16 | ) { 17 | // Probability of drawing distinct AA 18 | float p = it->second / static_cast(tipNum); 19 | res -= p * std::log(p); 20 | } 21 | return res; 22 | } 23 | 24 | Rcpp::ListOf MinEntropy::updatedSegmentation( 25 | const Rcpp::ListOf &nodeSummaries, 26 | const segment &final 27 | ) { 28 | std::vector res; 29 | // Keep track of previous AA to see if the segment can be combined 30 | std::string prevFixedAA = ""; 31 | // Summarize the AA if combined 32 | aaSummary combNode; 33 | // Group the tips if combined 34 | std::vector combTips; 35 | // Iterate segment points and grouping tips 36 | segIndex start = 0; 37 | // To get the ancestral nodes of the raw groups 38 | Rcpp::CharacterVector ancestralNodes = nodeSummaries.names(); 39 | // Index of the ancestral node of the group 40 | segIndex aNodeIndex = 0; 41 | // To find the segmentation index of current group, there is a need to look 42 | // back at the last segmentation index that separate two groups with 43 | // different dominant AA 44 | unsigned int indexShift = 1; 45 | // Ancestral node of the group 46 | std::string aNode = Rcpp::as(ancestralNodes.at(0)); 47 | for ( 48 | segment::const_iterator final_itr = final.begin(); 49 | final_itr != final.end(); ++final_itr 50 | ) { 51 | std::vector tips; // To store and grow the group 52 | aaSummary node; // To summarize the amino acids of the group 53 | for (unsigned int i = start; i < *final_itr; ++i) { 54 | Rcpp::IntegerVector nodeTips = nodeSummaries[i]; 55 | // Group the tips with a segment 56 | tips.insert(tips.end(), nodeTips.begin(), nodeTips.end()); 57 | // Summarize the AA within a segment 58 | Rcpp::IntegerVector summary = nodeTips.attr("aaSummary"); 59 | Rcpp::CharacterVector aa = summary.names(); 60 | for (int j = 0; j < aa.size(); ++j) { 61 | node[Rcpp::as(aa.at(j))] += summary.at(j); 62 | } 63 | } 64 | // Find the most dominant AA in the current segment 65 | aaSummary::iterator node_itr = node.begin(); 66 | std::string fixedAA = node_itr->first; 67 | int maxFreq = node_itr->second; 68 | for (++node_itr; node_itr != node.end(); ++node_itr) { 69 | if (node_itr->second > maxFreq) { 70 | fixedAA = node_itr->first; 71 | maxFreq = node_itr->second; 72 | } 73 | } 74 | // If the most dominant AA is the same as the previous segment. 75 | if (fixedAA == prevFixedAA) { 76 | // Combine with the previous segment 77 | for (node_itr = node.begin(); node_itr != node.end(); ++node_itr) { 78 | combNode[node_itr->first] += node_itr->second; 79 | } 80 | combTips.insert(combTips.end(), tips.begin(), tips.end()); 81 | indexShift++; 82 | } else { 83 | // Add the previous combination as a new segment 84 | Rcpp::IntegerVector combined = Rcpp::wrap(combTips); 85 | combined.attr("aaSummary") = Rcpp::wrap(combNode); 86 | combined.attr("AA") = prevFixedAA; 87 | combined.attr("node") = aNode; 88 | res.push_back(combined); 89 | // Initiate a new segment to be combined 90 | combTips = tips; 91 | combNode = node; 92 | // Find the segmentation index for the next group 93 | aNodeIndex = (final_itr == final.begin()) ? 0 : *(final_itr - indexShift); 94 | aNode = Rcpp::as(ancestralNodes.at(aNodeIndex)); 95 | indexShift = 1; 96 | } 97 | // Update the starting segment point for the next segment 98 | start = *final_itr; 99 | // Update the previous fixed AA 100 | prevFixedAA = fixedAA; 101 | } 102 | // Add the last segment to the return list 103 | Rcpp::IntegerVector combined = Rcpp::wrap(combTips); 104 | combined.attr("aaSummary") = Rcpp::wrap(combNode); 105 | combined.attr("AA") = prevFixedAA; 106 | combined.attr("node") = aNode; 107 | res.push_back(combined); 108 | // The first combination of segments is actually empty 109 | res.erase(res.begin()); 110 | Rcpp::ListOf res2 = Rcpp::wrap(res); 111 | res2.attr("final") = Rcpp::wrap(final); 112 | res2.attr("nodeSummaries") = nodeSummaries.names(); 113 | return res2; 114 | } 115 | -------------------------------------------------------------------------------- /tests/testthat/test-fixationSites.R: -------------------------------------------------------------------------------- 1 | test_ambiguousIgnored <- function(fixSites, 2 | tree, 3 | alignment, 4 | gapChar, 5 | minEffectiveSize) { 6 | paths <- attr(fixSites, "paths") 7 | tipNames <- tree[["tip.label"]] 8 | nTips <- length(tipNames) 9 | p <- attr(fixSites, "paths") 10 | if (is.null(minEffectiveSize)) { 11 | minEffectiveSize <- attr(paths, "minSize") 12 | } 13 | # Get the divergent nodes 14 | divNodes <- sitePath:::divergentNode(paths) 15 | # The tips and matching 16 | pathNodeTips <- 17 | sitePath:::.tipSeqsAlongPathNodes(paths, divNodes) 18 | # In case root node does not have any tips 19 | excludedNodes <- divNodes 20 | rootNode <- attr(paths, "rootNode") 21 | if (!rootNode %in% names(pathNodeTips)) { 22 | excludedNodes <- c(rootNode, excludedNodes) 23 | } 24 | pathsWithSeqs <- lapply(paths, function(path) { 25 | path <- as.character(setdiff(path, excludedNodes)) 26 | names(path) <- path 27 | pathNodeAlign <- pathNodeTips[path] 28 | attr(pathNodeAlign, "pathTipNum") <- 29 | sum(lengths(pathNodeAlign)) 30 | return(pathNodeAlign) 31 | }) 32 | pathTipNums <- 33 | vapply(pathsWithSeqs, attr, integer(1), "pathTipNum") 34 | minEffectiveSize <- 35 | ceiling(minEffectiveSize * (min(pathTipNums) / max(pathTipNums))) 36 | # Get all the unambiguous character 37 | if (attr(p, "seqType") == "AA") { 38 | unambiguous <- setdiff(sitePath:::AA_UNAMBIGUOUS, gapChar) 39 | } else { 40 | unambiguous <- setdiff(sitePath:::NT_UNAMBIGUOUS, gapChar) 41 | } 42 | for (sp in fixSites) { 43 | for (mp in sp) { 44 | # The number of tips should be less than total tree tips 45 | expect_lte(sum(lengths(mp)), nTips) 46 | # AA/NT summary of the tips on the 'mutPath' 47 | aa <- lapply(mp, function(tips) { 48 | site <- attr(sp, "site") 49 | matchIndex <- 50 | which(alignment[["nam"]] %in% tipNames[tips]) 51 | sum <- alignment[["seq"]][matchIndex] 52 | sum <- table(sapply(sum, substring, site, site)) 53 | }) 54 | unambiguousNum <- 0 55 | for (i in seq_along(mp)) { 56 | # The tips in each group before or after fixation mutation 57 | nodeTips <- mp[[i]] 58 | # Tips should be unique and more than 'minEffectiveSize' 59 | expect_equal(sort(nodeTips), sort(unique(nodeTips))) 60 | # The dominant AA/NT should be the fixed one 61 | aaD <- toupper(names(which.max(aa[[i]]))) 62 | fixedAA <- attr(nodeTips, "AA") 63 | attributes(fixedAA) <- NULL 64 | if (fixedAA %in% unambiguous) { 65 | unambiguousNum <- unambiguousNum + 1 66 | } 67 | if (is.null(attr(nodeTips, "toMerge"))) { 68 | expect_gte(length(nodeTips), minEffectiveSize) 69 | expect_equal(fixedAA, aaD) 70 | } 71 | } 72 | expect_gte(unambiguousNum, 2) 73 | } 74 | } 75 | } 76 | 77 | test_that("The function works for amino acid", { 78 | data(h3n2_tree_reduced) 79 | data(h3n2_align_reduced) 80 | tr <- addMSA(tree = h3n2_tree_reduced, 81 | alignment = h3n2_align_reduced) 82 | p <- lineagePath(tr) 83 | # Set an arbitrary gap character 84 | gapChar <- "R" 85 | p <- setSiteNumbering(p, gapChar = gapChar) 86 | # Test the input of 'minEffectiveSize' and 'searchDepth' 87 | expect_error(fixationSites(paths = p, minEffectiveSize = -1)) 88 | expect_error(fixationSites(paths = p, minEffectiveSize = "3")) 89 | expect_error(fixationSites(paths = p, searchDepth = -1)) 90 | # Here comes the real deal 91 | fixSites <- fixationSites(p) 92 | minEntropy <- sitesMinEntropy(p) 93 | # Test the two function give the same result 94 | expect_identical(fixSites, fixationSites(minEntropy)) 95 | expect_false(any(duplicated(unlist( 96 | attr(fixSites, "clustersByPath") 97 | )))) 98 | # Test the number of tips before and after each fixation mutation is enough 99 | test_ambiguousIgnored(fixSites, 100 | h3n2_tree_reduced, 101 | h3n2_align_reduced, 102 | gapChar, 103 | NULL) 104 | }) 105 | 106 | test_that("The function works for nucleotide", { 107 | data(sars2_align) 108 | data(sars2_tree) 109 | tr <- addMSA(sars2_tree, 110 | alignment = sars2_align, 111 | seqType = "DNA") 112 | # Set an arbitrary gap character 113 | gapChar <- "G" 114 | tr <- setSiteNumbering(tr, gapChar = gapChar) 115 | p <- lineagePath(tr) 116 | fixSites <- fixationSites(p) 117 | minEntropy <- sitesMinEntropy(p) 118 | # Test the two function give the same result 119 | expect_identical(fixSites, fixationSites(minEntropy)) 120 | # Test the number of tips before and after each fixation mutation is enough 121 | test_ambiguousIgnored(fixSites, 122 | sars2_tree, 123 | sars2_align, 124 | gapChar, 125 | NULL) 126 | }) 127 | -------------------------------------------------------------------------------- /R/extractTips.R: -------------------------------------------------------------------------------- 1 | #' @rdname extractTips 2 | #' @title Extract grouped tips for a single site 3 | #' @description The result of \code{\link{fixationSites}} and \code{sitePath} 4 | #' contains all the possible sites with fixation mutation. The function 5 | #' \code{extractTips} retrieves the name of the tips involved in the fixation. 6 | #' @param x A \code{fixationSites} or a \code{sitePath} object. 7 | #' @param site A site predicted to experience fixation. 8 | #' @param select For a site, there theoretically might be more than one fixation 9 | #' on different lineages. You may use this argument to extract for a specific 10 | #' fixation of a site. The default is the first fixation of the site. 11 | #' @param ... Other arguments 12 | #' @return Tree tips grouped as \code{\link{list}} 13 | #' @export 14 | #' @examples 15 | #' data(zikv_tree_reduced) 16 | #' data(zikv_align_reduced) 17 | #' tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 18 | #' mutations <- fixationSites(lineagePath(tree)) 19 | #' extractTips(mutations, 139) 20 | extractTips <- function(x, ...) { 21 | UseMethod("extractTips") 22 | } 23 | 24 | #' @rdname extractTips 25 | #' @description For \code{\link{lineagePath}}, the function \code{extractTips} 26 | #' groups all the tree tips according to the amino acid/nucleotide of the 27 | #' \code{site}. 28 | #' @export 29 | extractTips.lineagePath <- function(x, site, ...) { 30 | site <- .checkSite(site) 31 | align <- attr(x, "align") 32 | align <- strsplit(tolower(align), "") 33 | reference <- attr(x, "msaNumbering") 34 | # Get the site index of the alignment 35 | tryCatch( 36 | expr = site <- reference[[site]], 37 | error = function(e) { 38 | stop("The site: ", 39 | attr(x, "site"), 40 | "is not within the length of sequence alignment.") 41 | } 42 | ) 43 | # Group the tree tips by amino acid/nucleotide of the site 44 | group <- list() 45 | for (tipName in names(align)) { 46 | siteChar <- align[[tipName]][site] 47 | group[[siteChar]] <- c(group[[siteChar]], tipName) 48 | } 49 | return(group) 50 | } 51 | 52 | #' @rdname extractTips 53 | #' @export 54 | extractTips.sitesMinEntropy <- function(x, site, ...) { 55 | tree <- as.phylo.sitesMinEntropy(x) 56 | site <- as.character(site) 57 | sitePaths <- lapply(x, function(segs) { 58 | pathNodeTips <- attr(segs, "pathNodeTips") 59 | res <- segs[site] 60 | attr(res, "pathNodeTips") <- pathNodeTips 61 | return(res) 62 | }) 63 | clustersByPath <- .mergeClusters(lapply(sitePaths, 64 | .clusterByFixation)) 65 | res <- list() 66 | for (gp in clustersByPath) { 67 | for (tips in gp) { 68 | attrs <- attributes(tips)[c("AA", "node")] 69 | tips <- tree[["tip.label"]][tips] 70 | attributes(tips) <- attrs 71 | res <- c(res, list(tips)) 72 | } 73 | } 74 | return(res) 75 | } 76 | 77 | #' @rdname extractTips 78 | #' @export 79 | extractTips.fixationSites <- function(x, 80 | site, 81 | select = 1, 82 | ...) { 83 | sp <- extractSite.fixationSites(x, site) 84 | return(extractTips.sitePath(sp, select)) 85 | } 86 | 87 | #' @rdname extractTips 88 | #' @export 89 | extractTips.sitePath <- function(x, select = 1, ...) { 90 | tree <- attr(x, "tree") 91 | if (select <= 0 || as.integer(select) != select) { 92 | stop("Please enter a single positive integer for \"select\"") 93 | } 94 | tryCatch( 95 | expr = x <- x[[select]], 96 | error = function(e) { 97 | if (length(select)) 98 | stop( 99 | "The site: ", 100 | attr(x, "site"), 101 | " has ", 102 | length(x), 103 | " fixation(s). Please choose a number from 1 to ", 104 | length(x), 105 | " for \"select\"." 106 | ) 107 | } 108 | ) 109 | res <- list() 110 | for (i in x) { 111 | aa <- attr(i, "AA") 112 | attributes(aa) <- NULL 113 | i <- tree[["tip.label"]][i] 114 | attr(i, "AA") <- aa 115 | res <- c(res, list(i)) 116 | } 117 | return(res) 118 | } 119 | 120 | #' @rdname extractTips 121 | #' @description For \code{\link{parallelSites}} and \code{sitePara} object, the 122 | #' function \code{extractTips} retrieve all the tips with parallel mutation. 123 | #' @export 124 | extractTips.parallelSites <- function(x, site, ...) { 125 | parallelMut <- extractSite.parallelSites(x, site) 126 | return(extractTips.sitePara(parallelMut)) 127 | } 128 | 129 | #' @rdname extractTips 130 | #' @export 131 | extractTips.sitePara <- function(x, ...) { 132 | # The duplicated nodes should be removed because they are not truly parallel 133 | toRemoveNodes <- unique(unlist(lapply( 134 | X = x, 135 | FUN = function(mut) { 136 | nodes <- names(mut) 137 | nodes[duplicated(nodes)] 138 | } 139 | ))) 140 | # The parallel nodes in one pair of paths might be duplicated for the other 141 | # pair. And they should be remove too 142 | res <- list() 143 | for (mut in x) { 144 | for (node in names(mut)) { 145 | if (!node %in% toRemoveNodes) { 146 | res[[node]] <- mut[[node]] 147 | } 148 | } 149 | } 150 | return(res) 151 | } 152 | -------------------------------------------------------------------------------- /tests/testthat/test-siteNumbering.R: -------------------------------------------------------------------------------- 1 | test_referenceNumbering <- function(x, tip, refSeq, gapChar) { 2 | # The 'gapChar' can only be a single character 3 | expect_error(setSiteNumbering(x, tip, "--")) 4 | expect_error(setSiteNumbering(x, tip, c("-", "-"))) 5 | # Use an arbitrary character as the gap character 6 | x <- setSiteNumbering(x, tip, gapChar) 7 | reference <- attr(x, "msaNumbering") 8 | refSeqName <- attr(x, "reference") 9 | # The correct reference sequence name can be retrieved 10 | expect_equal(tip, refSeqName) 11 | # Get the numbering of the reference sequence 12 | refNumbering <- seq_along(reference) 13 | # Create an ungapped reference sequence and compared with the gapped 14 | # reference sequence using their own numbering schemes 15 | ungappedRefSeq <- gsub(gapChar, "", refSeq) 16 | # The amino acid/nucleotide are the same for each mapped site 17 | aaEqual <- sapply(refNumbering, function(n) { 18 | # Corresponding site index for the original sequence 19 | m <- reference[[n]] 20 | substr(ungappedRefSeq, n, n) == substr(refSeq, m, m) 21 | }) 22 | expect_true(all(aaEqual)) 23 | } 24 | 25 | test_siteSkipping <- function(x, tip, gapChar, minSkipSize) { 26 | # The skip size cannot be zero or negative 27 | expect_error(setSiteNumbering(x, tip, gapChar, 0)) 28 | expect_error(setSiteNumbering(x, tip, gapChar, -2)) 29 | # Use an arbitrary character as the gap character 30 | x <- setSiteNumbering(x, tip, gapChar, minSkipSize) 31 | # Get all the unambiguous character 32 | if (attr(x, "seqType") == "AA") { 33 | unambiguous <- setdiff(sitePath:::AA_UNAMBIGUOUS, gapChar) 34 | } else { 35 | unambiguous <- setdiff(sitePath:::NT_UNAMBIGUOUS, gapChar) 36 | } 37 | # The min number of tips having unambiguous characters for a site 38 | tipNum <- length(attr(x, "tree")[["tip.label"]]) 39 | if (minSkipSize < 1) { 40 | minUnambiguous <- tipNum - tipNum * minSkipSize 41 | } else { 42 | minUnambiguous <- tipNum - minSkipSize 43 | } 44 | # The aligned sequences, reference numbering and loci 45 | align <- attr(x, "align") 46 | reference <- attr(x, "msaNumbering") 47 | loci <- attr(x, "loci") 48 | # Test the loci meet the skip size threshold 49 | lociOK <- sapply(loci, function(i) { 50 | siteSummary <- sitePath:::tableAA(align, reference[i] - 1) 51 | siteChars <- names(siteSummary) 52 | # All the unambiguous characters in the site summary 53 | unambiguousChars <- 54 | siteChars[which(siteChars %in% unambiguous)] 55 | sum(siteSummary[unambiguousChars]) > minUnambiguous 56 | }) 57 | expect_true(all(lociOK)) 58 | # Test the non-loci doesn't meet the skip size threshold 59 | nonLociOK <- sapply( 60 | X = setdiff(seq_along(reference), loci), 61 | FUN = function(i) { 62 | siteSummary <- sitePath:::tableAA(align, reference[i] - 1) 63 | siteChars <- names(siteSummary) 64 | # All the unambiguous characters in the site summary 65 | unambiguousChars <- 66 | siteChars[which(siteChars %in% unambiguous)] 67 | # The site is skipped also when it's completely conserved 68 | if (length(unambiguousChars) == 1) { 69 | return(TRUE) 70 | } 71 | sum(siteSummary[unambiguousChars]) <= minUnambiguous 72 | } 73 | ) 74 | expect_true(all(nonLociOK)) 75 | } 76 | 77 | test_that("The function works for amino acid", { 78 | data(zikv_align) 79 | data(zikv_tree) 80 | tr <- addMSA(zikv_tree, alignment = zikv_align) 81 | tipNames <- zikv_tree[["tip.label"]] 82 | align <- attr(tr, "align") 83 | # Find the index of sequence which has gap character 84 | for (i in grep('-', align)) { 85 | # Select a tip as the arbitrary reference 86 | tip <- tipNames[[i]] 87 | refSeq <- 88 | toupper(zikv_align[["seq"]][[which(zikv_align[["nam"]] == tip)]]) 89 | # The reference sequence from the raw data and processed data should be 90 | # the same 91 | expect_identical(refSeq, align[[i]]) 92 | # Test the numbering on 'phyMSAmatched' object 93 | test_referenceNumbering(tr, tip, refSeq, '-') 94 | test_siteSkipping(tr, tip, "C", 0.8) 95 | # Test the numbering on the 'lineagePath' object 96 | p <- lineagePath(tr) 97 | test_referenceNumbering(p, tip, refSeq, '-') 98 | test_siteSkipping(p, tip, "G", 0.9) 99 | } 100 | }) 101 | 102 | test_that("The function works for nucleotide", { 103 | data(sars2_align) 104 | data(sars2_tree) 105 | tr <- addMSA(sars2_tree, 106 | alignment = sars2_align, 107 | seqType = "DNA") 108 | tipNames <- sars2_tree[["tip.label"]] 109 | align <- attr(tr, "align") 110 | # Find the index of sequence which has gap character 111 | for (i in grep('-', align)) { 112 | # Select a tip as the arbitrary reference 113 | tip <- tipNames[[i]] 114 | refSeq <- 115 | toupper(sars2_align[["seq"]][[which(sars2_align[["nam"]] == tip)]]) 116 | # The reference sequence from the raw data and processed data should be 117 | # the same 118 | expect_identical(refSeq, align[[i]]) 119 | # Test the numbering on 'phyMSAmatched' object 120 | test_referenceNumbering(tr, tip, refSeq, '-') 121 | test_siteSkipping(tr, tip, "C", 0.8) 122 | # Test the numbering on the 'lineagePath' object 123 | p <- lineagePath(tr) 124 | test_referenceNumbering(p, tip, refSeq, '-') 125 | test_siteSkipping(p, tip, "G", 0.9) 126 | } 127 | }) 128 | -------------------------------------------------------------------------------- /src/searchNode.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | #include "minEntropy.h" 6 | 7 | // Empty constructor for derived class 8 | MinEntropy::TreeSearchNode::TreeSearchNode(): m_qualified(true) {} 9 | 10 | MinEntropy::TreeSearchNode::~TreeSearchNode() {} 11 | 12 | MinEntropy::TreeSearchNode::TreeSearchNode( 13 | const segment &used, 14 | const std::vector &aaSummaries, 15 | const unsigned int minEffectiveSize 16 | ): 17 | m_used(used), 18 | m_qualified(true) { 19 | // Calculate the total entropy upon instantiating 20 | m_entropy = this->totalEntropy(aaSummaries, minEffectiveSize); 21 | } 22 | 23 | MinEntropy::segment MinEntropy::TreeSearchNode::getUsed() const { 24 | return m_used; 25 | } 26 | 27 | float MinEntropy::TreeSearchNode::getEntropy() const { 28 | return m_entropy; 29 | } 30 | 31 | bool MinEntropy::TreeSearchNode::isQualified() const { 32 | return m_qualified; 33 | } 34 | 35 | float MinEntropy::TreeSearchNode::totalEntropy( 36 | const std::vector &aaSummaries, 37 | const unsigned int minEffectiveSize 38 | ) { 39 | float res = 0.0; 40 | segIndex start = 0; 41 | // Iterate through all the used segment points 42 | for ( 43 | segment::const_iterator m_used_itr = m_used.begin(); 44 | m_used_itr != m_used.end(); ++m_used_itr 45 | ) { 46 | unsigned int tipNum = 0; 47 | aaSummary values; 48 | // Iterate throught the tree nodes between the segment points 49 | for (unsigned int i = start; i < *m_used_itr; ++i) { 50 | // Get the tree node and its summary on amino acid of tips. And 51 | // combine them into a segment 52 | const aaSummary toBeCombined = aaSummaries.at(i); 53 | for ( 54 | aaSummary::const_iterator it = toBeCombined.begin(); 55 | it != toBeCombined.end(); ++it 56 | ) { 57 | values[it->first] += it->second; 58 | tipNum += it->second; 59 | } 60 | } 61 | // The search node will be disqualified if the number of tips in the 62 | // segment is lower than the constrain 63 | if (tipNum < minEffectiveSize) { m_qualified = false; } 64 | // TODO: tipNum can be used as total in calculating entropy 65 | res += shannonEntropy(values, tipNum); 66 | // Update the starting segment point 67 | start = *m_used_itr; 68 | } 69 | return res; 70 | } 71 | 72 | MinEntropy::Segmentor::Segmentor( 73 | const segment &all, 74 | const segment &terminal, 75 | const std::vector &aaSummaries, 76 | const unsigned int minEffectiveSize 77 | ): 78 | MinEntropy::TreeSearchNode(terminal, aaSummaries, minEffectiveSize), 79 | m_open(all) {} 80 | 81 | MinEntropy::Segmentor::Segmentor( 82 | const Segmentor *parent, 83 | const unsigned int i, 84 | const std::vector &aaSummaries, 85 | const unsigned int minEffectiveSize 86 | ): 87 | MinEntropy::TreeSearchNode() { 88 | // List-initialization is only supported since c++11 89 | m_used = this->newUsed(parent, i); 90 | m_open = this->newOpen(parent, i); 91 | m_entropy = this->totalEntropy(aaSummaries, minEffectiveSize); 92 | } 93 | 94 | unsigned int MinEntropy::Segmentor::getOpenSize() const { 95 | // This is just for iterating the open list 96 | return m_open.size(); 97 | } 98 | 99 | bool MinEntropy::Segmentor::isEndNode() const { 100 | // The search should be forced to end when the open list is empty 101 | return (m_open.empty()) ? true : false; 102 | } 103 | 104 | MinEntropy::segment MinEntropy::Segmentor::newUsed( 105 | const Segmentor *parent, 106 | const unsigned int i 107 | ) const { 108 | segment res = parent->m_used; 109 | // Add the "i"th segment point in the parent's open list to the parent's 110 | // used list 111 | res.push_back(parent->m_open.at(i)); 112 | // Sort to make sure the segment points are in order 113 | std::sort(res.begin(), res.end()); 114 | return res; 115 | } 116 | 117 | MinEntropy::segment MinEntropy::Segmentor::newOpen( 118 | const Segmentor *parent, 119 | const unsigned int i 120 | ) const { 121 | segment res = parent->m_open; 122 | // Erase the "i"th segment point in the parent's open list 123 | res.erase(res.begin() + i); 124 | return res; 125 | } 126 | 127 | MinEntropy::Amalgamator::Amalgamator( 128 | const segment &withTerminal, 129 | const std::vector &aaSummaries, 130 | const unsigned int minEffectiveSize 131 | ): 132 | MinEntropy::TreeSearchNode( 133 | withTerminal, 134 | aaSummaries, 135 | minEffectiveSize 136 | ) {} 137 | 138 | MinEntropy::Amalgamator::Amalgamator( 139 | const Amalgamator *parent, 140 | const unsigned int i, 141 | const std::vector &aaSummaries, 142 | const unsigned int minEffectiveSize 143 | ): 144 | MinEntropy::TreeSearchNode() { 145 | // List-initialization is only supported since c++11 146 | m_used = this->newUsed(parent, i); 147 | m_entropy = this->totalEntropy(aaSummaries, minEffectiveSize); 148 | } 149 | 150 | unsigned int MinEntropy::Amalgamator::getOpenSize() const { 151 | // This is just for iterating the open list 152 | return m_used.size() - 1; 153 | } 154 | 155 | bool MinEntropy::Amalgamator::isEndNode() const { 156 | // The search should be forced to end when the open list is empty 157 | return (m_used.size() == 1) ? true : false; 158 | } 159 | 160 | MinEntropy::segment MinEntropy::Amalgamator::newUsed( 161 | const Amalgamator *parent, 162 | const unsigned int i 163 | ) const { 164 | segment res = parent->getUsed(); 165 | res.erase(res.begin() + i); 166 | // There is no need to sort as the order won't change 167 | return res; 168 | } 169 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | sitePath: phylogeny-based sequence clustering using site polymorphism 2 | ================ 3 | 4 | The below demonstrates the result of phylogeny-based sequence clustering 5 | for a H3N2 virus dataset (included in the package) 6 | 7 | ``` r 8 | library(sitePath) 9 | 10 | data(h3n2_align) # load the H3N2 sequences 11 | data(h3n2_tree) # load the corresponding phylogenetic tree 12 | 13 | options(list("cl.cores" = 10)) # Use 10 cores for multiprocessing 14 | 15 | paths <- lineagePath(h3n2_tree, alignment = h3n2_align, Nmin = 0.05) 16 | ``` 17 | 18 | ## The "tree" object is not bifurcated and resolved by "multi2di" function. 19 | 20 | ## Using 10 cores.. 21 | ## Multiprocessing ended. 22 | 23 | ``` r 24 | minEntropy <- sitesMinEntropy(paths) 25 | ``` 26 | 27 | ## Using 10 cores.. 28 | ## Multiprocessing ended. 29 | 30 | ``` r 31 | p1 <- plotSingleSite(paths, site = 208) # The site polymorphism of site 208 on the tree 32 | p2 <- plotSingleSite(minEntropy, site = 208) # The result of clustering using site 208 33 | gridExtra::grid.arrange(p1, p2, ncol = 2) 34 | ``` 35 | 36 | ![](man/figures/example-1.png) 37 | 38 | ``` r 39 | grp1 <- extractTips(paths, 208) # Grouping result using site polymorphism only 40 | grp2 <- extractTips(minEntropy, 208) # Phylogeny-based clustering result 41 | ``` 42 | 43 | # Installation 44 | 45 | [R programming language](https://cran.r-project.org/) \>= 4.1.0 is 46 | required to use `sitePath`. 47 | 48 | The stable release is available on 49 | [Bioconductor](https://bioconductor.org/packages/sitePath/). 50 | 51 | ``` r 52 | if (!requireNamespace("BiocManager", quietly = TRUE)) 53 | install.packages("BiocManager") 54 | 55 | BiocManager::install("sitePath") 56 | ``` 57 | 58 | The installation from [GitHub](https://github.com/wuaipinglab/sitePath/) 59 | is in experimental stage but gives the newest feature: 60 | 61 | ``` r 62 | if (!requireNamespace("remotes", quietly = TRUE)) 63 | install.packages("remotes") 64 | 65 | remotes::install_github("wuaipinglab/sitePath") 66 | ``` 67 | 68 | # QuickStart 69 | 70 | The following is a quick tutorial on how to use `sitePath` to find 71 | fixation and parallel sites including how to import data, run analysis 72 | and visualization of the results. 73 | 74 | ## 1. Data preparation 75 | 76 | You need a *tree* and a *MSA* (multiple sequence alignment) file and the 77 | sequence names have to be matched! 78 | 79 | ``` r 80 | library(sitePath) # Load the sitePath package 81 | 82 | # The path to your tree and MSA files 83 | tree_file <- system.file("extdata", "ZIKV.newick", package = "sitePath") 84 | alignment_file <- system.file("extdata", "ZIKV.fasta", package = "sitePath") 85 | 86 | 87 | tree <- read.tree(tree_file) # Read the tree file into R 88 | align <- read.alignment(alignment_file, format = "fasta") # Read the MSA file into R 89 | ``` 90 | 91 | ## 2. Run analysis 92 | 93 | `Nmin` and `minSNP` are the respective parameters for finding fixation 94 | and parallel sites (18 and 1 are used as an example for this dataset). 95 | The default values will be used if you don’t specify them. 96 | 97 | ``` r 98 | options(list("cl.cores" = 1)) # Set this bigger than 1 to use multiprocessing 99 | 100 | paraFix <- paraFixSites(tree, alignment = align, Nmin = 18, minSNP = 1) # Find paraFix sites 101 | paraFix 102 | ``` 103 | 104 | ## This is a 'paraFixSites' object 105 | ## 106 | ## fixation sites: 107 | ## 139, 894, 2074, 2086, 2634, 3045, 988, 1143, 2842, 3398, 107, 1118, 3353 108 | ## 109 | ## parallel sites: 110 | ## 105, 2292, 1264, 918, 1226, 1717, 988, 2611, 2787, 2749, 3328, 3162, 1857, 2445, 358, 1404, 3046, 791, 1180, 1016, 1171, 1327, 3076, 106, 2357, 916, 1303, 969, 573, 2909, 2122, 940 111 | ## 112 | ## paraFix sites: 113 | ## 988 114 | 115 | ## 3. Fixation sites 116 | 117 | Use `allSitesName` and set `type` as “fixation” to retrieve fixation 118 | sites name 119 | 120 | ``` r 121 | allSitesName(paraFix, type = "fixation") 122 | ``` 123 | 124 | ## [1] "139" "894" "2074" "2086" "2634" "3045" "988" "1143" "2842" "3398" 125 | ## [11] "107" "1118" "3353" 126 | 127 | Use `plotFixationSites` to view fixation sites 128 | 129 | ``` r 130 | plotFixationSites(paraFix) # View all fixation sites on the tree 131 | ``` 132 | 133 | ![](man/figures/plot_fixSites-1.png) 134 | 135 | ``` r 136 | plotFixationSites(paraFix, site = 139) # View a single site 137 | ``` 138 | 139 | ![](man/figures/plot_fixSites-2.png) 140 | 141 | ## 4. Parallel sites 142 | 143 | Use `allSitesName` and set `type` as “parallel” to retrieve parallel 144 | sites name 145 | 146 | ``` r 147 | allSitesName(paraFix, type = "parallel") 148 | ``` 149 | 150 | ## [1] "105" "2292" "1264" "918" "1226" "1717" "988" "2611" "2787" "2749" 151 | ## [11] "3328" "3162" "1857" "2445" "358" "1404" "3046" "791" "1180" "1016" 152 | ## [21] "1171" "1327" "3076" "106" "2357" "916" "1303" "969" "573" "2909" 153 | ## [31] "2122" "940" 154 | 155 | Use `plotParallelSites` to view parallel sites 156 | 157 | ``` r 158 | plotParallelSites(paraFix) # View all parallel sites on the tree 159 | ``` 160 | 161 | ![](man/figures/unnamed-chunk-1-1.png) 162 | 163 | ``` r 164 | plotParallelSites(paraFix, site = 105) # View a single site 165 | ``` 166 | 167 | ![](man/figures/unnamed-chunk-1-2.png) 168 | 169 | # Read more 170 | 171 | The above uses wrapper functions but the analysis can be dissembled into 172 | step functions (so you can view the result of each step and modify 173 | parameters). Click 174 | [here](https://wuaipinglab.github.io/sitePath/articles/sitePath.html) 175 | for a more detailed tutorial. 176 | 177 | # Getting help 178 | 179 | Post on Bioconductor [support site](https://support.bioconductor.org/) 180 | if having trouble using `sitePath`. Or open an 181 | [issue](https://github.com/wuaipinglab/sitePath/issues/new?assignees=&labels=&template=bug_report.md&title=) 182 | if a bug is found. 183 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | S3method(SNPsites,phyMSAmatched) 4 | S3method(addMSA,phylo) 5 | S3method(addMSA,treedata) 6 | S3method(allSitesName,SNPsites) 7 | S3method(allSitesName,fixationSites) 8 | S3method(allSitesName,paraFixSites) 9 | S3method(allSitesName,parallelSites) 10 | S3method(allSitesName,sitesMinEntropy) 11 | S3method(as.data.frame,SNPsites) 12 | S3method(as.data.frame,fixationSites) 13 | S3method(as.data.frame,parallelSites) 14 | S3method(as.phylo,fixationSites) 15 | S3method(as.phylo,phyMSAmatched) 16 | S3method(as.phylo,sitePath) 17 | S3method(as.phylo,sitesMinEntropy) 18 | S3method(as.treedata,fixationPath) 19 | S3method(as.treedata,fixationSites) 20 | S3method(extractSite,fixationSites) 21 | S3method(extractSite,parallelSites) 22 | S3method(extractTips,fixationSites) 23 | S3method(extractTips,lineagePath) 24 | S3method(extractTips,parallelSites) 25 | S3method(extractTips,sitePara) 26 | S3method(extractTips,sitePath) 27 | S3method(extractTips,sitesMinEntropy) 28 | S3method(fixationIndels,sitesMinEntropy) 29 | S3method(fixationPath,fixationSites) 30 | S3method(fixationPath,sitesMinEntropy) 31 | S3method(fixationSites,lineagePath) 32 | S3method(fixationSites,paraFixSites) 33 | S3method(fixationSites,sitesMinEntropy) 34 | S3method(groupTips,fixationPath) 35 | S3method(groupTips,fixationSites) 36 | S3method(groupTips,lineagePath) 37 | S3method(groupTips,phyMSAmatched) 38 | S3method(groupTips,sitesMinEntropy) 39 | S3method(lineagePath,paraFixSites) 40 | S3method(lineagePath,phyMSAmatched) 41 | S3method(lineagePath,phylo) 42 | S3method(lineagePath,sneakPeekedPaths) 43 | S3method(lineagePath,treedata) 44 | S3method(multiFixationSites,lineagePath) 45 | S3method(paraFixSites,lineagePath) 46 | S3method(paraFixSites,phylo) 47 | S3method(paraFixSites,sitesMinEntropy) 48 | S3method(paraFixSites,treedata) 49 | S3method(parallelSites,lineagePath) 50 | S3method(parallelSites,paraFixSites) 51 | S3method(parallelSites,sitesMinEntropy) 52 | S3method(plot,fixationPath) 53 | S3method(plot,fixationSites) 54 | S3method(plot,lineagePath) 55 | S3method(plot,parallelSites) 56 | S3method(plot,phyMSAmatched) 57 | S3method(plot,sitePath) 58 | S3method(plotFixationSites,fixationSites) 59 | S3method(plotFixationSites,paraFixSites) 60 | S3method(plotMutSites,SNPsites) 61 | S3method(plotMutSites,fixationSites) 62 | S3method(plotMutSites,lineagePath) 63 | S3method(plotMutSites,paraFixSites) 64 | S3method(plotMutSites,parallelSites) 65 | S3method(plotParallelSites,paraFixSites) 66 | S3method(plotParallelSites,parallelSites) 67 | S3method(plotSingleSite,fixationSites) 68 | S3method(plotSingleSite,lineagePath) 69 | S3method(plotSingleSite,parallelSites) 70 | S3method(plotSingleSite,sitesMinEntropy) 71 | S3method(print,SNPsites) 72 | S3method(print,alignment) 73 | S3method(print,fixationIndels) 74 | S3method(print,fixationPath) 75 | S3method(print,fixationSites) 76 | S3method(print,indelPath) 77 | S3method(print,lineagePath) 78 | S3method(print,paraFixSites) 79 | S3method(print,parallelSites) 80 | S3method(print,phyMSAmatched) 81 | S3method(print,sitePara) 82 | S3method(print,sitePath) 83 | S3method(print,sitesMinEntropy) 84 | S3method(setSiteNumbering,phyMSAmatched) 85 | S3method(sitesMinEntropy,lineagePath) 86 | export(SNPsites) 87 | export(addMSA) 88 | export(allSitesName) 89 | export(as.phylo) 90 | export(as.treedata) 91 | export(extractSite) 92 | export(extractTips) 93 | export(fixationIndels) 94 | export(fixationPath) 95 | export(fixationSites) 96 | export(groupTips) 97 | export(lineagePath) 98 | export(multiFixationSites) 99 | export(paraFixSites) 100 | export(parallelSites) 101 | export(plotFixationSites) 102 | export(plotMutSites) 103 | export(plotParallelSites) 104 | export(plotSingleSite) 105 | export(read.alignment) 106 | export(read.tree) 107 | export(setSiteNumbering) 108 | export(similarityMatrix) 109 | export(sitesMinEntropy) 110 | export(sneakPeek) 111 | importFrom(RColorBrewer,brewer.pal) 112 | importFrom(Rcpp,sourceCpp) 113 | importFrom(ape,Nnode) 114 | importFrom(ape,Ntip) 115 | importFrom(ape,as.phylo) 116 | importFrom(ape,getMRCA) 117 | importFrom(ape,is.binary) 118 | importFrom(ape,multi2di) 119 | importFrom(ape,node.depth.edgelength) 120 | importFrom(ape,nodepath) 121 | importFrom(ape,read.tree) 122 | importFrom(aplot,insert_left) 123 | importFrom(ggplot2,GeomPoint) 124 | importFrom(ggplot2,GeomSegment) 125 | importFrom(ggplot2,GeomText) 126 | importFrom(ggplot2,aes) 127 | importFrom(ggplot2,element_blank) 128 | importFrom(ggplot2,element_line) 129 | importFrom(ggplot2,element_rect) 130 | importFrom(ggplot2,geom_point) 131 | importFrom(ggplot2,geom_tile) 132 | importFrom(ggplot2,ggplot) 133 | importFrom(ggplot2,ggtitle) 134 | importFrom(ggplot2,guide_legend) 135 | importFrom(ggplot2,guides) 136 | importFrom(ggplot2,scale_color_manual) 137 | importFrom(ggplot2,scale_fill_manual) 138 | importFrom(ggplot2,scale_size) 139 | importFrom(ggplot2,scale_x_continuous) 140 | importFrom(ggplot2,theme) 141 | importFrom(ggrepel,geom_label_repel) 142 | importFrom(ggtree,geom_point2) 143 | importFrom(ggtree,geom_tiplab) 144 | importFrom(ggtree,ggtree) 145 | importFrom(ggtree,theme_tree2) 146 | importFrom(grDevices,colorRampPalette) 147 | importFrom(graphics,plot) 148 | importFrom(gridExtra,arrangeGrob) 149 | importFrom(gridExtra,grid.arrange) 150 | importFrom(methods,is) 151 | importFrom(parallel,clusterSplit) 152 | importFrom(parallel,detectCores) 153 | importFrom(parallel,makeCluster) 154 | importFrom(parallel,parLapply) 155 | importFrom(parallel,stopCluster) 156 | importFrom(seqinr,read.alignment) 157 | importFrom(stats,complete.cases) 158 | importFrom(stats,median) 159 | importFrom(stats,na.omit) 160 | importFrom(stats,quantile) 161 | importFrom(stats,sd) 162 | importFrom(tidytree,as.treedata) 163 | importFrom(tidytree,as_tibble) 164 | importFrom(tidytree,full_join) 165 | importFrom(tidytree,groupOTU) 166 | importFrom(utils,flush.console) 167 | importFrom(utils,head) 168 | importFrom(utils,setTxtProgressBar) 169 | importFrom(utils,tail) 170 | importFrom(utils,txtProgressBar) 171 | useDynLib(sitePath) 172 | -------------------------------------------------------------------------------- /R/printFunctions.R: -------------------------------------------------------------------------------- 1 | #' @export 2 | print.alignment <- function(x, ...) { 3 | cat("MSA with", length(x[["seq"]]), "sequences.\n\n") 4 | cat("Sequence names:\n") 5 | cat(" ", paste(head(x[["nam"]]), collapse = ", "), ", ...\n\n") 6 | cat("Sequence length:", nchar(x[["seq"]][[1]]), "\n") 7 | } 8 | 9 | #' @export 10 | print.phyMSAmatched <- function(x, ...) { 11 | cat("This is a 'phyMSAmatched' object.\n") 12 | } 13 | 14 | #' @export 15 | print.lineagePath <- function(x, ...) { 16 | cat( 17 | "This is a 'lineagePath' object.\n\n", 18 | length(x), 19 | " lineage paths using ", 20 | attr(x, "minSize"), 21 | " as \"major SNP\" threshold \n", 22 | sep = "" 23 | ) 24 | } 25 | 26 | #' @export 27 | print.SNPsites <- function(x, ...) { 28 | cat("This is a 'SNPsites' object.", "\n\n") 29 | x <- allSitesName(x) 30 | cat(paste(x, collapse = " ")) 31 | } 32 | 33 | #' @export 34 | print.sitesMinEntropy <- function(x, ...) { 35 | cat("This is a 'sitesMinEntropy' object.", "\n\n") 36 | loci <- allSitesName.sitesMinEntropy(x) 37 | cat("There are", length(loci), "sites with enough variation.\n") 38 | } 39 | 40 | #' @export 41 | print.fixationSites <- function(x, ...) { 42 | cat("This is a 'fixationSites' object.\n\nResult for", 43 | length(attr(x, "paths")), 44 | "paths:\n\n") 45 | if (length(x) == 0) { 46 | cat("No multi-fixation found\n") 47 | } else { 48 | cat(paste(names(x), collapse = " "), "\n") 49 | refSeqName <- attr(x, "reference") 50 | if (is.null(refSeqName)) { 51 | cat("No reference sequence specified.", 52 | "Using alignment numbering\n") 53 | } else { 54 | cat("Reference sequence: ", refSeqName, "\n", sep = "") 55 | } 56 | } 57 | } 58 | 59 | #' @export 60 | print.sitePath <- function(x, ...) { 61 | cat("Site", 62 | attr(x, "site"), 63 | "may experience fixation on", 64 | length(x), 65 | "path(s):\n\n") 66 | # A 'sitePath' consists of all the fixation paths for a single site. So each 67 | # 'm' represent a single fixation path 68 | for (m in x) { 69 | if (length(m) == 2) { 70 | mutName <- 71 | paste0(attr(m[[1]], "AA"), attr(x, "site"), attr(m[[2]], "AA")) 72 | cat(mutName, 73 | paste0("(", length(m[[1]]), "->", length(m[[2]]), ")"), 74 | "\n") 75 | } else { 76 | mutName <- character(0) 77 | for (tips in m) { 78 | aa <- attr(tips, "AA") 79 | mutName <- 80 | c(mutName, paste0(aa, "(", length(tips), ")")) 81 | } 82 | cat(paste0(mutName, collapse = " -> "), "\n") 83 | } 84 | } 85 | cat("\nIn the bracket are the number of tips", 86 | "involved before and after the fixation\n") 87 | } 88 | 89 | #' @export 90 | print.parallelSites <- function(x, ...) { 91 | cat("This is a 'parallelSites' object.\n\nResult for", 92 | length(attr(x, "paths")), 93 | "paths:\n\n") 94 | if (length(x) == 0) { 95 | cat("No parallel site found\n") 96 | } else { 97 | cat(paste(names(x), collapse = " "), "\n") 98 | refSeqName <- attr(x, "reference") 99 | if (is.null(refSeqName)) { 100 | cat("No reference sequence specified.", 101 | "Using alignment numbering\n") 102 | } else { 103 | cat("Reference sequence: ", refSeqName, "\n", sep = "") 104 | } 105 | } 106 | } 107 | 108 | #' @export 109 | print.sitePara <- function(x, ...) { 110 | cat( 111 | "This is a 'sitePara' object.\n\nSite", 112 | attr(x, "site"), 113 | "may have parallel mutation on", 114 | length(x), 115 | "pair of paths:\n\n" 116 | ) 117 | mutSummary <- table(vapply( 118 | X = extractTips.sitePara(x), 119 | FUN = function(mutTips) { 120 | attr(mutTips, "mutName")[4] 121 | }, 122 | FUN.VALUE = character(1) 123 | )) 124 | mutInfo <- character() 125 | for (mutName in names(mutSummary)) { 126 | mutInfo <- 127 | c(mutInfo, paste0(mutName, "(", mutSummary[[mutName]], ")")) 128 | } 129 | cat( 130 | paste0(mutInfo, collapse = ", "), 131 | "\n\nIn the bracket are the number of tips", 132 | "involved in the mutation\n" 133 | ) 134 | } 135 | 136 | #' @export 137 | print.paraFixSites <- function(x, ...) { 138 | cat("This is a 'paraFixSites' object\n\n") 139 | for (type in c("fixation", "parallel", "paraFix")) { 140 | sites <- allSitesName.paraFixSites(x, type = type) 141 | if (length(sites)) { 142 | cat(type, 143 | " sites:\n", 144 | paste0(sites, collapse = ", "), 145 | "\n\n", sep = "") 146 | } else { 147 | cat("No", type, "sites found.\n\n") 148 | } 149 | } 150 | } 151 | 152 | #' @export 153 | print.fixationIndels <- function(x, ...) { 154 | cat("This is a 'fixationIndels' object.\n\nResult for", 155 | length(attr(x, "paths")), 156 | "paths:\n\n") 157 | if (length(x) == 0) { 158 | cat("No multi-fixation found\n") 159 | } else { 160 | cat(paste(names(x), collapse = " "), "\n") 161 | refSeqName <- attr(x, "reference") 162 | if (is.null(refSeqName)) { 163 | cat("No reference sequence specified.", 164 | "Using alignment numbering\n") 165 | } else { 166 | cat("Reference sequence: ", refSeqName, "\n", sep = "") 167 | } 168 | } 169 | } 170 | 171 | #' @export 172 | print.indelPath <- function(x, ...) { 173 | cat( 174 | "Site(s)", 175 | attr(x, "indelSites"), 176 | "may experience fixation on", 177 | length(x), 178 | "group(s) of tips:\n\n" 179 | ) 180 | for (i in x) { 181 | cat("(", length(i), ") ", sep = "") 182 | } 183 | cat("\nIn the bracket are the number of tips", 184 | "involved in the fixation\n") 185 | } 186 | 187 | #' @export 188 | print.fixationPath <- function(x, ...) { 189 | print(names(x)) 190 | } 191 | -------------------------------------------------------------------------------- /R/staticData.R: -------------------------------------------------------------------------------- 1 | #' @useDynLib sitePath 2 | #' @importFrom Rcpp sourceCpp 3 | NULL 4 | 5 | #' @rdname sars2_align 6 | #' @title Multiple sequence alignment of SARS-CoV-2 genome CDS 7 | #' @description The raw sequences were downloaded from GISAID database 8 | #' (\url{https://www.gisaid.org/}) and aligned using MAFFT 9 | #' (\url{https://mafft.cbrc.jp/alignment/software/}) with default settings. 10 | #' @format an \code{alignment} object 11 | #' @usage data(sars2_align) 12 | #' @docType data 13 | "sars2_align" 14 | 15 | #' @rdname sars2_tree 16 | #' @title Phylogenetic tree of SARS-CoV-2 genome CDS 17 | #' @description Tree was built from \code{\link{sars2_align}} using RAxML 18 | #' (\url{http://www.exelixis-lab.org/}) with default settings. The tip 19 | #' \code{EPI_ISL_402125} was used as the outgroup to root the tree. 20 | #' @format a \code{phylo} object 21 | #' @usage data(sars2_tree) 22 | #' @docType data 23 | "sars2_tree" 24 | 25 | #' @rdname zikv_align 26 | #' @title Multiple sequence alignment of Zika virus polyprotein 27 | #' @description The raw protein sequences were downloaded from ViPR database 28 | #' (\url{https://www.viprbrc.org/}) and aligned using MAFFT 29 | #' (\url{https://mafft.cbrc.jp/alignment/software/}). with default settings. 30 | #' @format an \code{alignment} object 31 | #' @usage data(zikv_align) 32 | #' @docType data 33 | "zikv_align" 34 | 35 | #' @rdname zikv_tree 36 | #' @title Phylogenetic tree of Zika virus polyprotein 37 | #' @description Tree was built from \code{\link{zikv_align}} using RAxML 38 | #' (\url{http://www.exelixis-lab.org/}) with default settings. The tip 39 | #' ANK57896 was used as outgroup to root the tree. 40 | #' @format a \code{phylo} object 41 | #' @usage data(zikv_tree) 42 | #' @docType data 43 | "zikv_tree" 44 | 45 | #' @rdname zikv_align 46 | #' @description \code{zikv_align_reduced} is a truncated version of 47 | #' \code{zikv_align} 48 | #' @format an \code{alignment} object 49 | #' @usage data(zikv_align_reduced) 50 | #' @docType data 51 | "zikv_align_reduced" 52 | 53 | #' @rdname zikv_tree 54 | #' @description \code{zikv_tree_reduced} is a truncated version of 55 | #' \code{zikv_tree} 56 | #' @format a \code{phylo} object 57 | #' @usage data(zikv_tree_reduced) 58 | #' @docType data 59 | "zikv_tree_reduced" 60 | 61 | #' @rdname h3n2_align 62 | #' @title Multiple sequence alignment of H3N2's HA protein 63 | #' @description The raw protein sequences were downloaded from NCBI database and 64 | #' aligned using MAFFT (\url{https://mafft.cbrc.jp/alignment/software/}). 65 | #' @format an \code{alignment} object 66 | #' @usage data(h3n2_align) 67 | #' @docType data 68 | "h3n2_align" 69 | 70 | #' @rdname h3n2_tree 71 | #' @title Phylogenetic tree of H3N2's HA protein 72 | #' @description Tree was built from \code{\link{h3n2_align}} using RAxML 73 | #' (\url{http://www.exelixis-lab.org/}) with default settings. 74 | #' @format a \code{phylo} object 75 | #' @usage data(h3n2_tree) 76 | #' @docType data 77 | "h3n2_tree" 78 | 79 | #' @rdname h3n2_align 80 | #' @description \code{h3n2_align_reduced} is a truncated version of 81 | #' \code{h3n2_align} 82 | #' @format an \code{alignment} object 83 | #' @usage data(h3n2_align_reduced) 84 | #' @docType data 85 | "h3n2_align_reduced" 86 | 87 | #' @rdname h3n2_tree 88 | #' @description \code{h3n2_tree_reduced} is a truncated version of 89 | #' \code{h3n2_tree} 90 | #' @format a \code{phylo} object 91 | #' @usage data(h3n2_tree_reduced) 92 | #' @docType data 93 | "h3n2_tree_reduced" 94 | 95 | AA_COLORS <- c( 96 | "His" = "#8282D2", 97 | "Arg" = "#9370DB", 98 | "Lys" = "#145AFF", 99 | "Ile" = "#55AE3A", 100 | "Phe" = "#3232AA", 101 | "Leu" = "#0F820F", 102 | "Trp" = "#B45AB4", 103 | "Ala" = "#C8C8C8", 104 | "Met" = "#FFD700", 105 | "Pro" = "#DC9682", 106 | "Val" = "#2F4F2F", 107 | "Asn" = "#00DCDC", 108 | "Cys" = "#E6E600", 109 | "Gly" = "#666666", 110 | "Ser" = "#FF6347", 111 | "Tyr" = "#ADD8E6", 112 | "Gln" = "#0099CC", 113 | "Thr" = "#FA9600", 114 | "Glu" = "#8C1717", 115 | "Asp" = "#E60A0A", 116 | "gap" = "#d3d3d3", 117 | "unknown" = "#000000", 118 | "Ile_or_Leu" = "#000000", 119 | "Asp_or_Asn" = "#000000", 120 | "Glu_or_Gln" = "#000000" 121 | ) 122 | 123 | AA_FULL_NAMES <- c( 124 | "h" = "His", 125 | "r" = "Arg", 126 | "k" = "Lys", 127 | "i" = "Ile", 128 | "f" = "Phe", 129 | "l" = "Leu", 130 | "w" = "Trp", 131 | "a" = "Ala", 132 | "m" = "Met", 133 | "p" = "Pro", 134 | "v" = "Val", 135 | "n" = "Asn", 136 | "c" = "Cys", 137 | "g" = "Gly", 138 | "s" = "Ser", 139 | "y" = "Tyr", 140 | "q" = "Gln", 141 | "t" = "Thr", 142 | "e" = "Glu", 143 | "d" = "Asp", 144 | "-" = "gap", 145 | "x" = "unknown", 146 | "j" = "Ile_or_Leu", 147 | "b" = "Asp_or_Asn", 148 | "z" = "Glu_or_Gln" 149 | ) 150 | 151 | AA_SHORT_NAMES <- c( 152 | "His" = "H", 153 | "Arg" = "R", 154 | "Lys" = "K", 155 | "Ile" = "I", 156 | "Phe" = "F", 157 | "Leu" = "L", 158 | "Trp" = "W", 159 | "Ala" = "A", 160 | "Met" = "M", 161 | "Pro" = "P", 162 | "Val" = "V", 163 | "Asn" = "N", 164 | "Cys" = "C", 165 | "Gly" = "G", 166 | "Ser" = "S", 167 | "Tyr" = "Y", 168 | "Gln" = "Q", 169 | "Thr" = "T", 170 | "Glu" = "E", 171 | "Asp" = "D", 172 | "gap" = "-", 173 | "unknown" = "X", 174 | "Ile_or_Leu" = "J", 175 | "Asp_or_Asn" = "B", 176 | "Glu_or_Gln" = "Z" 177 | ) 178 | 179 | AA_UNAMBIGUOUS <- c( 180 | "H", 181 | "R", 182 | "K", 183 | "I", 184 | "F", 185 | "L", 186 | "W", 187 | "A", 188 | "M", 189 | "P", 190 | "V", 191 | "N", 192 | "C", 193 | "G", 194 | "S", 195 | "Y", 196 | "Q", 197 | "T", 198 | "E", 199 | "D" 200 | ) 201 | 202 | NT_COLORS <- c( 203 | "A" = "#5050ff", 204 | "C" = "#e00000", 205 | "G" = "#00c000", 206 | "T" = "#e6e600", 207 | "U" = "#cc9900", 208 | "-" = "#d3d3d3", 209 | "N" = "#000000", 210 | "X" = "#000000", 211 | "Y" = "#000000", 212 | "R" = "#000000", 213 | "W" = "#000000", 214 | "S" = "#000000", 215 | "K" = "#000000", 216 | "M" = "#000000", 217 | "D" = "#000000", 218 | "V" = "#000000", 219 | "H" = "#000000", 220 | "B" = "#000000" 221 | ) 222 | 223 | NT_UNAMBIGUOUS <- c("A", 224 | "C", 225 | "G", 226 | "T", 227 | "U") 228 | -------------------------------------------------------------------------------- /R/fixationSites.R: -------------------------------------------------------------------------------- 1 | #' @rdname fixationSites 2 | #' @title Fixation sites prediction 3 | #' @description After finding the \code{\link{lineagePath}} of a phylogenetic 4 | #' tree, \code{fixationSites} uses the result to find those sites that show 5 | #' fixation on some, if not all, of the lineages. The number of tips before 6 | #' and after the fixation mutation is expected to be more than 7 | #' \code{minEffectiveSize}. Also, the fixation will be skipped if the amino 8 | #' acid/nucleotide is gap or ambiguous character. A lineage has to have at 9 | #' least one fixation mutation to be reported. 10 | #' @param paths A \code{lineagePath} object returned from 11 | #' \code{\link{lineagePath}} function. 12 | #' @param ... further arguments passed to or from other methods. 13 | #' @return A \code{fixationSites} object. 14 | #' @seealso \code{\link{as.data.frame.fixationSites}} 15 | #' @export 16 | #' @examples 17 | #' data(zikv_tree_reduced) 18 | #' data(zikv_align_reduced) 19 | #' tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 20 | #' fixationSites(lineagePath(tree)) 21 | fixationSites <- function(paths, ...) { 22 | UseMethod("fixationSites") 23 | } 24 | 25 | #' @rdname fixationSites 26 | #' @param minEffectiveSize The minimum number of tips in a group. 27 | #' @param searchDepth The function uses heuristic search but the termination of 28 | #' the search cannot be intrinsically decided. \code{searchDepth} is needed to 29 | #' tell the search when to stop. 30 | #' @param method The strategy for predicting the fixation. The basic approach is 31 | #' entropy minimization and can be achieved by adding or removing fixation 32 | #' point, or by comparing the two. 33 | #' @export 34 | fixationSites.lineagePath <- function(paths, 35 | minEffectiveSize = NULL, 36 | searchDepth = 1, 37 | method = c("compare", "insert", "delete"), 38 | ...) { 39 | minEntropy <- sitesMinEntropy.lineagePath(paths, 40 | minEffectiveSize, 41 | searchDepth, 42 | method) 43 | res <- fixationSites.sitesMinEntropy(minEntropy, ...) 44 | return(res) 45 | } 46 | 47 | #' @rdname fixationSites 48 | #' @export 49 | fixationSites.sitesMinEntropy <- function(paths, ...) { 50 | x <- paths 51 | paths <- attr(x, "paths") 52 | tree <- attr(paths, "tree") 53 | align <- attr(paths, "align") 54 | seqType <- attr(paths, "seqType") 55 | unambiguous <- .unambiguousChars(paths) 56 | # 'res' is going to be the return of this function. Each entry in the list 57 | # is the 'sitePath' for a site. Each site ('sitePath') consists of 'mutPath' 58 | # that is named by the starting node name. The fixed AA and number of 59 | # non-dominant AA is also stored. 60 | res <- list() 61 | for (segs in x) { 62 | for (site in names(segs)) { 63 | seg <- segs[[site]] 64 | # There has to be at least one fixation on the lineage and at least 65 | # two of the mutation is neither gap nor ambiguous character 66 | if (.qualifiedFixation(seg, unambiguous)) { 67 | i <- as.integer(site) 68 | # Test if the slot for the site is empty 69 | if (is.null(res[[site]])) { 70 | # Initiate the first 'mutPath' for the site 71 | res[[site]][[1]] <- lapply(seg, function(tips) { 72 | attr(tips, "tipsAA") <- substr( 73 | x = align[tips], 74 | start = i, 75 | stop = i 76 | ) 77 | return(tips) 78 | }) 79 | } else { 80 | # Assume a new 'mutPath' is to add (not combined by default) 81 | targetIndex <- length(res[[site]]) + 1 82 | # The index to extract the terminal tips of the 'mutPath' 83 | endIndex <- length(seg) 84 | finalAA <- attr(seg[[endIndex]], "AA") 85 | # Combine the two 'mutPath' when the fixation tips before 86 | # the terminal tips are identical and the final fixation 87 | # mutation are the same 88 | toCombine <- vapply( 89 | X = res[[site]], 90 | FUN = function(ep) { 91 | identical(lapply(seg, c)[-endIndex], 92 | lapply(ep, c)[-endIndex]) && 93 | finalAA == attr(ep[[endIndex]], "AA") 94 | }, 95 | FUN.VALUE = logical(1) 96 | ) 97 | if (any(toCombine)) { 98 | existIndex <- which(toCombine) 99 | # These are the candidates to combine. The additional 100 | # condition be all the descendant tips are included. 101 | toCombine <- unique(unlist(lapply( 102 | X = c(res[[site]][existIndex], list(seg)), 103 | FUN = "[[", 104 | i = endIndex 105 | ))) 106 | # Create the newly combined 'mutPath' 107 | seg[[endIndex]] <- toCombine 108 | attr(seg[[endIndex]], "AA") <- finalAA 109 | # Remove the combined 'mutPath' 110 | res[[site]] <- res[[site]][-existIndex] 111 | targetIndex <- length(res[[site]]) + 1 112 | } 113 | seg <- lapply(seg, function(tips) { 114 | attr(tips, "tipsAA") <- substr( 115 | x = align[tips], 116 | start = i, 117 | stop = i 118 | ) 119 | return(tips) 120 | }) 121 | res[[site]][[targetIndex]] <- seg 122 | } 123 | attr(res[[site]], "site") <- i 124 | attr(res[[site]], "tree") <- tree 125 | attr(res[[site]], "seqType") <- seqType 126 | class(res[[site]]) <- "sitePath" 127 | } 128 | } 129 | } 130 | # Set 'paths' and 'clustersByPath' attributes 131 | attr(res, "paths") <- paths 132 | attr(res, "clustersByPath") <- attr(x, "clustersByPath") 133 | class(res) <- "fixationSites" 134 | return(res) 135 | } 136 | 137 | .qualifiedFixation <- function(seg, unambiguous) { 138 | siteChars <- unique(vapply( 139 | X = seg, 140 | FUN = attr, 141 | FUN.VALUE = character(1), 142 | which = "AA" 143 | )) 144 | sum(siteChars %in% unambiguous) >= 2 145 | } 146 | 147 | #' @rdname fixationSites 148 | #' @export 149 | fixationSites.paraFixSites <- function(paths, ...) { 150 | res <- attr(paths, "allFixSites") 151 | return(res) 152 | } 153 | -------------------------------------------------------------------------------- /R/as.data.frame.R: -------------------------------------------------------------------------------- 1 | #' @importFrom ape getMRCA nodepath 2 | 3 | #' @rdname as.data.frame 4 | #' @title Convert results to Data Frame 5 | #' @description Convert return of functions in \code{sitePath} package to a 6 | #' \code{\link{data.frame}} so can be better worked with. The group name for 7 | #' each tip is the same as \code{\link{groupTips}}. 8 | #' @description A \code{\link{fixationSites}} object will output the mutation 9 | #' name of the fixation and the cluster name before and after the mutation. 10 | #' @param x The object to be converted to \code{data.frame}. 11 | #' @param row.names Unimplemented. 12 | #' @param optional Unimplemented. 13 | #' @param ... Other arguments. 14 | #' @return A \code{\link{data.frame}} object. 15 | #' @export 16 | #' @examples 17 | #' data(zikv_tree_reduced) 18 | #' data(zikv_align_reduced) 19 | #' tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 20 | #' fixations <- fixationSites(lineagePath(tree)) 21 | #' as.data.frame(fixations) 22 | as.data.frame.fixationSites <- function(x, 23 | row.names = NULL, 24 | optional = FALSE, 25 | ...) { 26 | res <- .mutationTable(x) 27 | res <- res[, c("mutation", "from", "to")] 28 | return(res) 29 | } 30 | 31 | .mutationTable <- function(fixations) { 32 | # The original tree 33 | tree <- as.phylo.fixationSites(fixations) 34 | # The clusters sorted by path 35 | clustersByPath <- attr(fixations, "clustersByPath") 36 | # The cluster each tip belongs to 37 | clusterInfo <- character() 38 | # Extract the node path for each tip cluster 39 | clusterPaths <- list() 40 | rootNode <- getMRCA(tree, tree[["tip.label"]]) 41 | # The cluster name and node of each group 42 | for (gp in clustersByPath) { 43 | for (tips in gp) { 44 | cluster <- attr(tips, "clsName") 45 | # The cluster named by the tips 46 | clsNames <- rep(cluster, length(tips)) 47 | names(clsNames) <- as.character(tips) 48 | clusterInfo <- c(clusterInfo, clsNames) 49 | # The node path towards the cluster 50 | ancestral <- as.integer(attr(tips, "node")) 51 | if (is.null(ancestral)) { 52 | np <- nodepath(tree, rootNode, tips) 53 | clusterPaths[[cluster]] <- 54 | np[seq_len(length(np) - 1)] 55 | } else { 56 | clusterPaths[[cluster]] <- nodepath(tree, rootNode, ancestral) 57 | } 58 | } 59 | } 60 | # Info for the transition mutation 61 | prevCls <- character() 62 | currCls <- character() 63 | mutName <- character() 64 | transNode <- integer() 65 | for (sp in fixations) { 66 | site <- attr(sp, "site") 67 | prefix <- "" 68 | if (is.character(site)) { 69 | nameSplit <- rev(strsplit(site, " ")[[1]]) 70 | site <- nameSplit[1] 71 | prefix <- nameSplit[2] 72 | if (is.na(prefix)) { 73 | prefix <- "" 74 | } else { 75 | prefix <- paste0(prefix, " ") 76 | } 77 | } 78 | for (mp in sp) { 79 | nodeNames <- names(mp) 80 | for (i in seq_along(mp)[-1]) { 81 | prevTips <- mp[[i - 1]] 82 | currTips <- mp[[i]] 83 | mutation <- paste0(prefix, 84 | attr(prevTips, "AA"), 85 | site, 86 | attr(currTips, "AA")) 87 | trans <- nodeNames[i] 88 | prev <- unique(clusterInfo[as.character(prevTips)]) 89 | names(prev) <- prev 90 | # Choose the most recent cluster to stay un-mutated 91 | prev <- names(which.max(lapply( 92 | X = prev, 93 | FUN = function(cluster) { 94 | length(clusterPaths[[cluster]]) 95 | } 96 | ))) 97 | curr <- unique(clusterInfo[as.character(currTips)]) 98 | names(curr) <- curr 99 | # Choose the most ancient cluster which first receive the 100 | # mutation 101 | curr <- names(which.min(lapply( 102 | X = curr, 103 | FUN = function(cluster) { 104 | length(clusterPaths[[cluster]]) 105 | } 106 | ))) 107 | # Add the new transition mutation 108 | prevCls <- c(prevCls, prev) 109 | currCls <- c(currCls, curr) 110 | mutName <- c(mutName, mutation) 111 | transNode <- c(transNode, trans) 112 | } 113 | } 114 | } 115 | # The mutation between adjacent clusters 116 | res <- data.frame( 117 | "mutation" = mutName, 118 | "from" = prevCls, 119 | "to" = currCls, 120 | "node" = transNode 121 | ) 122 | res <- unique(res) 123 | rownames(res) <- NULL 124 | return(res) 125 | } 126 | 127 | #' @rdname as.data.frame 128 | #' @description An \code{\link{SNPsites}} object will output the tip name with 129 | #' the SNP and its position. 130 | #' @export 131 | as.data.frame.SNPsites <- function(x, 132 | row.names = NULL, 133 | optional = FALSE, 134 | ...) { 135 | res <- attr(x, "allSNP") 136 | return(res) 137 | } 138 | 139 | #' @rdname as.data.frame 140 | #' @description An \code{\link{parallelSites}} object will output the tip name 141 | #' with the group name and mutation info. 142 | #' @export 143 | as.data.frame.parallelSites <- function(x, 144 | row.names = NULL, 145 | optional = FALSE, 146 | ...) { 147 | tree <- as.phylo.phyMSAmatched(attr(x, "paths")) 148 | tipNames <- tree[["tip.label"]] 149 | clustersByPath <- attr(x, "clustersByPath") 150 | clusterInfo <- character() 151 | for (gp in clustersByPath) { 152 | for (tips in gp) { 153 | # The cluster named by the tips 154 | clsNames <- rep(attr(tips, "clsName"), length(tips)) 155 | names(clsNames) <- as.character(tipNames[tips]) 156 | clusterInfo <- c(clusterInfo, clsNames) 157 | } 158 | } 159 | # Info for the parallel mutation 160 | accession <- character() 161 | clsName <- character() 162 | mutSite <- integer() 163 | mutFrom <- character() 164 | mutTo <- character() 165 | isFixed <- logical() 166 | for (sp in x) { 167 | tips <- extractTips.sitePara(sp) 168 | for (t in tips) { 169 | accession <- c(accession, t) 170 | clsName <- c(clsName, clusterInfo[t]) 171 | mutName <- attr(t, "mutName") 172 | mutSite <- c(mutSite, rep(mutName[2], length(t))) 173 | mutFrom <- c(mutFrom, rep(mutName[1], length(t))) 174 | mutTo <- c(mutTo, rep(mutName[3], length(t))) 175 | isFixed <- c(isFixed, rep(attr(t, "fixed"), length(t))) 176 | } 177 | } 178 | res <- data.frame( 179 | "Accession" = accession, 180 | "group" = clsName, 181 | "site" = mutSite, 182 | "mutFrom" = mutFrom, 183 | "mutTo" = mutTo, 184 | "fixation" = isFixed 185 | ) 186 | return(res) 187 | } 188 | -------------------------------------------------------------------------------- /R/paraFixSites.R: -------------------------------------------------------------------------------- 1 | #' @rdname paraFixSites 2 | #' @title The fixation sites with mutation on parallel lineage 3 | #' @description The operation between the results of \code{\link{fixationSites}} 4 | #' and \code{\link{parallelSites}}. 5 | #' @param x A \code{lineagePath} object returned from \code{\link{lineagePath}} 6 | #' function. 7 | #' @param ... further arguments passed to or from other methods. 8 | #' @return A \code{paraFixSites} object. 9 | #' @export 10 | #' @examples 11 | #' data(zikv_tree_reduced) 12 | #' data(zikv_align_reduced) 13 | #' paraFixSites(zikv_tree_reduced, alignment = zikv_align_reduced) 14 | paraFixSites <- function(x, ...) { 15 | UseMethod("paraFixSites") 16 | } 17 | 18 | #' @rdname paraFixSites 19 | #' @param alignment An \code{alignment} object. This commonly can be from 20 | #' sequence parsing function in the \code{\link{seqinr}} package. Sequence 21 | #' names in the alignment should include all \code{tip.label} in the tree 22 | #' @param seqType The type of the sequence in the alignment file. The default is 23 | #' "AA" for amino acid. The other options are "DNA" and "RNA". 24 | #' @param Nmin The parameter for identifying phylogenetic pathway using SNP. If 25 | #' provided as fraction between 0 and 1, then the minimum number of SNP will 26 | #' be total tips times \code{Nmin}. If provided as integer greater than 1, the 27 | #' minimum number will be \code{Nmin}. 28 | #' @param reference Name of reference for site numbering. The name has to be one 29 | #' of the sequences' name. The default uses the intrinsic alignment numbering 30 | #' @param gapChar The character to indicate gap. The numbering will skip the 31 | #' \code{gapChar} for the reference sequence. 32 | #' @param minSkipSize The minimum number of tips to have gap or ambiguous amino 33 | #' acid/nucleotide for a site to be ignored in other analysis. This will not 34 | #' affect the numbering. The default is 0.8. 35 | #' @export 36 | paraFixSites.phylo <- function(x, 37 | alignment = NULL, 38 | seqType = c("AA", "DNA", "RNA"), 39 | Nmin = NULL, 40 | reference = NULL, 41 | gapChar = "-", 42 | minSkipSize = NULL, 43 | ...) { 44 | paths <- lineagePath.phylo( 45 | tree = x, 46 | alignment = alignment, 47 | seqType = seqType, 48 | Nmin = Nmin, 49 | reference = reference, 50 | gapChar = gapChar, 51 | minSkipSize = minSkipSize 52 | ) 53 | res <- paraFixSites.lineagePath(paths, ...) 54 | return(res) 55 | } 56 | 57 | #' @rdname paraFixSites 58 | #' @export 59 | paraFixSites.treedata <- function(x, ...) { 60 | tree <- as.phylo(x) 61 | res <- paraFixSites.phylo(tree) 62 | return(res) 63 | } 64 | 65 | #' @rdname paraFixSites 66 | #' @param minEffectiveSize The minimum number of tips in a group. 67 | #' @param searchDepth The function uses heuristic search but the termination of 68 | #' the search cannot be intrinsically decided. \code{searchDepth} is needed to 69 | #' tell the search when to stop. 70 | #' @param method The strategy for predicting the fixation. The basic approach is 71 | #' entropy minimization and can be achieved by adding or removing fixation 72 | #' point, or by comparing the two. 73 | #' @export 74 | paraFixSites.lineagePath <- function(x, 75 | minEffectiveSize = NULL, 76 | searchDepth = 1, 77 | method = c("compare", "insert", "delete"), 78 | ...) { 79 | minEntropy <- sitesMinEntropy.lineagePath(x, 80 | minEffectiveSize, 81 | searchDepth, 82 | method) 83 | res <- paraFixSites.sitesMinEntropy(minEntropy, ...) 84 | return(res) 85 | } 86 | 87 | #' @rdname paraFixSites 88 | #' @param category Could be \code{parallelOnly}, \code{fixationOnly}, 89 | #' \code{intersect} or \code{union}. 90 | #' @param minSNP The minimum number of mutations to be qualified as parallel on 91 | #' at least two lineages. The default is 1. 92 | #' @param mutMode The strategy for finding parallel site. The default \code{all} 93 | #' is to consider any mutation regardless of the amino acid/nucleotide before 94 | #' and after mutation; Or \code{exact} to force mutation to be the same; Or 95 | #' \code{pre}/\code{post} to select the site having amino acid/nucleotide 96 | #' before/after mutation. 97 | #' @export 98 | paraFixSites.sitesMinEntropy <- function(x, 99 | category = c("intersect", 100 | "union", 101 | "parallelOnly", 102 | "fixationOnly"), 103 | minSNP = NULL, 104 | mutMode = c("all", "exact", 105 | "pre", "post"), 106 | ...) { 107 | # All fixation sites and site names 108 | fixSites <- fixationSites.sitesMinEntropy(x) 109 | fixSiteNames <- allSitesName.fixationSites(fixSites) 110 | # All parallel sites and site names 111 | paraSites <- parallelSites.sitesMinEntropy(x, minSNP, mutMode) 112 | paraSiteNames <- allSitesName.parallelSites(paraSites) 113 | # Derive the sites based on the specified category 114 | sites <- switch( 115 | match.arg(category), 116 | "intersect" = intersect(fixSiteNames, paraSiteNames), 117 | "union" = union(fixSiteNames, paraSiteNames), 118 | "parallelOnly" = setdiff(paraSiteNames, fixSiteNames), 119 | "fixationOnly" = setdiff(fixSiteNames, paraSiteNames) 120 | ) 121 | # Site names as result 122 | res <- sort(as.integer(sites)) 123 | attr(res, "allFixSites") <- fixSites 124 | attr(res, "allParaSites") <- paraSites 125 | # Subset of the fixation and parallel sites 126 | fixSites <- fixSites[intersect(sites, fixSiteNames)] 127 | paraSites <- paraSites[intersect(sites, paraSiteNames)] 128 | # Set fixation sites attribute if any 129 | if (length(fixSites)) { 130 | # Set 'paths' and 'clustersByPath' attributes 131 | attr(fixSites, "paths") <- attr(x, "paths") 132 | attr(fixSites, "clustersByPath") <- 133 | attr(x, "clustersByPath") 134 | class(fixSites) <- "fixationSites" 135 | attr(res, "fixSites") <- fixSites 136 | } 137 | # Set 'allSNP' (to represent parallel sites) attribute if any 138 | if (length(paraSites)) { 139 | allSNP <- unlist(paraSites, 140 | recursive = FALSE, 141 | use.names = FALSE) 142 | allSNP <- unlist(allSNP, 143 | recursive = FALSE, 144 | use.names = FALSE) 145 | allSNP <- do.call(rbind, lapply(allSNP, function(tips) { 146 | Pos <- as.integer(rep(attr(tips, "mutName")[2], length(tips))) 147 | SNP <- rep(attr(tips, "mutName")[3], length(tips)) 148 | data.frame("Accession" = tips, 149 | "Pos" = Pos, 150 | "SNP" = SNP) 151 | })) 152 | attr(res, "allSNP") <- allSNP 153 | } 154 | attr(res, "paths") <- attr(x, "paths") 155 | class(res) <- "paraFixSites" 156 | return(res) 157 | } 158 | -------------------------------------------------------------------------------- /R/fixationIndels.R: -------------------------------------------------------------------------------- 1 | #' @rdname fixationIndels 2 | #' @title Fixation indels prediction 3 | #' @description The fixation of insertions of deletions. 4 | #' @param x The return from \code{\link{sitesMinEntropy}} function. 5 | #' @param ... Other arguments. 6 | #' @return A \code{fixationIndels} object. 7 | #' @export 8 | #' @examples 9 | #' data(zikv_tree_reduced) 10 | #' data(zikv_align_reduced) 11 | #' tree <- addMSA(zikv_tree_reduced, alignment = zikv_align_reduced) 12 | #' fixationIndels(sitesMinEntropy(tree)) 13 | fixationIndels <- function(x, ...) { 14 | UseMethod("fixationIndels") 15 | } 16 | 17 | #' @rdname fixationIndels 18 | #' @export 19 | fixationIndels.sitesMinEntropy <- function(x, ...) { 20 | paths <- attr(x, "paths") 21 | tree <- attr(paths, "tree") 22 | seqType <- attr(paths, "seqType") 23 | gapChar <- attr(paths, "gapChar") 24 | minSize <- attr(paths, "minSize") 25 | # 'res' is going to be the return of this function. Each entry in the list 26 | # is the 'indelPath' for a fragment of sequence. 27 | res <- list() 28 | for (segs in x) { 29 | pathNodeTips <- lapply(attr(segs, "pathNodeTips"), as.integer) 30 | prevSite <- -1 31 | currIndels <- list() 32 | for (site in names(segs)) { 33 | seg <- segs[[site]] 34 | # Find the tips having 'gapChar' at the site 35 | siteChars <- vapply( 36 | X = seg, 37 | FUN = attr, 38 | FUN.VALUE = character(1), 39 | which = "AA" 40 | ) 41 | tipsWithDeletion <- seg[which(siteChars == gapChar)] 42 | if (length(tipsWithDeletion)) { 43 | currSite <- as.integer(site) 44 | # Test the continuity of the deletion 45 | if (currSite - prevSite == 1) { 46 | # Find the overlapping tips to further ensure the continuity 47 | for (iter in seq_along(currIndels)) { 48 | # Existing tips with continuing deletion 49 | refTips <- currIndels[[iter]] 50 | indelSites <- c(attr(refTips, "indelSites"), 51 | currSite) 52 | for (tips in tipsWithDeletion) { 53 | continued <- intersect(refTips, tips) 54 | # The deletion of the tips is ended if the current 55 | # site is not gap 56 | ended <- setdiff(refTips, continued) 57 | # A new deletion is started if a new group of tips 58 | # are gap at the current site 59 | started <- setdiff(tips, continued) 60 | if (length(continued)) { 61 | continued <- .findAncestralNode(continued, 62 | pathNodeTips, 63 | indelSites) 64 | currIndels[iter] <- continued 65 | if (length(ended)) { 66 | ended <- .findAncestralNode(ended, 67 | pathNodeTips, 68 | indelSites) 69 | currIndels <- 70 | c(currIndels, ended) 71 | } 72 | } else { 73 | if (length(ended)) { 74 | ended <- .findAncestralNode(ended, 75 | pathNodeTips, 76 | indelSites) 77 | currIndels[iter] <- ended 78 | } 79 | } 80 | if (length(started)) { 81 | started <- .findAncestralNode(started, 82 | pathNodeTips, 83 | currSite) 84 | currIndels <- c(currIndels, started) 85 | } 86 | } 87 | } 88 | } else { 89 | # Initiate the first deletion fragment or re-initiate new 90 | # deletion fragment if the gap can't be extended due to 91 | # discontunity of the site 92 | currIndels <- 93 | lapply(tipsWithDeletion, function(tips) { 94 | attr(tips, "indelSites") <- currSite 95 | return(tips) 96 | }) 97 | } 98 | # Update the 'prevSite' only when the site is a gap 99 | prevSite <- currSite 100 | } 101 | } 102 | # All indel for the current path 103 | for (tips in currIndels) { 104 | if (length(tips) >= minSize) { 105 | indelSites <- attr(tips, "indelSites") 106 | if (length(indelSites) > 1) { 107 | indelSites <- range(indelSites) 108 | indelSites <- paste0(indelSites, collapse = "-") 109 | } 110 | node <- attr(tips, "node") 111 | res[[indelSites]][[node]] <- tips 112 | attr(res[[indelSites]], "indelSites") <- indelSites 113 | attr(res[[indelSites]], "tree") <- tree 114 | attr(res[[indelSites]], "seqType") <- seqType 115 | class(res[[indelSites]]) <- "indelPath" 116 | } 117 | } 118 | } 119 | # Set 'paths' and 'clustersByPath' attributes 120 | attr(res, "paths") <- paths 121 | class(res) <- "fixationIndels" 122 | return(res) 123 | } 124 | 125 | .findAncestralNode <- function(tipsWithGap, 126 | pathNodeTips, 127 | indelSites) { 128 | res <- list() 129 | # The tips to be grouped 130 | currTips <- integer() 131 | ancestralNode <- names(pathNodeTips[1]) 132 | # Iterate the tips along the path 133 | for (node in names(pathNodeTips)) { 134 | tips <- pathNodeTips[[node]] 135 | # To find the tips that are grouped in the 'tipsWithGap' 136 | if (any(tips %in% tipsWithGap)) { 137 | # The ancestral node is from its starting tips in 'pathNodeTips' 138 | if (length(currTips) == 0) { 139 | ancestralNode <- node 140 | } 141 | # Accumulating the tips (it's assumed the all 'tips' are in 142 | # 'tipsWithGap' if any) 143 | currTips <- c(currTips, tips) 144 | tipsWithGap <- setdiff(tipsWithGap, tips) 145 | } else { 146 | if (length(currTips)) { 147 | # The continuity stopped and new tip group formed 148 | attr(currTips, "node") <- ancestralNode 149 | res[[ancestralNode]] <- currTips 150 | } 151 | # Reset the tips to be grouped 152 | currTips <- integer() 153 | } 154 | } 155 | if (length(currTips)) { 156 | # The continuity stopped and new tip group formed 157 | attr(currTips, "node") <- ancestralNode 158 | attr(currTips, "indelSites") <- indelSites 159 | res[[ancestralNode]] <- currTips 160 | } 161 | return(res) 162 | } 163 | -------------------------------------------------------------------------------- /R/groupTips.R: -------------------------------------------------------------------------------- 1 | #' @importFrom ape Ntip 2 | 3 | #' @rdname groupTips 4 | #' @title The grouping of tree tips 5 | #' @description The tips between divergent nodes or fixation mutations on the 6 | #' lineages are each gathered as group. 7 | #' @param tree The return from \code{\link{addMSA}}, \code{\link{lineagePath}}, 8 | #' \code{\link{sitesMinEntropy}} or other functions. 9 | #' @param similarity This decides how minor SNPs are to remove. If provided as 10 | #' fraction between 0 and 1, then the minimum number of SNP will be total tips 11 | #' times \code{similariy}. If provided as integer greater than 1, the minimum 12 | #' number will be \code{similariy}. The default \code{similariy} is 0.05 for 13 | #' \code{lineagePath}. 14 | #' @param simMatrix Deprecated and will not have effect. 15 | #' @param forbidTrivial Does not allow trivial trimming. 16 | #' @param tipnames If return tips as integer or tip names. 17 | #' @param ... Other arguments. 18 | #' @return \code{groupTips} returns grouping of tips. 19 | #' @export 20 | #' @examples 21 | #' data(zikv_tree) 22 | #' data(zikv_align) 23 | #' tree <- addMSA(zikv_tree, alignment = zikv_align) 24 | #' groupTips(tree) 25 | groupTips <- function(tree, ...) { 26 | UseMethod("groupTips") 27 | } 28 | 29 | #' @rdname groupTips 30 | #' @export 31 | groupTips.phyMSAmatched <- function(tree, 32 | similarity = NULL, 33 | simMatrix = NULL, 34 | forbidTrivial = TRUE, 35 | tipnames = TRUE, 36 | ...) { 37 | paths <- lineagePath.phyMSAmatched( 38 | tree = tree, 39 | similarity = simMatrix, 40 | simMatrix = simMatrix, 41 | forbidTrivial = forbidTrivial, 42 | ... 43 | ) 44 | res <- groupTips.lineagePath(paths, tipnames = tipnames) 45 | return(res) 46 | } 47 | 48 | #' @rdname groupTips 49 | #' @export 50 | groupTips.lineagePath <- function(tree, tipnames = TRUE, ...) { 51 | paths <- tree 52 | tree <- attr(paths, "tree") 53 | # Get the divergent nodes 54 | divNodes <- divergentNode(paths) 55 | # The tips and the corresponding ancestral node 56 | pathNodeTips <- .tipSeqsAlongPathNodes(paths, divNodes) 57 | # To group the tips by the node right after the divergent point 58 | res <- list() 59 | # Iterate through each lineage path 60 | for (p in paths) { 61 | # Assume the root node as the first ancestral node 62 | aNode <- as.character(attr(paths, "rootNode")) 63 | tips <- integer() 64 | pathLen <- length(p) 65 | for (i in seq_len(pathLen)[-pathLen]) { 66 | currNode <- p[[i]] 67 | nextNode <- p[[i + 1]] 68 | # Add the tips of the current node to the group 69 | tips <- c(tips, pathNodeTips[[as.character(currNode)]]) 70 | # Stop adding the tips to the group and take the group out 71 | if (nextNode %in% divNodes) { 72 | if (length(tips)) { 73 | res[[aNode]] <- tips 74 | } 75 | # The node next to the divergent node is the new ancestral node 76 | aNode <- as.character(p[[i + 2]]) 77 | tips <- integer() 78 | next 79 | } 80 | } 81 | # Add the tips of the final node to the group and take the final group 82 | # out 83 | res[[aNode]] <- c(tips, 84 | pathNodeTips[[as.character(p[[pathLen]])]]) 85 | } 86 | if (tipnames) { 87 | res <- lapply(res, function(tips) { 88 | tree[["tip.label"]][tips] 89 | }) 90 | } 91 | return(res) 92 | } 93 | 94 | .tipSeqsAlongPathNodes <- function(paths, divNodes) { 95 | tree <- attr(paths, "tree") 96 | align <- attr(paths, "align") 97 | allNodes <- unlist(paths) 98 | terminalNodes <- vapply( 99 | X = paths, 100 | FUN = function(p) { 101 | p[length(p)] 102 | }, 103 | FUN.VALUE = integer(1) 104 | ) 105 | # Get all the nodes that are not at divergent point 106 | nodes <- setdiff(allNodes, divNodes) 107 | # Get the sequence of the children tips that are descendant of the nodes. 108 | # Assign the tip index to the sequences for retrieving the tip name 109 | nodeAlign <- lapply(nodes, function(n) { 110 | isTerminal <- FALSE 111 | if (n %in% terminalNodes) { 112 | childrenNode <- n 113 | isTerminal <- TRUE 114 | } else { 115 | childrenNode <- tree[["edge"]][which(tree[["edge"]][, 1] == n), 2] 116 | # Keep the node that is not on the path. 117 | childrenNode <- setdiff(childrenNode, allNodes) 118 | } 119 | res <- .childrenTips(tree, childrenNode) 120 | attr(res, "align") <- align[res] 121 | attr(res, "isTerminal") <- isTerminal 122 | return(res) 123 | }) 124 | # Assign the node names to the 'nodeAlign' list 125 | names(nodeAlign) <- nodes 126 | return(nodeAlign) 127 | } 128 | 129 | .childrenTips <- function(tree, node) { 130 | maxTip <- Ntip(tree) 131 | children <- integer() 132 | getChildren <- function(edges, parent) { 133 | children <<- c(children, parent[which(parent <= maxTip)]) 134 | i <- which(edges[, 1] %in% parent) 135 | if (length(i) == 0L) { 136 | return(children) 137 | } else { 138 | parent <- edges[i, 2] 139 | return(getChildren(edges, parent)) 140 | } 141 | } 142 | return(getChildren(tree[["edge"]], node)) 143 | } 144 | 145 | #' @rdname groupTips 146 | #' @export 147 | groupTips.sitesMinEntropy <- function(tree, tipnames = TRUE, ...) { 148 | .unpackClustersByPath(x = tree, tipnames = tipnames) 149 | } 150 | 151 | .unpackClustersByPath <- function(x, tipnames) { 152 | clustersByPath <- attr(x, "clustersByPath") 153 | tree <- as.phylo.sitesMinEntropy(x) 154 | tipLabels <- tree[["tip.label"]] 155 | if (!tipnames) { 156 | tipLabels <- seq_along(tipLabels) 157 | } 158 | res <- list() 159 | for (gp in clustersByPath) { 160 | for (tips in gp) { 161 | res[[attr(tips, "clsName")]] <- tipLabels[as.integer(tips)] 162 | } 163 | } 164 | return(res) 165 | } 166 | 167 | #' @rdname groupTips 168 | #' @export 169 | groupTips.fixationSites <- function(tree, tipnames = TRUE, ...) { 170 | .unpackClustersByPath(x = tree, tipnames = tipnames) 171 | } 172 | 173 | #' @rdname groupTips 174 | #' @export 175 | groupTips.fixationPath <- function(tree, tipnames = TRUE, ...) { 176 | x <- tree 177 | groupName <- names(x) 178 | tree <- attr(x, "tree") 179 | attributes(x) <- NULL 180 | if (tipnames) { 181 | res <- lapply(x, function(tips) { 182 | attributes(tips) <- NULL 183 | return(tree[["tip.label"]][tips]) 184 | }) 185 | } else { 186 | res <- lapply(x, function(tips) { 187 | attributes(tips) <- NULL 188 | return(tips) 189 | }) 190 | } 191 | names(res) <- groupName 192 | return(res) 193 | } 194 | 195 | #' @rdname similarityMatrix 196 | #' @title Similarity between sequences 197 | #' @description Get similarity between aligned sequences with gap ignored. 198 | #' @param tree The return from \code{\link{addMSA}} function. 199 | #' @return A diagonal matrix of similarity between sequences. 200 | #' @export 201 | #' @examples 202 | #' data(zikv_tree) 203 | #' data(zikv_align) 204 | #' tree <- addMSA(zikv_tree, alignment = zikv_align) 205 | #' simMatrix <- similarityMatrix(tree) 206 | similarityMatrix <- function(tree) { 207 | sim <- attr(tree, "simMatrix") 208 | return(sim) 209 | } 210 | -------------------------------------------------------------------------------- /src/minEntropy.h: -------------------------------------------------------------------------------- 1 | /* 2 | * Try to use branch and bound algorithm for finding the best segmenting against 3 | * tree tips along a lineage path. The basic idea is that a path contains 4 | * several segmenting points alongside. And by adding or removing those points, 5 | * the tree tips will be segmented into groups. 6 | * 7 | * By calculating Shannon Entropy, the purity of amino acid in all groups amount 8 | * to the total entropy. If using brute force, there is a max number of 2^n 9 | * calculations to find the minimum entropy. 10 | * 11 | * In branch and bound algorithm, each node represent a combination of segment 12 | * points. We start from no points and all points and converge the result of two 13 | * search trees. 14 | */ 15 | 16 | #ifndef SITEPATH_MINENTROPY_H 17 | #define SITEPATH_MINENTROPY_H 18 | 19 | #include 20 | #include 21 | #include 22 | #include 23 | 24 | namespace MinEntropy { 25 | 26 | typedef std::map aaSummary; 27 | typedef unsigned int segIndex; 28 | typedef std::vector segment; 29 | 30 | float shannonEntropy(const aaSummary &values, const unsigned int tipNum); 31 | 32 | Rcpp::ListOf updatedSegmentation( 33 | const Rcpp::ListOf &nodeSummaries, 34 | const segment &final 35 | ); 36 | 37 | /* 38 | * The base class for implementing node in tree search for minimum entropy The 39 | * node stores the segment points and calculates the total entropy of the 40 | * current segmentation 41 | */ 42 | class TreeSearchNode { 43 | public: 44 | // Pure virtual methods 45 | virtual unsigned int getOpenSize() const = 0; 46 | virtual bool isEndNode() const = 0; 47 | virtual ~TreeSearchNode(); 48 | // Getters for private/protected member variables 49 | segment getUsed() const; 50 | float getEntropy() const; 51 | bool isQualified() const; 52 | float getScore() const; 53 | protected: 54 | // Empty constructor 55 | TreeSearchNode(); 56 | // Initiate used segment points and calculate the entropy 57 | TreeSearchNode( 58 | const segment &used, 59 | const std::vector &aaSummaries, 60 | const unsigned int minEffectiveSize 61 | ); 62 | // Calculate the total entropy 63 | float totalEntropy( 64 | const std::vector &aaSummaries, 65 | const unsigned int minEffectiveSize 66 | ); 67 | protected: 68 | // A list of the segment points which must include the enclosed point 69 | segment m_used; 70 | // The Shannon Entropy of the search node 71 | float m_entropy; 72 | // Whether the search node is qualified (minEffectiveSize) 73 | bool m_qualified; 74 | }; 75 | 76 | /* 77 | * The class to implement the node in the adding search. Store segment indices 78 | * of a nodePath and the remaining unused indices The growing of the tree search 79 | * is by adding one segment point from the open list to the used list for each 80 | * children node 81 | */ 82 | class Segmentor: public TreeSearchNode { 83 | public: 84 | // Initially only the terminal enclosed point is used. 85 | Segmentor( 86 | const segment &all, 87 | const segment &terminal, 88 | const std::vector &aaSummaries, 89 | const unsigned int minEffectiveSize 90 | ); 91 | // Generate a children node from the parent node. Simply pick the "i"th 92 | // segment point from the parent's open list 93 | Segmentor( 94 | const Segmentor *parent, 95 | const unsigned int i, 96 | const std::vector &aaSummaries, 97 | const unsigned int minEffectiveSize 98 | ); 99 | // Provide size of open list for generating children node 100 | unsigned int getOpenSize() const; 101 | // If the node has reach the end of the search 102 | bool isEndNode() const; 103 | private: 104 | // The available segment points for children node 105 | segment m_open; 106 | private: 107 | // The new used segment points 108 | segment newUsed( 109 | const Segmentor *parent, 110 | const unsigned int i 111 | ) const; 112 | // The available segment points for children 113 | segment newOpen( 114 | const Segmentor *parent, 115 | const unsigned int i 116 | ) const; 117 | }; 118 | 119 | /* 120 | * The class to implemenet the node in the removing search. Only the used list 121 | * is needed tracking as it's also the open list. The growing of the search tree 122 | * is by removing one segment point from the used list 123 | */ 124 | class Amalgamator: public TreeSearchNode { 125 | public: 126 | // All segment points are being used initially 127 | Amalgamator( 128 | const segment &withTerminal, 129 | const std::vector &aaSummaries, 130 | const unsigned int minEffectiveSize 131 | ); 132 | // Genetrat a child node from the parent node 133 | Amalgamator( 134 | const Amalgamator *parent, 135 | const unsigned int i, 136 | const std::vector &aaSummaries, 137 | const unsigned int minEffectiveSize 138 | ); 139 | // Provide size of used list for generating children node 140 | unsigned int getOpenSize() const; 141 | // If the node has reach the end of the search 142 | bool isEndNode() const; 143 | private: 144 | // The new used segment points 145 | segment newUsed( 146 | const Amalgamator *parent, 147 | const unsigned int i 148 | ) const; 149 | }; 150 | 151 | /* 152 | * The template class for implementing the tree search. Store the original 153 | * nodePath segmentation and search constrain. It's gonna carry the heuristic 154 | * search for minimum entropy 155 | */ 156 | template 157 | class SearchTree { 158 | public: 159 | SearchTree( 160 | const unsigned int minEffectiveSize, 161 | const unsigned int searchDepth, 162 | const Rcpp::ListOf &nodeSummaries 163 | ); 164 | virtual ~SearchTree(); 165 | // Getters for the private/protected member variables 166 | segment getFinal() const; 167 | float getMinEntropy() const; 168 | // Minimum entropy search 169 | void search(); 170 | void resumeSearch(); 171 | private: 172 | // The minimum number of tips within a segmented group 173 | const unsigned int m_minTipNum, m_searchDepth; 174 | // The terminal segment point essential for enclosing the segmenting 175 | const segIndex m_enclosed; 176 | // Store all possible segment points (except the enclosed point). Track 177 | // final list of segment points which gives minimum entropy 178 | segment m_all, m_final; 179 | // The transformed AA summaries for each node 180 | std::vector m_aaSummaries; 181 | // Track the parent node 182 | T *m_parent; 183 | // To keep track of current minimum entropy of the segmenting 184 | float m_minEntropy; 185 | // The search list containing all the active search nodes 186 | std::vector m_segList; 187 | // The existing or dropped segment 188 | std::vector m_segListHistory; 189 | private: 190 | // Initialize the tree search 191 | void initSearch(); 192 | // Grow the tree from the current parent node 193 | void growTree(T *seg); 194 | // Not yet implemented 195 | void updateFinal(T *tempMin); 196 | // Decide the new parent node 197 | T *updateParent(); 198 | }; 199 | 200 | template<> void SearchTree::initSearch(); 201 | template<> void SearchTree::initSearch(); 202 | 203 | template<> void SearchTree::growTree(Segmentor *seg); 204 | template<> void SearchTree::growTree(Amalgamator *seg); 205 | 206 | // Not yet implemented 207 | template<> void SearchTree::updateFinal(Segmentor *tempMin); 208 | template<> void SearchTree::updateFinal(Amalgamator *tempMin); 209 | 210 | } 211 | 212 | #endif /* SITEPATH_MINENTROPY_H */ 213 | -------------------------------------------------------------------------------- /src/searchTree.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | #include "minEntropy.h" 7 | 8 | template 9 | MinEntropy::SearchTree::SearchTree( 10 | const unsigned int minEffectiveSize, 11 | const unsigned int searchDepth, 12 | const Rcpp::ListOf &nodeSummaries 13 | ): 14 | m_minTipNum(minEffectiveSize), 15 | m_searchDepth(searchDepth), 16 | m_enclosed(nodeSummaries.size()) { 17 | // Transform R list to a vector of AA and mapped frequency as 18 | // 'm_aaSummaries'. Get all the possible segment points 19 | for (segIndex i = 0; i < m_enclosed; ++i) { 20 | m_all.push_back(i); 21 | Rcpp::IntegerVector summary = nodeSummaries[i].attr("aaSummary"); 22 | Rcpp::CharacterVector aa = summary.names(); 23 | aaSummary node; 24 | for (int j = 0; j < aa.size(); ++j) { 25 | node[Rcpp::as(aa[j])] = summary[j]; 26 | } 27 | m_aaSummaries.push_back(node); 28 | } 29 | // The segment point of "0" is removed 30 | m_all.erase(m_all.begin()); 31 | // The final used list contains the last segment point to enclose the last 32 | // segment 33 | m_final.push_back(m_enclosed); 34 | // Generate the first parent node to initialize the search 35 | initSearch(); 36 | } 37 | 38 | template 39 | MinEntropy::SearchTree::~SearchTree() { 40 | // Release the memory used by search nodes 41 | typedef typename std::vector::iterator iter; 42 | for (iter it = m_segList.begin(); it != m_segList.end(); ++it) { 43 | delete *it; 44 | } 45 | m_segList.clear(); 46 | } 47 | 48 | template 49 | MinEntropy::segment MinEntropy::SearchTree::getFinal() const { 50 | return m_final; 51 | } 52 | 53 | template 54 | float MinEntropy::SearchTree::getMinEntropy() const { 55 | return m_minEntropy; 56 | } 57 | 58 | template 59 | void MinEntropy::SearchTree::search() { 60 | unsigned int depth = 0; 61 | const unsigned int maxDepth = m_enclosed * m_searchDepth; 62 | // Find the search node with minimum entropy in the active list and make it 63 | // the growing parent node for the next round. The current minimum entropy 64 | // should be decreasing but increasing is allowed. The used list of a node 65 | // is returned when its entropy stays minimum for a long time. 66 | while (true) { 67 | // Stop when the search reaches the end and delete the new parent node 68 | if (m_parent->isEndNode()) { 69 | delete m_parent; 70 | break; 71 | } 72 | // Generate children node from the parent node 73 | for (unsigned int i = 0; i < m_parent->getOpenSize(); ++i) { 74 | T *seg = new T(m_parent, i, m_aaSummaries, m_minTipNum); 75 | // This is to decide whether the child is valid and can be included 76 | // in the search list 77 | growTree(seg); 78 | } 79 | // Delete the parent node as its pointer already removed from m_segList 80 | delete m_parent; 81 | // Stop when there is no search node in the list 82 | if (m_segList.empty()) { break; } 83 | // Get a new parent node from the search list. 84 | T *tempMin = updateParent(); 85 | // Increment the number (depth) of consecutive times the candidate node 86 | // being minimum entropy or update to a new candidate node and re-count 87 | // the depth 88 | if (tempMin->getEntropy() > m_minEntropy) { 89 | // Increment the depth when the candidate is not beaten 90 | ++depth; 91 | // The search stops when the depth reaches the threshold 92 | if (depth >= maxDepth) { break; } 93 | // The candidate node stays unchanged if the new parent node 94 | // cannot beat it. 95 | } else { 96 | // The new parent node will be the new candidate node if the 97 | // previous candidate is beaten by it. 98 | if (tempMin->isQualified()) { 99 | m_final = tempMin->getUsed(); 100 | m_minEntropy = tempMin->getEntropy(); 101 | } 102 | // Stop when the entropy of the new parent node is 0 103 | if (m_minEntropy == 0) { break; } 104 | // Re-count the depth for the new candidate 105 | depth = 0; 106 | } 107 | // Update the new parent node no matter whether it becomes the new 108 | // candidate or not. 109 | m_parent = tempMin; 110 | } 111 | } 112 | 113 | template 114 | void MinEntropy::SearchTree::resumeSearch() { 115 | if (!m_segList.empty()) { 116 | m_parent = updateParent(); 117 | search(); 118 | } 119 | } 120 | 121 | template class MinEntropy::SearchTree; 122 | 123 | template<> 124 | void MinEntropy::SearchTree::initSearch() { 125 | // The adding search starts with only the last segment point and all the 126 | // rest are open for children nodes 127 | m_parent = new MinEntropy::Segmentor( 128 | m_all, 129 | m_final, 130 | m_aaSummaries, 131 | m_minTipNum 132 | ); 133 | // Initialize minimum entropy as the initial parent's entropy 134 | m_minEntropy = m_parent->getEntropy(); 135 | } 136 | 137 | template<> 138 | void MinEntropy::SearchTree::growTree( 139 | MinEntropy::Segmentor *seg 140 | ) { 141 | // Only the qualified search node in the adding search is added to the 142 | // active list. Deleted otherwise 143 | if (seg->isQualified()) { 144 | m_segList.push_back(seg); 145 | } else { 146 | delete seg; 147 | } 148 | } 149 | 150 | template class MinEntropy::SearchTree; 151 | 152 | template<> 153 | void MinEntropy::SearchTree::initSearch() { 154 | // Use the initial entropy is the starting parent node of the adding search 155 | // because the starting parent node of the removing search can be invalid 156 | MinEntropy::Segmentor noSeg( 157 | m_all, 158 | m_final, 159 | m_aaSummaries, 160 | m_minTipNum 161 | ); 162 | m_final = noSeg.getUsed(); 163 | m_minEntropy = noSeg.getEntropy(); 164 | // Add the last segment point to the used list for the staring parent node 165 | m_all.push_back(m_enclosed); 166 | m_parent = new MinEntropy::Amalgamator( 167 | m_all, 168 | m_aaSummaries, 169 | m_minTipNum 170 | ); 171 | } 172 | 173 | template<> 174 | void MinEntropy::SearchTree::growTree( 175 | MinEntropy::Amalgamator *seg 176 | ) { 177 | // The node will not be included in the active list and deleted if a node 178 | // with the same used list has already been evaluated. The node doesn't have 179 | // to be qualified because we don't want to rule out the possibly qualifed 180 | // children node 181 | segment x = seg->getUsed(); 182 | if (std::find( 183 | m_segListHistory.begin(), 184 | m_segListHistory.end(), 185 | x 186 | ) == m_segListHistory.end()) { 187 | m_segListHistory.push_back(x); 188 | m_segList.push_back(seg); 189 | } else { 190 | delete seg; 191 | } 192 | } 193 | 194 | template 195 | T *MinEntropy::SearchTree::updateParent() { 196 | typedef typename std::vector::iterator iter; 197 | // Assume the first search node in the active list is the new parent node. 198 | iter it = m_segList.begin(), rm = it; 199 | // The new parent node has the minimum entropy among the active list 200 | T *tempMin = *it; 201 | for (++it; it != m_segList.end(); ++it) { 202 | if ((*it)->getEntropy() < tempMin->getEntropy()) { 203 | tempMin = *it; 204 | rm = it; 205 | } 206 | } 207 | // Remove the pointer of the new parent from the search list. 208 | m_segList.erase(rm); 209 | // Return the pointer to the new parent node 210 | return tempMin; 211 | } 212 | -------------------------------------------------------------------------------- /vignettes/sitePath.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Use sitePath to find fixation and parallel sites" 3 | author: "Chengyang Ji" 4 | package: sitePath 5 | output: 6 | BiocStyle::html_document: 7 | toc_float: false 8 | abstract: > 9 | In viral evolution, fixed substitutions in the nucleic acid or protein level are closely associated with maintaining viral function, while parallel mutation reflects the competitive nature in adaptive selection. The continued accumulation of large-scale viral sequence data enhances the challenge of identifying important mutations in a quick and accurate manner. In sitePath, the phylogenetic tree was separated into a set of inheritable phylogenetic pathways via an automated pathway-division method. Then, for each phylogenetic pathway, the identification of fixed substitutions was transformed into a local-optimal-solution problem directed by a minimal entropy algorithm. Finally, the parallel mutation was determined based on the recurrence of mutations among at least two phylogenetic pathways. 10 | vignette: > 11 | %\VignetteIndexEntry{An introduction to sitePath} 12 | %\VignetteEngine{knitr::rmarkdown} 13 | %\VignetteEncoding{UTF-8} 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set( 18 | collapse = TRUE, 19 | comment = "#>" 20 | ) 21 | ``` 22 | 23 | # Introduction 24 | 25 | The `sitePath` package is made for the high-throughput identification of fixed substitutions and parallel mutations in viruses from a single phylogenetic tree. This is achieved by three major steps: 26 | 27 | 1. Clustering phylogenetic terminals 28 | 2. Identifying phylogenetic pathways 29 | 3. Finding fixed and parallel mutations 30 | 31 | # Clustering phylogenetic terminals 32 | 33 | The firs step is to import phylogenetic tree and multiple sequence alignment files. For now, `sitePath` accepts `phylo` object and `alignment` object. Functions from `ggtree` and `seqinr` are able to handle most file formats. 34 | 35 | ## Import tree file 36 | 37 | The S3 `phylo` class is a common data structure for phylogenetic analysis in R. The CRAN package [ape](https://cran.r-project.org/web/packages/ape/index.html) provides basic parsing function for reading tree files. The Bioconductor package [treeio](https://bioconductor.org/packages/release/bioc/html/treeio.html) provides more comprehensive parsing utilities. 38 | 39 | 40 | ```{r import_tree, message=FALSE} 41 | library(sitePath) 42 | 43 | tree_file <- system.file("extdata", "ZIKV.newick", package = "sitePath") 44 | tree <- read.tree(tree_file) 45 | ``` 46 | 47 | It is highly recommended that the file stores a rooted tree as R would consider the tree is rooted by default and re-rooting the tree in R is difficult. Also, we expect the tree to have no super long branches. A bad example is shown below: 48 | 49 | ```{r bad_tree_example, echo=FALSE} 50 | bad_tree <- read.tree(system.file("extdata", "WNV.newick", package = "sitePath")) 51 | 52 | ggtree::ggtree(bad_tree) + ggplot2::ggtitle("Do not use a tree like this") 53 | ``` 54 | 55 | 56 | ## Import sequence alignment file 57 | 58 | Most multiple sequence alignment format can be parsed by [seqinr](https://cran.r-project.org/web/packages/seqinr/index.html). There is a wrapper function for parsing and adding the sequence alignment. Set "cl.cores" in `options` to the number of cores you want to use for multiprocessing. 59 | 60 | ```{r add_alignment, message=FALSE} 61 | alignment_file <- system.file("extdata", "ZIKV.fasta", package = "sitePath") 62 | 63 | options(list("cl.cores" = 1)) # Set this bigger than 1 to use multiprocessing 64 | 65 | paths <- addMSA(tree, msaPath = alignment_file, msaFormat = "fasta") 66 | ``` 67 | 68 | ## Clustering using site polymorphism 69 | 70 | The `addMSA` function will match the sequence names in tree and alignment. Also, the function uses polymorphism of each site to cluster sequences for identifying phylogenetic pathways. 71 | 72 | # Identifying phylogenetic pathways 73 | 74 | After importing the tree and sequence file, `sitePath` is ready to identify phylogenetic pathways. 75 | 76 | ## The impact of threshold on resolving lineages 77 | 78 | The impact of threshold depends on the tree topology hence there is no universal choice. The function `sneakPeak` samples thresholds and calculates the resulting number of paths. *The use of this function can help choose the threshold.* 79 | 80 | ```{r sneakPeek_plot} 81 | preassessment <- sneakPeek(paths, makePlot = TRUE) 82 | ``` 83 | 84 | ## Choose a threshold 85 | 86 | The default threshold is the first 'stable' value to produce the same number of phylogenetic pathways. You can directly use the return of `addMSA` if you want the default or choose other threshold by using function `lineagePath`. The choice of the threshold really depends. Here 18 is used as an example. 87 | 88 | ```{r get_lineagePath} 89 | paths <- lineagePath(preassessment, 18) 90 | paths 91 | ``` 92 | 93 | You can visualize the result. 94 | 95 | ```{r plot_paths} 96 | plot(paths) 97 | ``` 98 | 99 | # Finding fixed and parallel mutations 100 | 101 | Now you're ready to find fixation and parallel mutations. 102 | 103 | ## Entropy minimization 104 | 105 | The `sitesMinEntropy` function perform entropy minimization on every site for each lineage. The fixation and parallel mutations can be derived from the function's return value. 106 | 107 | ```{r min_entropy} 108 | minEntropy <- sitesMinEntropy(paths) 109 | ``` 110 | 111 | ## Fixation mutations 112 | 113 | The hierarchical search is done by `fixationSites` function. The function detects the site with fixation mutation. 114 | 115 | ```{r find_fixations} 116 | fixations <- fixationSites(minEntropy) 117 | fixations 118 | ``` 119 | 120 | To get the position of all the resulting sites, `allSitesName` can be used on the return of `fixationSites` and also other functions like `SNPsites` and `parallelSites`. 121 | 122 | ```{r} 123 | allSites <- allSitesName(fixations) 124 | allSites 125 | ``` 126 | 127 | If you want to retrieve the result of a single site, you can pass the result of `fixationSites` and the site index to `extractSite` function. The output is a `sitePath` object which stores the tip names. 128 | 129 | ```{r get_sitePath} 130 | sp <- extractSite(fixations, 139) 131 | ``` 132 | 133 | It is also possible to retrieve the tips involved in the fixation of the site. 134 | 135 | ```{r get_tipNames} 136 | extractTips(fixations, 139) 137 | ``` 138 | 139 | Use `plot` on a `sitePath` object to visualize the fixation mutation of a single site. Alternatively, use `plotSingleSite` on an `fixationSites` object with the site specified. 140 | 141 | ```{r plot_sitePath} 142 | plot(sp) 143 | plotSingleSite(fixations, 139) 144 | ``` 145 | 146 | To have an overall view of the transition of fixation mutation: 147 | 148 | ```{r plot_fixations} 149 | plot(fixations) 150 | ``` 151 | 152 | ## Parallel mutations 153 | 154 | Parallel mutation can be found by the `parallelSites` function. There are four ways of defining the parallel mutation: `all`, `exact`, `pre` and `post`. Here `exact` is used as an example. 155 | 156 | ```{r} 157 | paraSites <- parallelSites(minEntropy, minSNP = 1, mutMode = "exact") 158 | paraSites 159 | ``` 160 | 161 | The result of a single site can be visualized by `plotSingleSite` function. 162 | 163 | ```{r} 164 | plotSingleSite(paraSites, 105) 165 | ``` 166 | 167 | To have an overall view of the parallel mutations: 168 | 169 | ```{r plot_parallel} 170 | plot(paraSites) 171 | ``` 172 | 173 | # Miscellaneous 174 | 175 | This part is extra and experimental but might be useful when pre-assessing your data. We'll use an example to demonstrate. 176 | 177 | ## Inspect one site 178 | 179 | The `plotSingleSite` function will color the tree according to amino acids if you use the output of `lineagePath` function. 180 | 181 | ```{r plot_sites} 182 | plotSingleSite(paths, 139) 183 | plotSingleSite(paths, 763) 184 | ``` 185 | 186 | ## SNP sites 187 | 188 | An SNP site could potentially undergo fixation event. The `SNPsites` function predicts possible SNP sites and the result could be what you'll expect to be fixation mutation. Also, a tree plot with mutation could be visualized with `plotMutSites` function. 189 | 190 | ```{r find_SNP} 191 | snps <- SNPsites(paths) 192 | plotMutSites(snps) 193 | plotSingleSite(paths, snps[4]) 194 | plotSingleSite(paths, snps[5]) 195 | ``` 196 | 197 | # Session info {.unnumbered} 198 | ```{r session_info} 199 | sessionInfo() 200 | ``` 201 | --------------------------------------------------------------------------------