├── .gitignore ├── images └── sorting_example.jpeg ├── genesortR.Rproj ├── README.md └── genesortR.R /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata -------------------------------------------------------------------------------- /images/sorting_example.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mongiardino/genesortR/HEAD/images/sorting_example.jpeg -------------------------------------------------------------------------------- /genesortR.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # genesortR 2 | Sorting and subsampling of phylogenomic datasets using a multivariate method to quantify phylogenetic usefulness. 3 | 4 | ## Description 5 | This R script estimates seven gene properties commonly used to characterize the information content of loci in phylogenomic datasets: four sources of systematic bias (average pairwise patristic distance, compositional heterogeneity, level of saturation, and root-to-tip variance), two proxies for phylogenetic signal (Robinson-Foulds similarity to a target topology and average bootstrap support), as well as the proportion of variable sites. Some of these properties are estimated directly on sequence data, others on their corresponding topologies. Instead of directly optimizing these properties, a principal component analysis (PCA) is used to find an axis of phylogenetic usefulness along which proxies for signal increase while sources of bias decrease. 6 | 7 | This approach was first used for phylogenomic subsampling by Mongiardino Koch & Thompson (2020), and then found to be able to discover an axis of phylogenetic usefulness across a sample of 18 diverse phylogenomic datasets by Mongiardino Koch (2021). The method was also found to recover more accurate topologies than most other commonly-used methods for phylogenomic subsampling, including those based on rates of evolution (targeting either low, intermediate or high rates), as well as methods that seek to minimize potential sources of systematic bias (e.g., targeting the most clock-like, least saturated, and least compositionally heterogeneous loci). It therefore provides a well-founded alternative to reduce the size of phylogenomic datasets, allowing the use of more complex inference methods (full coalescent methods, site-heterogeneous models, total-evidence dating, etc.), as well as allowing phylogenetic hypotheses to be tested with smaller and better curated datasets. 8 | 9 | ## Parameters 10 | The script requires some editing to define a number of input parameters: 11 | * Set the woring directory to a desired location, ideally where input files are located. This is also where output files will be written. 12 | * Provide names for the four data files that need to be loaded: 13 | 1. Data matrix: needs to be in FASTA format. Only '?', '-' and 'X' are taken to represent missing data. If missing data is represented with other symbols, consider editing the data before running the script. 14 | 2. Partition file: in format 'gene_name = start-end'. 15 | 3. Gene trees: a single tree file in newick format with all gene trees (with node support values). 16 | 4. Species tree: a target topology used to calculate Robinson-Foulds similarity (also in newick format). If some relationships between the sampled taxa are contentious, these can be collapsed so as not to favor any specific resolution. IMPORTANT: This species tree has to be rooted using outgroups. 17 | * ```type```: the data type, either 'AA' or 'DNA'. 18 | * ```ingroup``` (OPTIONAL): The names of two terminals that bracket the ingroup clade. This is recommended and serves to purposes: it allows gene trees to be rooted (which is important for the calculation of root-to-tip distances for example), and it allows outgroups to be discarded before measuring gene properties. If this is not desired, then this parameter can be left empty (i.e., unmodified) and all terminals will be used in the estimation of properties. If terminal names are provided, gene trees will be rooted at the node of the most-recent common ancestor (MRCA) of the ingroup. 19 | * ```threshold```: A threshold of number of ingroup taxa (i.e., a level of occupancy within the ingroup) to even consider the loci. If left as default (i.e., 'auto'), then this is set to 10% of the ingroup, otherwise it can be modified to a specific number of taxa. Loci with less than ```threshold``` will be removed from the data. If this is not desired set to 0. 20 | * ```remove_outliers```: A logical value stating whether to activate the removal of outlier loci (recommended if the dataset is composed of hundreds of loci or more). This will discard a fraction of loci (defined using ```outlier_fraction```) that exhibit the largest Mahalanobis distance in PC space, and which may suffer from problems in orthology inference, alignment, etc. Even if no issue can be detected with these loci, their pattern of correlation between gene properties deviates markedly from that of all others, indicating unusual evolutionary histories (e.g. driven by strong selective pressures) and it is unlikely that small datasets will benefit from their inclusion. The PCA is then repeated after their exclusion. 21 | * ```n_genes```: A final number of genes to retain. If left as defualt (i.e., 'all') then the data is sorted but not subsampled, and all loci are saved to output files. If the objective is to subsample the dataset, modify to a desired number of loci in the final dataset. 22 | * ```topological_similarity```: A logical value determining whether to incorporate Robinson-Foulds (RF) similarity in the PCA or not. This is highly recommended, as in the absence of RF similarity the only proxy for phylogenetic signal left is the average bootstrap support. I have found that, across several datasets, this can lead to favoring the selection of loci that evolve faster than necessary (and contain a higher prevalence of some issues such as compositional and rate heterogeneity). The addition of RF similarity helps balance this effect. However, this requires using a species tree that is considered "true", i.e., it is used as a target against which the topologies of gene trees are evaluated. In the presence of extensive phylogenetic uncertainty, this can be considered undesirable, and potentially even mislead the selection of loci. While even in the case of a highly uncertain phylogeny I would still recommend the inclusion of RF similarity, possibly by providing a relatively unresolved topology with uncertain nodes collapsed, ```topological_similarity``` can be set to ```FALSE``` and RF will not be used to constrain the phylogenetic usefulness axis. Note that even if you choose not to use RF similarities, you still need to provide a species tree, as the script will use it for other tasks such as defining the ingroup.

23 | **WARNING: Taxon names need to match across all files, and loci need to be ordered in the same way in the alignment, partition and gene tree files.** 24 | 25 | ## Estimated gene properties 26 | Seven gene properties are calculated in order to derive PC axes, inlcuding four sources of systematic bias (1-4 below) and two proxies for phylogenetic signal (5-6 below). In every case, sources of bias and proxies for signal are defined in such a way that their minimization and mazimization (respectively) is desireable. 27 | 28 | 1. Root-to-tip variance: This metric provides an estimate of the clocklikeness of the evolutionary process for a loci, andh has been employed routinely to select genes for divergence time estimation (e.g., Smith et al. 2018). If an ```ingroup``` is defined by providing terminal names that bracket the clade, then gene trees will be rooted at the node of their MRCA, and distances will be calculated from this root all terminals within the ingroup. Distances to outgroup taxa will not be used, even when these form part of the ingroup clade in a given gene tree. If an ingroupis not defined, the tree will be rooted at its mid-point and distances will be estimated from this root to all terminals. 29 | 2. Average patristic distances: This value is estimated as the average of all pair-wise patristic distances (i.e., the sum of the lengths of branches that link two nodes in the gene tree) between terminals. Higher values are considered to be conducive to long-branch attraction artifacts (Struck 2014, Kocot et al. 2017). 30 | 3. Level of saturation: Estimated as the complement of the regression slope of pair-wise patristic distances (see above) against p-distances (the proportion of sites in the alignment for which two sequences differ) (Nosenko et al. 2013). Although negative correlations can occur, these are taken to happen due to some level of random error in a completely saturated loci, and the value is replaced with a 1. 31 | 4. Compositional heterogeneity: Variation in the use of amino acids among different branches of a tree is a generally unaccounted source of non-phylogenetic signal, and a major source of error for some clades (Nesnidal et al. 2013). This heterogeneity is estimated here using the relative composition frequency variability (RCFV), as defined by Zhong et al. (2011). Note that this variable is not included in the PCA if data type is set to 'DNA'. 32 | 5. Robinson-Foulds (RF) similarity: This variable (which corresponds to the complement of the RF distance, Robinson & Foulds 1981) is used as an estimate of gene tree error. Although this can suffer from issues of circularity, it should be noted that the species tree provided (which is used as the target topology for the calculation of RF similarity scores) can contain any number of unresolved nodes depicting uncertainty in the resolution of clades. 33 | 6. Average bootstrap support: The average degree of support in the molecular data for the nodes in the gene tree. 34 | 7. Proportion of variable sites: This variable has been variously consider an estimate of information content, a proxy for evolutionary rate, or an amount of phylogenetic signal in the alignment. Its degree of correlation with other gene properties across multiple datasets supports the first two interpretations. 35 | 36 | Besides these seven properties directly employed for the discovery of a phylogenetic usefulness axis, six other properties are also estimated although not directly employed. These include: A. The amount of missing data; B. The level of occupancy; C. The length of the alignment; D. The total tree length (i.e. the sum of all of its branches); E. An estimate of the rate of evolution (tree length divided by the number of taxa; Telford et al. 2014); and F) The treeness of a gene tree (i.e., the proportion of branch lengths in internal branches). 37 | Any of these thirteen properties can be directly employed for the sorting of datasets with a minimial degree of editing to the R script. Instructions on how to do this are provided as comments in section 'D) Sort & Subsample' of the script. 38 | 39 | ## Output 40 | The main outputs that are directly saved to file are the sorted alignment, partition file and gene tree file. If a desired number of genes is specified using the ```n_genes``` parameter, then these datasets will be subsampled as well, allowing for both concatenation and coalescent-based phylogenetic inference to be repeated with a smaller dataset. The estimated properties for both the outlier loci (if outlier filtering is activated), and the full sorted dataset are also output as csv files (properties_outliers.csv and properties_sorted_dataset.csv, respectively). This can help explore the reasons why loci were removed (and potentially fix some of their issues if these are evident), as well as check the correlation among different gene properties and their distribution along the ordering imposed by the script. Additionally, a plot (Fig. 1) is generated showing how the underlying gene properties vary according to the order in which genes are placed based on their phylogenetic usefulness. This plot is also saved to file.


41 | ![sorting_example](https://github.com/mongiardino/genesortR/blob/main/images/sorting_example.jpeg) 42 | **Fig. 1:** Value of the seven gene properties against the order in which loci are sorted according to their phylogenetic usefulness. Loci are from a sea urchin transcriptomic dataset (2,356 loci) of Mongiardino Koch & Thompson (2020). Regression lines correspond to generalized additive models (GAM). 43 | 44 | ## Author 45 | Nicolás Mongiardino Koch. Department of Earth & Planetary Sciences, Yale University. 46 | 47 | Citation: Mongiardino Koch N. 2021. Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, msab151, https://doi.org/10.1093/molbev/msab151 48 | 49 | I greatly appreciate the efforts of Mansa Srivastav and Kevin Kocot in helping debug this script. 50 | Prashant Sharma provided useful comments and suggestions. 51 | 52 | ## References 53 | Kocot KM, Struck TH, Merkel J, Waits DS, Todt C, Brannock PM, Weese DA, Cannon JT, Moroz LL, Lieb B, Halanych KM. 2017. Phylogenomics of Lophotrochozoa with consideration of systematic error. Systematic Biology 66:256-282. 54 | 55 | Mongiardino Koch N, Thompson JR. 2020. A total-evidence dated phylogeny of Echinoidea combining phylogenomic and paleontological data. Systematic Biology syaa069, https://doi.org/10.1093/sysbio/syaa069. 56 | 57 | Mongiardino Koch N. 2021. Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, msab151, https://doi.org/10.1093/molbev/msab151 58 | 59 | Nesnidal MP, Helmkampf M, Meyer A, Witek A, Bruchhaus I, Ebersberger I, Hankeln T, Lieb B, Struck TH, Hausdorf B. 2013. New phylogenomic data support the monophyly of Lophophorata and an Ectoproct–Phoronid clade and indicate that Polyzoa and Kryptrochozoa are caused by systematic bias. BMC Evolutionary Biology 13:253. 60 | 61 | Nosenko T, Schreiber F, Adamska M, Adamski M, Eitel M, Hammel J, Maldonado M, Müller WE, Nickel M, Schierwater B, Vacelet J. 2013. Deep metazoan phylogeny: when different genes tell different stories. Molecular Phylogenetics and Evolution 67:223-233. 62 | 63 | Robinson DF, Foulds LR. 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53:131-730. 64 | 65 | Smith SA, Brown JW, Walker JF. 2018. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PloS One 13:e0197433. 66 | 67 | Struck TH. 2014. TreSpEx – Detection of misleading signal in phylogenetic reconstructions based on tree information. Evolutionary Bioinformatics 10:51-67. 68 | 69 | Telford MJ, Lowe CJ, Cameron CB, Ortega-Martinez O, Aronowicz J, Oliveri P, Copley RR. 2014. Phylogenomic analysis of echinoderm class relationships supports Asterozoa. Proceedings of the Royal Society B 281:20140479. 70 | 71 | Zhong M, Hansen B, Nesnidal M, Golombek A, Halanych KM, Struck TH. 2011. Detecting the symplesiomorphy trap: a multigene phylogenetic analysis of terebelliform annelids. BMC evolutionary biology, 11:1-15. 72 | -------------------------------------------------------------------------------- /genesortR.R: -------------------------------------------------------------------------------- 1 | #genesortR: Property-based multivariate phylogenomic subsampling 2 | #Written by Nicolas Mongiardino Koch 02/2021 3 | 4 | #This script requires an alignment in FASTA format, a partition file in format 5 | #'geneX = 1-200', a species tree considered the best estimate of the true tree 6 | #(e.g., as obtained using concatenation or coalescent methods using the full 7 | #alignment, note that uncertain nodes can be collapsed) and a file with all gene 8 | #trees. Species and gene trees need to be in newick format. The order of genes 9 | #in the alignment and the order of gene trees need to match. The species tree 10 | #must be rooted using outgroups. 11 | 12 | #Parameters needing input are marked with 'INPUT' and are all in the first 13 | #section below, entitled 'Parameters'. 14 | 15 | #More gene properties than the ones used to infer a usefulness axis are 16 | #inferred. If you would like to sort and subsample based on any of these (such 17 | #as occupancy for example) go to heading 'D) Sort & Subsample and modify the 18 | #lines after 'WARNING' 19 | 20 | #More details can be found in the following publications: 21 | #1) Mongiardino Koch & Thompson (2021) - A Total-Evidence Dated Phylogeny of 22 | #Echinoidea Combining Phylogenomic and Paleontological Data. Systematic 23 | #Biology 70(3): 421–439, https://doi.org/10.1093/sysbio/syaa069. 24 | 25 | #2) Mongiardino Koch (2021) - Phylogenomic subsampling and the search for 26 | #phylogenetically reliable loci. Molecular Biology and Evolution 38(9): 27 | #4025–4038, https://doi.org/10.1093/molbev/msab151. 28 | 29 | #Parameters----------------------------------------------------------------------------------------- 30 | #INPUT: set working directory to folder containing these four files 31 | setwd('') 32 | 33 | #INPUT: fill this with file names 34 | alignment <- '' 35 | partition <- '' 36 | species_tree <- '' 37 | gene_trees <- '' 38 | 39 | #INPUT: is the alignment 'DNA' or 'AA' 40 | type <- 'AA' 41 | 42 | #INPUT: provide the names of two terminals that bracket the ingroup, i.e., one 43 | #descendant of each of the two main clades of the ingroup. Leave blank and 44 | #properties will be calculated across the entire tree without removing outliers 45 | ingroup <- c('', '') 46 | 47 | #INPUT: do not even consider genes with less than 'threshold' ingroup taxa. 48 | #If threshold == 'auto' then it is automatically set to more than 10% of the 49 | #ingroup terminals (if the dataset is small, a larger value is probably 50 | #desirable) 51 | threshold <- 'auto' 52 | 53 | #INPUT: activate/deactivate outlier gene removal (recommended) 54 | remove_outliers <- T 55 | outlier_fraction <- 0.01 #i.e. 1% 56 | 57 | ##INPUT: Desired number of genes to retain 58 | #if n_genes == 'all' then the dataset is sorted but not subsampled. 59 | n_genes <- 'all' 60 | 61 | ##INPUT: Whether to incorporate Robinson-Foulds similarity in the PCA. This 62 | ##option is available in case the relationships among the studied taxa are 63 | ##highly uncertain, and there is concern that specifying a given topology might 64 | ##bias results. Alternatively (and preferentially) these uncertainties can also 65 | ##be accommodated by using a partially resolved species tree as input, for which 66 | ##uncertain relationships have been collapsed. In that case, RF similarities 67 | ##will not be affected by the specific resolution of uncertain nodes favored by 68 | ##different genes. Note that even if topological_similarity is set to FALSE, a 69 | ##species tree needs to be provided to delineate the species in the ingroup 70 | topological_similarity <- T 71 | 72 | #Install and load packages------------------------------------------------------------------------- 73 | packages <- c('ape', 'phytools', 'phangorn', 'tibble', 'dplyr', 'tidyr', 74 | 'adephylo', 'ggplot2', 'cowplot') 75 | new_packages <- packages[!packages %in% installed.packages()[,'Package']] 76 | if(length(new_packages)) { install.packages(new_packages) } 77 | 78 | library(phangorn) 79 | library(ape) 80 | library(tibble) 81 | library(dplyr) 82 | library(tidyr) 83 | library(phytools) 84 | library(adephylo) 85 | library(ggplot2) 86 | library(cowplot) 87 | 88 | #Some necessary functions----------------------------------------------------------------------------------- 89 | `%not in%` <- function(x, table) is.na(match(x, table, nomatch = NA_integer_)) 90 | #function to count invariant sites 91 | inv <- function(x) { 92 | pattern <- unique(x) 93 | if(any(pattern %in% c('?'))) pattern <- pattern[-which(pattern == '?')] 94 | if(any(pattern %in% c('-'))) pattern <- pattern[-which(pattern == '-')] 95 | if(any(pattern %in% c('X'))) pattern <- pattern[-which(pattern == 'X')] 96 | 97 | if(length(pattern) == 1) { 98 | invariant <- T 99 | } else { 100 | invariant <- F 101 | } 102 | return(invariant) 103 | } 104 | 105 | #function to remove missing data from the estimation of RCFV 106 | remove_empty <- function(x) { 107 | if('-' %in% names(unlist(x))) { 108 | missing <- which(names(unlist(x)) == '-') 109 | x <- x[-missing] 110 | } 111 | if('?' %in% names(unlist(x))) { 112 | missing <- which(names(unlist(x)) == '?') 113 | x <- x[-missing] 114 | } 115 | if('X' %in% names(unlist(x))) { 116 | missing <- which(names(unlist(x)) == 'X') 117 | x <- x[-missing] 118 | } 119 | return(x) 120 | } 121 | 122 | #A) Prepare data------------------------------------------------------------------------------------------------------- 123 | data <- read.phyDat(alignment, format = 'fasta', type = type) 124 | if(type == 'AA') { 125 | data <- as.AAbin(data) 126 | } else { 127 | data <- as.DNAbin(data) 128 | } 129 | 130 | partitions <- read.table(partition, sep = ' ') 131 | names <- as.character(unlist(enframe(partitions[,(which(partitions[1,] == '=') - 1)], 132 | name = NULL), use.names = F)) 133 | partitions <- enframe(partitions[,ncol(partitions)], name = NULL) 134 | partitions <- partitions %>% 135 | separate(value, into = c('Start', 'End'), sep = '-') %>% 136 | mutate_if(is.character, as.numeric) 137 | 138 | gene_trees <- read.tree(gene_trees) 139 | species_tree <- read.tree(species_tree) 140 | 141 | #get names of ingroup and outgroup taxa 142 | if(all(nchar(ingroup) != 0)) { 143 | node <- getMRCA(species_tree, ingroup) 144 | IG <- Descendants(species_tree, node, type = 'tips') 145 | IG <- species_tree$tip.label[unlist(IG)] 146 | OG <- species_tree$tip.label[species_tree$tip.label %not in% IG] 147 | } else { 148 | IG <- species_tree$tip.label 149 | } 150 | 151 | if(threshold == 'auto') { 152 | threshold <- ceiling(length(IG)/10) 153 | if(all(nchar(ingroup) != 0)) { 154 | cat('Setting threshold to evaluate loci to', threshold, 155 | 'taxa (i.e., 10% of ingroup taxa).\n') 156 | } else { 157 | cat('Setting threshold to evaluate loci to', threshold, 158 | 'taxa (i.e., 10% of all taxa).\n') 159 | } 160 | } else { 161 | if(is.numeric(threshold)) { 162 | if(threshold == 0) { 163 | cat('Taxon threshold is disabled. All loci will be considered regardless 164 | of occupancy level.\n') 165 | } else { 166 | if(threshold < 1) { 167 | threshold <- ceiling(length(IG) * threshold) 168 | cat('Threshold was expecting an integer but was provided a number < 1.\n') 169 | cat('It will be assumed that this should be taken as a fraction 170 | of ingroup taxa.\n') 171 | } else { 172 | if(all(nchar(ingroup) != 0)) { 173 | cat('Loci with less than', threshold, 'ingroup will be discarded.\n') 174 | } else { 175 | cat('Loci with less than', threshold, 'taxa will be discarded.\n') 176 | } 177 | } 178 | } 179 | } else { 180 | cat("Modify threshold parameter to either \'auto\' 181 | or an integer. This run will fail.", '\n') 182 | } 183 | } 184 | 185 | #B) Estimate properties------------------------------------------------------------------------------- 186 | genes <- 1:length(gene_trees) 187 | 188 | root_tip_var <- saturation <- missing <- av_patristic <- 189 | length <- tree_length <- occupancy <- variable_sites <- 190 | RCFV <- rate <- treeness <- average_BS_support <- 191 | robinson_sim <- integer(length(gene_trees)) 192 | 193 | for(i in 1:length(gene_trees)) { 194 | tree <- gene_trees[[i]] 195 | 196 | #remove genes with less than 'threshold' ingroup taxa 197 | if(length(which(tree$tip.label %in% IG)) < threshold) next 198 | 199 | #if OGs are defined and present in this tree, root with them 200 | if(length(IG) != length(species_tree$tip.label)) { 201 | if(any(OG %in% tree$tip.label)) { 202 | MRCA <- getMRCA(tree, which(tree$tip.label %in% IG)) 203 | tree_rooted <- root(tree, node = MRCA) 204 | root_tip_var[i] <- var(dist.nodes(tree_rooted)[MRCA, 205 | which(tree_rooted$tip.label %in% IG)]) 206 | 207 | #after this remove the OGs from the gene tree 208 | tree <- drop.tip(tree, which(tree$tip.label %in% OG)) 209 | } else { 210 | tree_rooted <- midpoint.root(tree) 211 | root_tip_var[i] <- var(dist.nodes(tree_rooted)[(length(tree_rooted$tip.label) + 1), 212 | (1:length(tree_rooted$tip.label))]) 213 | } 214 | } else { #otherwise do midpoint rooting 215 | tree_rooted <- midpoint.root(tree) 216 | root_tip_var[i] <- var(dist.nodes(tree_rooted)[(length(tree_rooted$tip.label) + 1), 217 | (1:length(tree_rooted$tip.label))]) 218 | } 219 | 220 | average_BS_support[i] <- mean(as.numeric(tree$node.label), na.rm = T) 221 | if(is.nan(average_BS_support[i])) average_BS_support[i] <- 0 222 | 223 | #remove taxa from species tree to match gene tree sampling 224 | if(length(which(species_tree$tip.label %not in% tree$tip.label)) > 0) { 225 | this_species_tree <- drop.tip(species_tree, which(species_tree$tip.label %not in% 226 | tree$tip.label)) 227 | } else { 228 | this_species_tree <- species_tree 229 | } 230 | 231 | if(topological_similarity) { 232 | robinson_sim[i] <- 1 - suppressMessages(RF.dist(this_species_tree, tree, 233 | normalize = TRUE, 234 | check.labels = TRUE)) 235 | } 236 | 237 | patristic_dist <- as.matrix(distTips(tree, tips = 'all', 238 | method = 'patristic', useC = T)) 239 | 240 | #get gene sequence 241 | gene <- as.character(data[,partitions$Start[i]:partitions$End[i]]) 242 | 243 | #remove OGs 244 | if(length(IG) != length(species_tree$tip.label)) { 245 | gene <- gene[-which(rownames(gene) %in% OG),] 246 | } 247 | ntax <- dim(gene)[1] 248 | 249 | #remove taxa not in tree (e.g., those with no data for this loci) 250 | if(any(rownames(gene) %not in% tree$tip.label)) { 251 | gene <- gene[-which(rownames(gene) %not in% tree$tip.label),] 252 | } 253 | 254 | #remove entirely empty positions (might originate from pruning OGs for 255 | #example) 256 | all_missing <- which(apply(gene, 2, function(x) all(x %in% c('-','?','X')))) 257 | if(length(all_missing) > 0) { 258 | gene <- gene[,-all_missing] 259 | } 260 | 261 | variable_sites[i] <- 1 - (length(which(apply(gene, 2, inv))) / dim(gene)[2]) 262 | missing[i] <- length(which(gene %in% c('-','?','X'))) / 263 | (dim(gene)[1]*dim(gene)[2]) 264 | length[i] <- dim(gene)[2] 265 | occupancy[i] <- dim(gene)[1]/ntax 266 | 267 | p_dist <- as.matrix(dist.hamming(as.phyDat(gene, type = type), ratio = TRUE, 268 | exclude = "pairwise")) 269 | p_dist <- p_dist[order(colnames(p_dist)),order(colnames(p_dist))] 270 | patristic_dist <- patristic_dist[order(colnames(patristic_dist)), 271 | order(colnames(patristic_dist))] 272 | p_dist <- p_dist[lower.tri(p_dist)] 273 | patristic_dist <- patristic_dist[lower.tri(patristic_dist)] 274 | av_patristic[i] <- mean(patristic_dist) 275 | saturation[i] <- 1 - lm(p_dist ~ patristic_dist)$coefficients[[2]] 276 | if(is.na(saturation[i])) saturation[i] <- 0 277 | if(saturation[i] > 1) saturation[i] <- 1 278 | if(saturation[i] < 0) saturation[i] <- 0 279 | 280 | #if sequence is made of AA, calculate comp. heterogeneity 281 | if(type == 'AA') { 282 | mean_freqs <- table(c(gene)) 283 | states <- sort(unique(c(gene))) 284 | if('-' %in% states) { 285 | mean_freqs <- mean_freqs[-which(states == '-')] 286 | states <- states[-which(states == '-')] 287 | } 288 | if('?' %in% states) { 289 | mean_freqs <- mean_freqs[-which(states == '?')] 290 | states <- states[-which(states == '?')] 291 | } 292 | if('X' %in% states) { 293 | mean_freqs <- mean_freqs[-which(states == 'X')] 294 | states <- states[-which(states == 'X')] 295 | } 296 | mean_freqs <- mean_freqs / sum(mean_freqs) 297 | 298 | freqs <- lapply(split(gene, seq(nrow(gene))), table) 299 | freqs <- lapply(freqs, remove_empty) 300 | for(j in 1:length(freqs)) { 301 | if(!all(states %in% names(freqs[[j]]))) { 302 | miss <- states[which(states %not in% names(freqs[[j]]))] 303 | add <- rep(0, length(miss)) 304 | names(add) <- miss 305 | add <- as.table(add) 306 | freqs[[j]] <- as.table(c(freqs[[j]], add)) 307 | freqs[[j]] <- freqs[[j]][order(factor(names(freqs[[j]])))] 308 | } 309 | } 310 | 311 | freqs <- lapply(freqs, function(x) x/sum(x)) 312 | freqs <- lapply(freqs, function(x) abs(x-mean_freqs)) 313 | freqs <- lapply(freqs, function(x) x/dim(gene)[1]) 314 | freqs <- lapply(freqs, function(x) sum(unlist(x))) 315 | RCFV[i] <- sum(unlist(freqs)) 316 | } 317 | 318 | tree_length[i] <- sum(tree$edge.length) 319 | rate[i] <- sum(tree$edge.length)/length(tree$tip.label) 320 | treeness[i] <- 1 - (sum(tree$edge.length[which(tree$edge[,2] %in% 321 | c(1:length(tree$tip.label)))]) / 322 | sum(tree$edge.length)) 323 | if(is.nan(treeness[i])) treeness[i] <- 0 324 | } 325 | 326 | #gather gene properties 327 | variables <- data.frame(genes, root_tip_var, saturation, missing, rate, 328 | tree_length, treeness, av_patristic, RCFV, length, 329 | occupancy, variable_sites, average_BS_support, 330 | robinson_sim) 331 | 332 | if(type == 'DNA') { 333 | variables <- variables[,-which(colnames(variables) == 'RCFV')] 334 | } else { 335 | if(any(is.na(variables$RCFV))) { 336 | cat('This script will soon crash, as loci and gene trees 337 | are not composed of the same taxa.\n') 338 | cat('Most likely this is due to loci and gene trees not 339 | being in the same order.\n') 340 | cat('This makes the estimation of some gene properties to fail\n') 341 | } 342 | } 343 | 344 | if(!topological_similarity) { 345 | variables <- variables[,-which(colnames(variables) == 'robinson_sim')] 346 | } 347 | 348 | if(any(apply(variables, 2, function(x) length(unique(x))) == 1)) { 349 | cat('Some requested variables were not successfully estimated.\n') 350 | cat('This script will soon crash.\n') 351 | cat('Maybe double check that your gene trees have support values?.\n') 352 | } 353 | 354 | #remove those with less than 'threshold' taxa 355 | useless <- which(apply(variables[,-1], 1, function(x) all(x == 0))) 356 | if(length(useless) > 0) { 357 | variables <- variables[-useless,] 358 | } 359 | 360 | #Select gene properties for PCA (RCFV is only included if type == 'AA) 361 | variables_to_use <- which(colnames(variables) %in% c('root_tip_var', 'saturation', 362 | 'av_patristic', 'RCFV', 363 | 'variable_sites', 'average_BS_support', 364 | 'robinson_sim')) 365 | 366 | if(any(is.na(variables[,variables_to_use]))) { 367 | cat('Something went wrong. Most likely the order of genes 368 | and gene trees does not match.\n') 369 | } 370 | 371 | #perform PCA 372 | PCA <- princomp(variables[,variables_to_use], cor = T, scores = T) 373 | 374 | column_names <- c('Gene name', 'Position in dataset', 'Root-tip var.', 375 | 'Saturation', 'Missing data', 'Evolutionary rate', 376 | 'Tree length', 'Treeness', 'Av patristic dist.', 377 | 'Comp. heterogeneity', 'Alignment length', 'Occupancy', 378 | 'Prop. variable sites', 'Av. boostrap support', 379 | 'RF similarity') 380 | 381 | if(type == 'DNA') column_names <- column_names[-10] 382 | 383 | if(remove_outliers) { 384 | #estimate Mahalanobis distances 385 | maha_distances <- order(mahalanobis(PCA$scores, rep(0, length(variables_to_use)), 386 | cov(PCA$scores)), decreasing = T) 387 | if(outlier_fraction >= 1/nrow(variables)) { 388 | #remove the loci within the top outlier_fraction 389 | outliers <- maha_distances[1:floor(nrow(variables) * outlier_fraction)] 390 | outliers <- sort(outliers) 391 | 392 | outlier_properties <- data.frame(variables[outliers,]) 393 | outlier_properties <- cbind(data.frame(names = names[outliers]), 394 | outlier_properties) 395 | 396 | colnames(outlier_properties) <- column_names[1:ncol(outlier_properties)] 397 | write.csv(outlier_properties, 398 | file = paste0(getwd(), '/properties_outliers.csv'), row.names = F) 399 | 400 | #redo PCA 401 | PCA <- princomp(variables[-outliers,variables_to_use], cor = T, scores = T) 402 | 403 | #get scores for loci along dimensions 1 and 2 404 | PC_1 <- PCA$scores[,1] 405 | PC_2 <- PCA$scores[,2] 406 | for(i in 1:length(outliers)) { 407 | if(outliers[i] == 1) { 408 | PC_1 <- c(NA, PC_1) 409 | PC_2 <- c(NA, PC_2) 410 | } else { 411 | if(outliers[i] < length(PC_1)) { 412 | PC_1 <- c(PC_1[1:(outliers[i]-1)], NA, PC_1[outliers[i]:length(PC_1)]) 413 | PC_2 <- c(PC_2[1:(outliers[i]-1)], NA, PC_2[outliers[i]:length(PC_2)]) 414 | } else { 415 | PC_1 <- c(PC_1, NA) 416 | PC_2 <- c(PC_2, NA) 417 | } 418 | } 419 | } 420 | } else { 421 | cat('Not enough genes to remove even 1 loci, 422 | use different outlier_fraction.\n') 423 | PC_1 <- PCA$scores[,1] 424 | PC_2 <- PCA$scores[,2] 425 | } 426 | } else { 427 | PC_1 <- PCA$scores[,1] 428 | PC_2 <- PCA$scores[,2] 429 | } 430 | 431 | variables <- cbind(variables, PC_1, PC_2) 432 | if(any(is.na(variables))) variables <- variables[which(complete.cases(variables)),] 433 | 434 | #Deal with subsampling threshold 435 | if(n_genes == 'all') { 436 | n_genes <- nrow(variables) 437 | cut <- F 438 | } else { 439 | if(is.character(n_genes)){ 440 | n_genes <- as.numeric(n_genes) 441 | } 442 | cut <- T 443 | } 444 | 445 | #C) Attempt to find usefulness axis automatically---------------------------------------------------- 446 | 447 | #estimate correlation between PCs and rate 448 | corr_rate_PC1 = cor.test(variables$rate, variables$PC_1)$estimate 449 | corr_rate_PC2 = cor.test(variables$rate, variables$PC_2)$estimate 450 | 451 | #if one of the major PCs is rate 452 | if(corr_rate_PC1 > 0.7 | corr_rate_PC2 > 0.7) { 453 | #if correlation between PC1 and rate is higher 454 | if(corr_rate_PC1 > corr_rate_PC2) { 455 | #is it really high? 456 | really_high <- corr_rate_PC1 > 0.7 457 | #and much higher than PC2? 458 | much_higher <- corr_rate_PC1 > (corr_rate_PC2 * 2) 459 | if(really_high & much_higher) { 460 | PC_rate <- 'PC_1' #definitely PC1 461 | } else { 462 | PC_rate <- 'PC_1_maybe' #posibly PC1 463 | } 464 | } else { 465 | #is it really high? 466 | really_high <- corr_rate_PC2 > 0.7 467 | #and much higher than PC1? 468 | much_higher <- corr_rate_PC2 > (corr_rate_PC1 * 2) 469 | if(really_high & much_higher) { 470 | PC_rate <- 'PC_2' #definitely PC2 471 | } else { 472 | PC_rate <- 'PC_2_maybe' #posibly PC2 473 | } 474 | } 475 | } else { 476 | PC_rate <- 'unknown' 477 | } 478 | 479 | if(PC_rate != 'unknown') { 480 | #is rate also usefulness?? 481 | #i.e. should we choose the fastest/slowest evolving loci? 482 | loadings_usefulness <- loadings(PCA)[][,as.numeric(unlist(strsplit(PC_rate, 483 | '_'))[2])] 484 | biases <- which(names(loadings_usefulness) %in% c('root_tip_var', 'saturation', 485 | 'av_patristic', 'RCFV')) 486 | signal <- which(names(loadings_usefulness) %in% c('average_BS_support', 487 | 'robinson_sim')) 488 | if(all(loadings_usefulness[signal] < 0)) { 489 | if(length(which(loadings_usefulness[biases] > 0)) >= 2) { 490 | PC_usefulness <- as.numeric(unlist(strsplit(PC_rate, '_'))[2]) 491 | direction <- 'clear' 492 | descending <- F 493 | } else { 494 | direction <- 'unclear' 495 | } 496 | } else { 497 | if(all(loadings_usefulness[signal] > 0)) { 498 | if(length(which(loadings_usefulness[biases] < 0)) >= 2) { 499 | PC_usefulness <- as.numeric(unlist(strsplit(PC_rate, '_'))[2]) 500 | direction <- 'clear' 501 | descending <- T 502 | } else { 503 | direction <- 'unclear' 504 | } 505 | } else { 506 | direction <- 'unclear' 507 | } 508 | } 509 | 510 | #rate != usefulness, can we find usefulness?? 511 | if(direction == 'unclear') { 512 | if(as.numeric(unlist(strsplit(PC_rate, '_'))[2]) == 1) { 513 | PC_usefulness <- 2 514 | } else { 515 | PC_usefulness <- 1 516 | } 517 | 518 | loadings_usefulness <- loadings(PCA)[][,PC_usefulness] 519 | if(all(loadings_usefulness[signal] < 0)) { 520 | if(length(which(loadings_usefulness[biases] > 0)) >= 2) { 521 | direction <- 'clear' 522 | descending <- F 523 | cat('A usefulness axis has been found!\n') 524 | } else { 525 | direction <- 'unclear' 526 | } 527 | } else { 528 | if(all(loadings_usefulness[signal] > 0)) { 529 | if(length(which(loadings_usefulness[biases] < 0)) >= 2) { 530 | direction <- 'clear' 531 | descending <- T 532 | cat('A usefulness axis has been found!\n') 533 | } 534 | direction <- 'unclear' 535 | } else { 536 | direction <- 'unclear' 537 | } 538 | } 539 | } else { 540 | cat('Rate == usefulness. Proceed and loci will be sorted by rate.\n') 541 | } 542 | } else { 543 | cat('The dataset does not seem to conform to expectations.\n') 544 | cat('Multivariate subsampling is not appropriate. Sorry!\n') 545 | direction <- 'unclear' 546 | } 547 | 548 | #D) Sort & Subsample------------------------------------------------------------------------------ 549 | if(grepl('maybe', PC_rate)) { 550 | cat('There seems to be some ambiguity as to the identity of the axes.\n') 551 | cat('Proceed with caution and check PCA loadings to see 552 | if sorting is appropriate.\n') 553 | } 554 | 555 | if(direction == 'unclear') { 556 | cat('It is unclear how to sort the data.\n') 557 | cat('You can the check loadings and decide manually how to proceed.\n') 558 | cat('In the absense of a clear usefulness axis 559 | my best guess is to sort by rates.\n') 560 | cat('(i.e., choose the slowest evolving genes)\n') 561 | 562 | variables_sorted <- variables[order(variables[,'rate'], decreasing = F),] 563 | } 564 | 565 | if(direction == 'clear') { 566 | cat('Usefulness axis explains', round((PCA$sdev[PC_usefulness]^2 / 567 | sum(PCA$sdev^2)) * 100, digits = 2), 568 | 'percentage of variance\n') 569 | 570 | usefulness_col <- grep(PC_usefulness, colnames(variables)) 571 | 572 | #sort by usefulness 573 | variables_sorted <- variables[order(variables[,usefulness_col], 574 | decreasing = descending),] 575 | } 576 | 577 | #output properties of the dataset 578 | variables_sorted_tosave <- cbind(data.frame(names = names[variables_sorted$genes]), 579 | variables_sorted) 580 | if(topological_similarity) { 581 | colnames(variables_sorted_tosave) <- c(column_names, 'PC1', 'PC2') 582 | } else { 583 | colnames(variables_sorted_tosave) <- c(column_names[-length(column_names)], 584 | 'PC1', 'PC2') 585 | } 586 | 587 | write.csv(variables_sorted_tosave, 588 | file = paste0(getwd(), '/properties_sorted_dataset.csv'), 589 | row.names = F) 590 | 591 | ###WARNING: uncomment and modify the following lines if you would like to sort 592 | ###and subsample by a different property, for example occupancy or RF similarity 593 | #variables_sorted <- variables[order(variables[,'occupancy'], decreasing = T),] 594 | #variables_sorted <- variables[order(variables[,'robinson_sim'], decreasing = T),] 595 | 596 | #sort entire dataset according to the sorting order imposed 597 | positions <- c() 598 | for(j in 1:nrow(variables_sorted)) { 599 | positions <- c(positions, partitions$Start[variables_sorted$genes[j]]:partitions$End[variables_sorted$genes[j]]) 600 | } 601 | sorted_data <- data[,positions] 602 | 603 | #sort partitions 604 | sorted_partitions <- partitions[variables_sorted$genes,] 605 | for(j in 1:nrow(sorted_partitions)) { 606 | dif <- sorted_partitions$End[j] - sorted_partitions$Start[j] 607 | if(j == 1) { 608 | sorted_partitions$Start[j] <- 1 609 | } else { 610 | sorted_partitions$Start[j] <- sorted_partitions$End[j-1] + 1 611 | } 612 | sorted_partitions$End[j] <- sorted_partitions$Start[j]+dif 613 | } 614 | 615 | #sort gene names 616 | sorted_names <- names[variables_sorted$genes] 617 | 618 | #sort genes 619 | sorted_trees <- gene_trees[variables_sorted$genes] 620 | 621 | #subsample (or not if n_genes was left == 0) 622 | sorted_names <- sorted_names[1:n_genes] 623 | sorted_partitions <- sorted_partitions[1:n_genes,] 624 | sorted_data <- sorted_data[,1:sorted_partitions$End[n_genes]] 625 | sorted_trees <- sorted_trees[1:n_genes] 626 | 627 | write.phyDat(phyDat(sorted_data, type = type), 628 | file = paste0(getwd(), '/sorted_alignment_', 629 | n_genes, 'genes.fa'), format = 'fasta') 630 | partitions_tosave <- paste0(sorted_names, ' = ', sorted_partitions$Start, '-', 631 | sorted_partitions$End) 632 | write(partitions_tosave, file = paste0(getwd(), '/sorted_alignment_', 633 | n_genes, 'genes.txt')) 634 | write.tree(sorted_trees, file = paste0(getwd(), '/sorted_trees_', 635 | n_genes, 'genes.tre')) 636 | 637 | #E) Optional: visualize some sorting results------------------------------------------------------------------------------------ 638 | variables_to_plot <- data.frame(gene = rep(variables_sorted$genes, ncol(PCA$loadings)), 639 | value = c(as.matrix(variables_sorted[,variables_to_use])), 640 | property = rep(colnames(variables_sorted[,variables_to_use]), 641 | each = nrow(variables_sorted)), 642 | pos = rep(1:nrow(variables_sorted), ncol(PCA$loadings))) 643 | 644 | order_properties <- as.character(unique(variables_to_plot$property)) 645 | 646 | if(type == 'DNA') { 647 | if(topological_similarity) { 648 | labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 649 | 'Prop. of variable sites', 'Average bootstrap', 'RF similarity') 650 | colors <- c('#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881', '#586160') 651 | } else { 652 | labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 653 | 'Prop. of variable sites', 'Average bootstrap') 654 | colors <- c('#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881') 655 | } 656 | } else { 657 | if(topological_similarity) { 658 | labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 'Comp. heterogeneity', 659 | 'Prop. of variable sites', 'Average bootstrap', 'RF similarity') 660 | colors <- c('#5F1202', '#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881', '#586160') 661 | } else { 662 | labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 'Comp. heterogeneity', 663 | 'Prop. of variable sites', 'Average bootstrap') 664 | colors <- c('#5F1202', '#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881') 665 | } 666 | } 667 | 668 | for(i in 1:length(unique(variables_to_plot$property))) { 669 | main_plot <- ggplot(subset(variables_to_plot, property == 670 | as.character(unique(variables_to_plot$property)[i])), 671 | aes(x = pos, y = value, color = property)) + 672 | geom_point(alpha = 0.2, shape = 16) + geom_smooth(method = 'gam', se = F) + 673 | theme_bw() + theme(legend.position = "none") + 674 | xlab('Sorted position') + ylab('Value') + 675 | scale_color_manual(values = colors[i]) + ggtitle(labs[i]) + 676 | theme(plot.title = element_text(hjust = 0.5)) 677 | 678 | inset_plot <- ggplot(subset(variables_to_plot, property == 679 | as.character(unique(variables_to_plot$property)[i])), 680 | aes(x = pos, y = value, color = property)) + 681 | geom_smooth(method = 'gam', se = T) + 682 | theme_bw() + theme(legend.position = "none") + 683 | theme(axis.title.x = element_blank(), axis.title.y = element_blank(), 684 | axis.text.x = element_blank(), panel.grid.major = element_blank(), 685 | panel.grid.minor = element_blank(), axis.ticks.x = element_blank()) + 686 | scale_color_manual(values = colors[i]) 687 | 688 | if(cut) inset_plot <- inset_plot + geom_vline(xintercept = n_genes, linetype = 'dashed') 689 | 690 | if(length(unique(variables_to_plot$property)) == 7) second_row = 5 691 | if(length(unique(variables_to_plot$property)) <= 6) second_row = 4 692 | 693 | if(i < second_row) { 694 | plot_with_inset <- ggdraw() + suppressMessages(draw_plot(main_plot)) + 695 | suppressMessages(draw_plot(inset_plot, x = 0.15, y = 0.67, 696 | width = .3, height = .25)) 697 | } else { 698 | plot_with_inset <- ggdraw() + suppressMessages(draw_plot(main_plot)) + 699 | suppressMessages(draw_plot(inset_plot, x = 0.15, y = 0.10, 700 | width = .3, height = .25)) 701 | } 702 | 703 | assign(paste0('plot', letters[i]), plot_with_inset) 704 | } 705 | 706 | if(type == 'AA') { 707 | if(topological_similarity) { 708 | final_plot <- plot_grid(plota, plotb, plotc, plotd, 709 | plote, plotf, plotg, nrow = 2) 710 | } else { 711 | final_plot <- plot_grid(plota, plotb, plotc, plotd, 712 | plote, plotf, nrow = 2) 713 | } 714 | } else { 715 | if(topological_similarity) { 716 | final_plot <- plot_grid(plota, plotb, plotc, plotd, 717 | plote, plotf, nrow = 2) 718 | } else { 719 | final_plot <- plot_grid(plota, plotb, plotc, plotd, 720 | plote, nrow = 2) 721 | } 722 | } 723 | 724 | #plot and save 725 | plot(final_plot) 726 | ggsave(paste0('sorted_figure_', n_genes, 'genes.pdf'), plot = final_plot, 727 | width = 16, height = 9, units = 'in') 728 | --------------------------------------------------------------------------------