├── .gitignore
├── images
    └── sorting_example.jpeg
├── genesortR.Rproj
├── README.md
└── genesortR.R


/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata


--------------------------------------------------------------------------------
/images/sorting_example.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mongiardino/genesortR/HEAD/images/sorting_example.jpeg


--------------------------------------------------------------------------------
/genesortR.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # genesortR
 2 | Sorting and subsampling of phylogenomic datasets using a multivariate method to quantify phylogenetic usefulness.
 3 | 
 4 | ## Description
 5 | This R script estimates seven gene properties commonly used to characterize the information content of loci in phylogenomic datasets: four sources of systematic bias (average pairwise patristic distance, compositional heterogeneity, level of saturation, and root-to-tip variance), two proxies for phylogenetic signal (Robinson-Foulds similarity to a target topology and average bootstrap support), as well as the proportion of variable sites. Some of these properties are estimated directly on sequence data, others on their corresponding topologies. Instead of directly optimizing these properties, a principal component analysis (PCA) is used to find an axis of phylogenetic usefulness along which proxies for signal increase while sources of bias decrease.
 6 | 
 7 | This approach was first used for phylogenomic subsampling by Mongiardino Koch & Thompson (2020), and then found to be able to discover an axis of phylogenetic usefulness across a sample of 18 diverse phylogenomic datasets by Mongiardino Koch (2021). The method was also found to recover more accurate topologies than most other commonly-used methods for phylogenomic subsampling, including those based on rates of evolution (targeting either low, intermediate or high rates), as well as methods that seek to minimize potential sources of systematic bias (e.g., targeting the most clock-like, least saturated, and least compositionally heterogeneous loci). It therefore provides a well-founded alternative to reduce the size of phylogenomic datasets, allowing the use of more complex inference methods (full coalescent methods, site-heterogeneous models, total-evidence dating, etc.), as well as allowing phylogenetic hypotheses to be tested with smaller and better curated datasets.
 8 | 
 9 | ## Parameters
10 | The script requires some editing to define a number of input parameters:
11 | * Set the woring directory to a desired location, ideally where input files are located. This is also where output files will be written.
12 | * Provide names for the four data files that need to be loaded:
13 |   1. Data matrix: needs to be in FASTA format. Only '?', '-' and 'X' are taken to represent missing data. If missing data is represented with other symbols, consider editing the data before running the script. 
14 |   2. Partition file: in format 'gene_name = start-end'.
15 |   3. Gene trees: a single tree file in newick format with all gene trees (with node support values).
16 |   4. Species tree: a target topology used to calculate Robinson-Foulds similarity (also in newick format). If some relationships between the sampled taxa are contentious, these can be collapsed so as not to favor any specific resolution. IMPORTANT: This species tree has to be rooted using outgroups.
17 | * ```type```: the data type, either 'AA' or 'DNA'.
18 | * ```ingroup``` (OPTIONAL): The names of two terminals that bracket the ingroup clade. This is recommended and serves to purposes: it allows gene trees to be rooted (which is important for the calculation of root-to-tip distances for example), and it allows outgroups to be discarded before measuring gene properties. If this is not desired, then this parameter can be left empty (i.e., unmodified) and all terminals will be used in the estimation of properties. If terminal names are provided, gene trees will be rooted at the node of the most-recent common ancestor (MRCA) of the ingroup.
19 | * ```threshold```: A threshold of number of ingroup taxa (i.e., a level of occupancy within the ingroup) to even consider the loci. If left as default (i.e., 'auto'), then this is set to 10% of the ingroup, otherwise it can be modified to a specific number of taxa. Loci with less than ```threshold``` will be removed from the data. If this is not desired set to 0.
20 | * ```remove_outliers```: A logical value stating whether to activate the removal of outlier loci (recommended if the dataset is composed of hundreds of loci or more). This will discard a fraction of loci (defined using ```outlier_fraction```) that exhibit the largest Mahalanobis distance in PC space, and which may suffer from problems in orthology inference, alignment, etc. Even if no issue can be detected with these loci, their pattern of correlation between gene properties deviates markedly from that of all others, indicating unusual evolutionary histories (e.g. driven by strong selective pressures) and it is unlikely that small datasets will benefit from their inclusion. The PCA is then repeated after their exclusion.
21 | * ```n_genes```: A final number of genes to retain. If left as defualt (i.e., 'all') then the data is sorted but not subsampled, and all loci are saved to output files. If the objective is to subsample the dataset, modify to a desired number of loci in the final dataset.
22 | * ```topological_similarity```: A logical value determining whether to incorporate Robinson-Foulds (RF) similarity in the PCA or not. This is highly recommended, as in the absence of RF similarity the only proxy for phylogenetic signal left is the average bootstrap support. I have found that, across several datasets, this can lead to favoring the selection of loci that evolve faster than necessary (and contain a higher prevalence of some issues such as compositional and rate heterogeneity). The addition of RF similarity helps balance this effect. However, this requires using a species tree that is considered "true", i.e., it is used as a target against which the topologies of gene trees are evaluated. In the presence of extensive phylogenetic uncertainty, this can be considered undesirable, and potentially even mislead the selection of loci. While even in the case of a highly uncertain phylogeny I would still recommend the inclusion of RF similarity, possibly by providing a relatively unresolved topology with uncertain nodes collapsed, ```topological_similarity``` can be set to ```FALSE``` and RF will not be used to constrain the phylogenetic usefulness axis. Note that even if you choose not to use RF similarities, you still need to provide a species tree, as the script will use it for other tasks such as defining the ingroup.<br/><br/>
23 | **WARNING: Taxon names need to match across all files, and loci need to be ordered in the same way in the alignment, partition and gene tree files.**
24 | 
25 | ## Estimated gene properties
26 | Seven gene properties are calculated in order to derive PC axes, inlcuding four sources of systematic bias (1-4 below) and two proxies for phylogenetic signal (5-6 below). In every case, sources of bias and proxies for signal are defined in such a way that their minimization and mazimization (respectively) is desireable.
27 | 
28 | 1. Root-to-tip variance: This metric provides an estimate of the clocklikeness of the evolutionary process for a loci, andh has been employed routinely to select genes for divergence time estimation (e.g., Smith et al. 2018). If an ```ingroup``` is defined by providing terminal names that bracket the clade, then gene trees will be rooted at the node of their MRCA, and distances will be calculated from this root all terminals within the ingroup. Distances to outgroup taxa will not be used, even when these form part of the ingroup clade in a given gene tree. If an ingroupis not defined, the tree will be rooted at its mid-point and distances will be estimated from this root to all terminals.
29 | 2. Average patristic distances: This value is estimated as the average of all pair-wise patristic distances (i.e., the sum of the lengths of branches that link two nodes in the gene tree) between terminals. Higher values are considered to be conducive to long-branch attraction artifacts (Struck 2014, Kocot et al. 2017).
30 | 3. Level of saturation: Estimated as the complement of the regression slope of pair-wise patristic distances (see above) against p-distances (the proportion of sites in the alignment for which two sequences differ) (Nosenko et al. 2013). Although negative correlations can occur, these are taken to happen due to some level of random error in a completely saturated loci, and the value is replaced with a 1.
31 | 4. Compositional heterogeneity: Variation in the use of amino acids among different branches of a tree is a generally unaccounted source of non-phylogenetic signal, and a major source of error for some clades (Nesnidal et al. 2013). This heterogeneity is estimated here using the relative composition frequency variability (RCFV), as defined by Zhong et al. (2011). Note that this variable is not included in the PCA if data type is set to 'DNA'.
32 | 5. Robinson-Foulds (RF) similarity: This variable (which corresponds to the complement of the RF distance, Robinson & Foulds 1981) is used as an estimate of gene tree error. Although this can suffer from issues of circularity, it should be noted that the species tree provided (which is used as the target topology for the calculation of RF similarity scores) can contain any number of unresolved nodes depicting uncertainty in the resolution of clades.
33 | 6. Average bootstrap support: The average degree of support in the molecular data for the nodes in the gene tree.
34 | 7. Proportion of variable sites: This variable has been variously consider an estimate of information content, a proxy for evolutionary rate, or an amount of phylogenetic signal in the alignment. Its degree of correlation with other gene properties across multiple datasets supports the first two interpretations.
35 | 
36 | Besides these seven properties directly employed for the discovery of a phylogenetic usefulness axis, six other properties are also estimated although not directly employed. These include: A. The amount of missing data; B. The level of occupancy; C. The length of the alignment; D. The total tree length (i.e. the sum of all of its branches); E. An estimate of the rate of evolution (tree length divided by the number of taxa; Telford et al. 2014); and F) The treeness of a gene tree (i.e., the proportion of branch lengths in internal branches).
37 | Any of these thirteen properties can be directly employed for the sorting of datasets with a minimial degree of editing to the R script. Instructions on how to do this are provided as comments in section 'D) Sort & Subsample' of the script.
38 | 
39 | ## Output
40 | The main outputs that are directly saved to file are the sorted alignment, partition file and gene tree file. If a desired number of genes is specified using the ```n_genes``` parameter, then these datasets will be subsampled as well, allowing for both concatenation and coalescent-based phylogenetic inference to be repeated with a smaller dataset. The estimated properties for both the outlier loci (if outlier filtering is activated), and the full sorted dataset are also output as csv files (properties_outliers.csv and properties_sorted_dataset.csv, respectively). This can help explore the reasons why loci were removed (and potentially fix some of their issues if these are evident), as well as check the correlation among different gene properties and their distribution along the ordering imposed by the script. Additionally, a plot (Fig. 1) is generated showing how the underlying gene properties vary according to the order in which genes are placed based on their phylogenetic usefulness. This plot is also saved to file.<br/><br/><br/>
41 | ![sorting_example](https://github.com/mongiardino/genesortR/blob/main/images/sorting_example.jpeg)
42 | **Fig. 1:** Value of the seven gene properties against the order in which loci are sorted according to their phylogenetic usefulness. Loci are from a sea urchin transcriptomic dataset (2,356 loci) of Mongiardino Koch & Thompson (2020). Regression lines correspond to generalized additive models (GAM).
43 | 
44 | ## Author
45 | Nicolás Mongiardino Koch. Department of Earth & Planetary Sciences, Yale University.
46 | 
47 | Citation: Mongiardino Koch N. 2021. Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, msab151, https://doi.org/10.1093/molbev/msab151
48 | 
49 | I greatly appreciate the efforts of Mansa Srivastav and Kevin Kocot in helping debug this script.
50 | Prashant Sharma provided useful comments and suggestions.
51 | 
52 | ## References
53 | Kocot KM, Struck TH, Merkel J, Waits DS, Todt C, Brannock PM, Weese DA, Cannon JT, Moroz LL, Lieb B, Halanych KM. 2017. Phylogenomics of Lophotrochozoa with consideration of systematic error. Systematic Biology 66:256-282.
54 | 
55 | Mongiardino Koch N, Thompson JR. 2020. A total-evidence dated phylogeny of Echinoidea combining phylogenomic and paleontological data. Systematic Biology syaa069, https://doi.org/10.1093/sysbio/syaa069.
56 | 
57 | Mongiardino Koch N. 2021. Phylogenomic subsampling and the search for phylogenetically reliable loci. Molecular Biology and Evolution, msab151, https://doi.org/10.1093/molbev/msab151
58 | 
59 | Nesnidal MP, Helmkampf M, Meyer A, Witek A, Bruchhaus I, Ebersberger I, Hankeln T, Lieb B, Struck TH, Hausdorf B. 2013. New phylogenomic data support the monophyly of Lophophorata and an Ectoproct–Phoronid clade and indicate that Polyzoa and Kryptrochozoa are caused by systematic bias. BMC Evolutionary Biology 13:253.
60 | 
61 | Nosenko T, Schreiber F, Adamska M, Adamski M, Eitel M, Hammel J, Maldonado M, Müller WE, Nickel M, Schierwater B, Vacelet J. 2013. Deep metazoan phylogeny: when different genes tell different stories. Molecular Phylogenetics and Evolution 67:223-233.
62 | 
63 | Robinson DF, Foulds LR. 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53:131-730.
64 | 
65 | Smith SA, Brown JW, Walker JF. 2018. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PloS One 13:e0197433.
66 | 
67 | Struck TH. 2014. TreSpEx – Detection of misleading signal in phylogenetic reconstructions based on tree information. Evolutionary Bioinformatics 10:51-67.
68 | 
69 | Telford MJ, Lowe CJ, Cameron CB, Ortega-Martinez O, Aronowicz J, Oliveri P, Copley RR. 2014. Phylogenomic analysis of echinoderm class relationships supports Asterozoa. Proceedings of the Royal Society B 281:20140479.
70 | 
71 | Zhong M, Hansen B, Nesnidal M, Golombek A, Halanych KM, Struck TH. 2011. Detecting the symplesiomorphy trap: a multigene phylogenetic analysis of terebelliform annelids. BMC evolutionary biology, 11:1-15.
72 | 


--------------------------------------------------------------------------------
/genesortR.R:
--------------------------------------------------------------------------------
  1 | #genesortR: Property-based multivariate phylogenomic subsampling
  2 | #Written by Nicolas Mongiardino Koch 02/2021
  3 | 
  4 | #This script requires an alignment in FASTA format, a partition file in format
  5 | #'geneX = 1-200', a species tree considered the best estimate of the true tree
  6 | #(e.g., as obtained using concatenation or coalescent methods using the full
  7 | #alignment, note that uncertain nodes can be collapsed) and a file with all gene
  8 | #trees. Species and gene trees need to be in newick format. The order of genes
  9 | #in the alignment and the order of gene trees need to match. The species tree 
 10 | #must be rooted using outgroups.
 11 | 
 12 | #Parameters needing input are marked with 'INPUT' and are all in the first
 13 | #section below, entitled 'Parameters'.
 14 | 
 15 | #More gene properties than the ones used to infer a usefulness axis are
 16 | #inferred. If you would like to sort and subsample based on any of these (such
 17 | #as occupancy for example) go to heading 'D) Sort & Subsample and modify the
 18 | #lines after 'WARNING'
 19 | 
 20 | #More details can be found in the following publications:
 21 | #1) Mongiardino Koch & Thompson (2021) - A Total-Evidence Dated Phylogeny of
 22 | #Echinoidea Combining Phylogenomic and Paleontological Data. Systematic 
 23 | #Biology 70(3): 421–439, https://doi.org/10.1093/sysbio/syaa069.
 24 | 
 25 | #2) Mongiardino Koch (2021) - Phylogenomic subsampling and the search for
 26 | #phylogenetically reliable loci. Molecular Biology and Evolution 38(9): 
 27 | #4025–4038, https://doi.org/10.1093/molbev/msab151.
 28 | 
 29 | #Parameters-----------------------------------------------------------------------------------------
 30 | #INPUT: set working directory to folder containing these four files
 31 | setwd('')
 32 | 
 33 | #INPUT: fill this with file names
 34 | alignment <- ''
 35 | partition <- ''
 36 | species_tree <- ''
 37 | gene_trees <- ''
 38 | 
 39 | #INPUT: is the alignment 'DNA' or 'AA'
 40 | type <- 'AA'
 41 | 
 42 | #INPUT: provide the names of two terminals that bracket the ingroup, i.e., one
 43 | #descendant of each of the two main clades of the ingroup. Leave blank and 
 44 | #properties will be calculated across the entire tree without removing outliers
 45 | ingroup <- c('', '')
 46 | 
 47 | #INPUT: do not even consider genes with less than 'threshold' ingroup taxa. 
 48 | #If threshold == 'auto' then it is automatically set to more than 10% of the
 49 | #ingroup terminals (if the dataset is small, a larger value is probably
 50 | #desirable)
 51 | threshold <- 'auto'
 52 | 
 53 | #INPUT: activate/deactivate outlier gene removal (recommended)
 54 | remove_outliers <- T
 55 | outlier_fraction <- 0.01 #i.e. 1%
 56 | 
 57 | ##INPUT: Desired number of genes to retain
 58 | #if n_genes == 'all' then the dataset is sorted but not subsampled.
 59 | n_genes <- 'all'
 60 | 
 61 | ##INPUT: Whether to incorporate Robinson-Foulds similarity in the PCA. This
 62 | ##option is available in case the relationships among the studied taxa are
 63 | ##highly uncertain, and there is concern that specifying a given topology might
 64 | ##bias results. Alternatively (and preferentially) these uncertainties can also
 65 | ##be accommodated by using a partially resolved species tree as input, for which
 66 | ##uncertain relationships have been collapsed. In that case, RF similarities
 67 | ##will not be affected by the specific resolution of uncertain nodes favored by
 68 | ##different genes. Note that even if topological_similarity is set to FALSE, a
 69 | ##species tree needs to be provided to delineate the species in the ingroup
 70 | topological_similarity <- T
 71 | 
 72 | #Install and load packages-------------------------------------------------------------------------
 73 | packages <- c('ape', 'phytools', 'phangorn', 'tibble', 'dplyr', 'tidyr', 
 74 |               'adephylo', 'ggplot2', 'cowplot')
 75 | new_packages <- packages[!packages %in% installed.packages()[,'Package']]
 76 | if(length(new_packages)) { install.packages(new_packages) }
 77 | 
 78 | library(phangorn)
 79 | library(ape)
 80 | library(tibble)
 81 | library(dplyr)
 82 | library(tidyr)
 83 | library(phytools)
 84 | library(adephylo)
 85 | library(ggplot2)
 86 | library(cowplot)
 87 | 
 88 | #Some necessary functions-----------------------------------------------------------------------------------
 89 | `%not in%` <- function(x, table) is.na(match(x, table, nomatch = NA_integer_))
 90 | #function to count invariant sites
 91 | inv <- function(x) {
 92 |   pattern <- unique(x)
 93 |   if(any(pattern %in% c('?'))) pattern <- pattern[-which(pattern == '?')]
 94 |   if(any(pattern %in% c('-'))) pattern <- pattern[-which(pattern == '-')]
 95 |   if(any(pattern %in% c('X'))) pattern <- pattern[-which(pattern == 'X')]
 96 |   
 97 |   if(length(pattern) == 1) {
 98 |     invariant <- T
 99 |   } else {
100 |     invariant <- F
101 |   }
102 |   return(invariant)
103 | }
104 | 
105 | #function to remove missing data from the estimation of RCFV
106 | remove_empty <- function(x) {
107 |   if('-' %in% names(unlist(x))) {
108 |     missing <- which(names(unlist(x)) == '-')
109 |     x <- x[-missing]
110 |   }
111 |   if('?' %in% names(unlist(x))) {
112 |     missing <- which(names(unlist(x)) == '?')
113 |     x <- x[-missing]
114 |   }
115 |   if('X' %in% names(unlist(x))) {
116 |     missing <- which(names(unlist(x)) == 'X')
117 |     x <- x[-missing]
118 |   }
119 |   return(x)
120 | }
121 | 
122 | #A) Prepare data-------------------------------------------------------------------------------------------------------
123 | data <- read.phyDat(alignment, format = 'fasta', type = type)
124 | if(type == 'AA') {
125 |   data <- as.AAbin(data) 
126 | } else {
127 |   data <- as.DNAbin(data)
128 | }
129 | 
130 | partitions <- read.table(partition, sep = ' ')
131 | names <- as.character(unlist(enframe(partitions[,(which(partitions[1,] == '=') - 1)], 
132 |                                      name = NULL), use.names = F))
133 | partitions <- enframe(partitions[,ncol(partitions)], name = NULL)
134 | partitions <- partitions %>% 
135 |   separate(value, into = c('Start', 'End'), sep = '-') %>% 
136 |   mutate_if(is.character, as.numeric)
137 | 
138 | gene_trees <- read.tree(gene_trees)
139 | species_tree <- read.tree(species_tree)
140 | 
141 | #get names of ingroup and outgroup taxa
142 | if(all(nchar(ingroup) != 0)) {
143 |   node <- getMRCA(species_tree, ingroup)
144 |   IG <- Descendants(species_tree, node, type = 'tips')
145 |   IG <- species_tree$tip.label[unlist(IG)]
146 |   OG <- species_tree$tip.label[species_tree$tip.label %not in% IG]
147 | } else {
148 |   IG <- species_tree$tip.label
149 | }
150 | 
151 | if(threshold == 'auto') {
152 |   threshold <- ceiling(length(IG)/10)
153 |   if(all(nchar(ingroup) != 0)) {
154 |     cat('Setting threshold to evaluate loci to', threshold, 
155 |         'taxa (i.e., 10% of ingroup taxa).\n')
156 |   } else {
157 |     cat('Setting threshold to evaluate loci to', threshold, 
158 |         'taxa (i.e., 10% of all taxa).\n')
159 |   }
160 | } else {
161 |   if(is.numeric(threshold)) {
162 |     if(threshold == 0) {
163 |       cat('Taxon threshold is disabled. All loci will be considered regardless 
164 |           of occupancy level.\n')
165 |     } else {
166 |       if(threshold < 1) {
167 |         threshold <- ceiling(length(IG) * threshold)
168 |         cat('Threshold was expecting an integer but was provided a number < 1.\n')
169 |         cat('It will be assumed that this should be taken as a fraction 
170 |             of ingroup taxa.\n')
171 |       } else {
172 |         if(all(nchar(ingroup) != 0)) {
173 |           cat('Loci with less than', threshold, 'ingroup will be discarded.\n')
174 |         } else {
175 |           cat('Loci with less than', threshold, 'taxa will be discarded.\n')
176 |         }
177 |       }
178 |     }
179 |   } else {
180 |     cat("Modify threshold parameter to either \'auto\' 
181 |         or an integer. This run will fail.", '\n')
182 |   }
183 | }
184 | 
185 | #B) Estimate properties-------------------------------------------------------------------------------
186 | genes <- 1:length(gene_trees)
187 | 
188 | root_tip_var <- saturation <- missing <- av_patristic <- 
189 |   length <- tree_length <- occupancy <- variable_sites <- 
190 |   RCFV <- rate <- treeness <- average_BS_support <- 
191 |   robinson_sim <- integer(length(gene_trees))
192 | 
193 | for(i in 1:length(gene_trees)) {
194 |   tree <- gene_trees[[i]]
195 | 
196 |   #remove genes with less than 'threshold' ingroup taxa
197 |   if(length(which(tree$tip.label %in% IG)) < threshold) next
198 |   
199 |   #if OGs are defined and present in this tree, root with them
200 |   if(length(IG) != length(species_tree$tip.label)) {
201 |     if(any(OG %in% tree$tip.label)) {
202 |       MRCA <- getMRCA(tree, which(tree$tip.label %in% IG))
203 |       tree_rooted <- root(tree, node = MRCA)
204 |       root_tip_var[i] <- var(dist.nodes(tree_rooted)[MRCA, 
205 |                                                      which(tree_rooted$tip.label %in% IG)])
206 |       
207 |       #after this remove the OGs from the gene tree
208 |       tree <- drop.tip(tree, which(tree$tip.label %in% OG))
209 |     } else {
210 |       tree_rooted <- midpoint.root(tree)
211 |       root_tip_var[i] <- var(dist.nodes(tree_rooted)[(length(tree_rooted$tip.label) + 1), 
212 |                                                      (1:length(tree_rooted$tip.label))])
213 |     }
214 |   } else {  #otherwise do midpoint rooting
215 |     tree_rooted <- midpoint.root(tree)
216 |     root_tip_var[i] <- var(dist.nodes(tree_rooted)[(length(tree_rooted$tip.label) + 1), 
217 |                                                    (1:length(tree_rooted$tip.label))])
218 |   }
219 |   
220 |   average_BS_support[i] <- mean(as.numeric(tree$node.label), na.rm = T)
221 |   if(is.nan(average_BS_support[i])) average_BS_support[i] <- 0
222 |   
223 |   #remove taxa from species tree to match gene tree sampling
224 |   if(length(which(species_tree$tip.label %not in% tree$tip.label)) > 0) {
225 |     this_species_tree <- drop.tip(species_tree, which(species_tree$tip.label %not in% 
226 |                                                         tree$tip.label))
227 |   } else {
228 |     this_species_tree <- species_tree
229 |   }
230 |   
231 |   if(topological_similarity) {
232 |     robinson_sim[i] <- 1 - suppressMessages(RF.dist(this_species_tree, tree, 
233 |                                                     normalize = TRUE, 
234 |                                                     check.labels = TRUE))
235 |   }
236 |   
237 |   patristic_dist <- as.matrix(distTips(tree, tips = 'all', 
238 |                                        method = 'patristic', useC = T))
239 |   
240 |   #get gene sequence
241 |   gene <- as.character(data[,partitions$Start[i]:partitions$End[i]])
242 |   
243 |   #remove OGs
244 |   if(length(IG) != length(species_tree$tip.label)) {
245 |     gene <- gene[-which(rownames(gene) %in% OG),]
246 |   }
247 |   ntax <- dim(gene)[1]
248 |   
249 |   #remove taxa not in tree (e.g., those with no data for this loci)
250 |   if(any(rownames(gene) %not in% tree$tip.label)) {
251 |     gene <- gene[-which(rownames(gene) %not in% tree$tip.label),]
252 |   }
253 |   
254 |   #remove entirely empty positions (might originate from pruning OGs for
255 |   #example)
256 |   all_missing <- which(apply(gene, 2, function(x) all(x %in% c('-','?','X'))))
257 |   if(length(all_missing) > 0) {
258 |     gene <- gene[,-all_missing]
259 |   }
260 |   
261 |   variable_sites[i] <- 1 - (length(which(apply(gene, 2, inv))) / dim(gene)[2])
262 |   missing[i] <- length(which(gene %in% c('-','?','X'))) / 
263 |     (dim(gene)[1]*dim(gene)[2])
264 |   length[i] <- dim(gene)[2]
265 |   occupancy[i] <- dim(gene)[1]/ntax
266 |   
267 |   p_dist <- as.matrix(dist.hamming(as.phyDat(gene, type = type), ratio = TRUE, 
268 |                                    exclude = "pairwise"))
269 |   p_dist <- p_dist[order(colnames(p_dist)),order(colnames(p_dist))]
270 |   patristic_dist <- patristic_dist[order(colnames(patristic_dist)), 
271 |                                    order(colnames(patristic_dist))]
272 |   p_dist <- p_dist[lower.tri(p_dist)]
273 |   patristic_dist <- patristic_dist[lower.tri(patristic_dist)]
274 |   av_patristic[i] <- mean(patristic_dist)
275 |   saturation[i] <- 1 - lm(p_dist ~ patristic_dist)$coefficients[[2]]
276 |   if(is.na(saturation[i])) saturation[i] <- 0
277 |   if(saturation[i] > 1) saturation[i] <- 1
278 |   if(saturation[i] < 0) saturation[i] <- 0
279 |   
280 |   #if sequence is made of AA, calculate comp. heterogeneity
281 |   if(type == 'AA') {
282 |     mean_freqs <- table(c(gene))
283 |     states <- sort(unique(c(gene)))
284 |     if('-' %in% states) {
285 |       mean_freqs <- mean_freqs[-which(states == '-')]
286 |       states <- states[-which(states == '-')]
287 |     }
288 |     if('?' %in% states) {
289 |       mean_freqs <- mean_freqs[-which(states == '?')]
290 |       states <- states[-which(states == '?')]
291 |     }
292 |     if('X' %in% states) {
293 |       mean_freqs <- mean_freqs[-which(states == 'X')]
294 |       states <- states[-which(states == 'X')]
295 |     }
296 |     mean_freqs <- mean_freqs / sum(mean_freqs)
297 |     
298 |     freqs <- lapply(split(gene, seq(nrow(gene))), table)
299 |     freqs <- lapply(freqs, remove_empty)
300 |     for(j in 1:length(freqs)) {
301 |       if(!all(states %in% names(freqs[[j]]))) {
302 |         miss <- states[which(states %not in% names(freqs[[j]]))]
303 |         add <- rep(0, length(miss))
304 |         names(add) <- miss
305 |         add <- as.table(add)
306 |         freqs[[j]] <- as.table(c(freqs[[j]], add))
307 |         freqs[[j]] <- freqs[[j]][order(factor(names(freqs[[j]])))]
308 |       }
309 |     }
310 |     
311 |     freqs <- lapply(freqs, function(x) x/sum(x))
312 |     freqs <- lapply(freqs, function(x) abs(x-mean_freqs))
313 |     freqs <- lapply(freqs, function(x) x/dim(gene)[1])
314 |     freqs <- lapply(freqs, function(x) sum(unlist(x)))
315 |     RCFV[i] <- sum(unlist(freqs))
316 |   }
317 |   
318 |   tree_length[i] <- sum(tree$edge.length)
319 |   rate[i] <- sum(tree$edge.length)/length(tree$tip.label)
320 |   treeness[i] <- 1 - (sum(tree$edge.length[which(tree$edge[,2] %in% 
321 |                                                    c(1:length(tree$tip.label)))]) / 
322 |                         sum(tree$edge.length))
323 |   if(is.nan(treeness[i])) treeness[i] <- 0
324 | }
325 | 
326 | #gather gene properties
327 | variables <- data.frame(genes, root_tip_var, saturation, missing, rate, 
328 |                         tree_length, treeness, av_patristic, RCFV, length, 
329 |                         occupancy, variable_sites, average_BS_support, 
330 |                         robinson_sim)
331 | 
332 | if(type == 'DNA') {
333 |   variables <- variables[,-which(colnames(variables) == 'RCFV')]
334 | } else {
335 |   if(any(is.na(variables$RCFV))) {
336 |     cat('This script will soon crash, as loci and gene trees 
337 |         are not composed of the same taxa.\n')
338 |     cat('Most likely this is due to loci and gene trees not 
339 |         being in the same order.\n')
340 |     cat('This makes the estimation of some gene properties to fail\n')
341 |   }
342 | }
343 | 
344 | if(!topological_similarity) {
345 |   variables <- variables[,-which(colnames(variables) == 'robinson_sim')]
346 | }
347 | 
348 | if(any(apply(variables, 2, function(x) length(unique(x))) == 1)) {
349 |   cat('Some requested variables were not successfully estimated.\n')
350 |   cat('This script will soon crash.\n')
351 |   cat('Maybe double check that your gene trees have support values?.\n')
352 | }
353 | 
354 | #remove those with less than 'threshold' taxa
355 | useless <- which(apply(variables[,-1], 1, function(x) all(x == 0)))
356 | if(length(useless) > 0) {
357 |   variables <- variables[-useless,]
358 | }
359 | 
360 | #Select gene properties for PCA (RCFV is only included if type == 'AA)
361 | variables_to_use <- which(colnames(variables) %in% c('root_tip_var', 'saturation', 
362 |                                                      'av_patristic', 'RCFV', 
363 |                                                      'variable_sites', 'average_BS_support', 
364 |                                                      'robinson_sim'))
365 | 
366 | if(any(is.na(variables[,variables_to_use]))) {
367 |   cat('Something went wrong. Most likely the order of genes 
368 |       and gene trees does not match.\n')
369 | }
370 | 
371 | #perform PCA
372 | PCA <- princomp(variables[,variables_to_use], cor = T, scores = T)
373 | 
374 | column_names <- c('Gene name', 'Position in dataset', 'Root-tip var.', 
375 |                   'Saturation', 'Missing data', 'Evolutionary rate', 
376 |                   'Tree length', 'Treeness', 'Av patristic dist.',
377 |                   'Comp. heterogeneity', 'Alignment length', 'Occupancy', 
378 |                   'Prop. variable sites', 'Av. boostrap support', 
379 |                   'RF similarity')
380 | 
381 | if(type == 'DNA') column_names <- column_names[-10]
382 | 
383 | if(remove_outliers) {
384 |   #estimate Mahalanobis distances
385 |   maha_distances <- order(mahalanobis(PCA$scores, rep(0, length(variables_to_use)), 
386 |                                       cov(PCA$scores)), decreasing = T)
387 |   if(outlier_fraction >= 1/nrow(variables)) {
388 |     #remove the loci within the top outlier_fraction
389 |     outliers <- maha_distances[1:floor(nrow(variables) * outlier_fraction)]
390 |     outliers <- sort(outliers)
391 |     
392 |     outlier_properties <- data.frame(variables[outliers,])
393 |     outlier_properties <- cbind(data.frame(names = names[outliers]), 
394 |                                 outlier_properties)
395 |     
396 |     colnames(outlier_properties) <- column_names[1:ncol(outlier_properties)]
397 |     write.csv(outlier_properties, 
398 |               file = paste0(getwd(), '/properties_outliers.csv'), row.names = F)
399 |     
400 |     #redo PCA
401 |     PCA <- princomp(variables[-outliers,variables_to_use], cor = T, scores = T)
402 |     
403 |     #get scores for loci along dimensions 1 and 2
404 |     PC_1 <- PCA$scores[,1]
405 |     PC_2 <- PCA$scores[,2]
406 |     for(i in 1:length(outliers)) {
407 |       if(outliers[i] == 1) {
408 |         PC_1 <- c(NA, PC_1)
409 |         PC_2 <- c(NA, PC_2)
410 |       } else {
411 |         if(outliers[i] < length(PC_1)) {
412 |           PC_1 <- c(PC_1[1:(outliers[i]-1)], NA, PC_1[outliers[i]:length(PC_1)])
413 |           PC_2 <- c(PC_2[1:(outliers[i]-1)], NA, PC_2[outliers[i]:length(PC_2)])
414 |         } else {
415 |           PC_1 <- c(PC_1, NA)
416 |           PC_2 <- c(PC_2, NA)
417 |         }
418 |       }
419 |     }
420 |   } else {
421 |     cat('Not enough genes to remove even 1 loci, 
422 |         use different outlier_fraction.\n')
423 |     PC_1 <- PCA$scores[,1]
424 |     PC_2 <- PCA$scores[,2]
425 |   }
426 | } else {
427 |   PC_1 <- PCA$scores[,1]
428 |   PC_2 <- PCA$scores[,2]
429 | }
430 | 
431 | variables <- cbind(variables, PC_1, PC_2)
432 | if(any(is.na(variables))) variables <- variables[which(complete.cases(variables)),]
433 | 
434 | #Deal with subsampling threshold
435 | if(n_genes == 'all') {
436 |   n_genes <- nrow(variables)
437 |   cut <- F
438 | } else {
439 |   if(is.character(n_genes)){
440 |     n_genes <- as.numeric(n_genes)
441 |   }
442 |   cut <- T
443 | }
444 | 
445 | #C) Attempt to find usefulness axis automatically----------------------------------------------------
446 | 
447 | #estimate correlation between PCs and rate
448 | corr_rate_PC1 = cor.test(variables$rate, variables$PC_1)$estimate
449 | corr_rate_PC2 = cor.test(variables$rate, variables$PC_2)$estimate
450 | 
451 | #if one of the major PCs is rate
452 | if(corr_rate_PC1 > 0.7 | corr_rate_PC2 > 0.7) {
453 |   #if correlation between PC1 and rate is higher
454 |   if(corr_rate_PC1 > corr_rate_PC2) {
455 |     #is it really high?
456 |     really_high <- corr_rate_PC1 > 0.7
457 |     #and much higher than PC2?
458 |     much_higher <- corr_rate_PC1 > (corr_rate_PC2 * 2)
459 |     if(really_high & much_higher) {
460 |       PC_rate <- 'PC_1' #definitely PC1
461 |     } else {
462 |       PC_rate <- 'PC_1_maybe' #posibly PC1
463 |     }
464 |   } else {
465 |     #is it really high?
466 |     really_high <- corr_rate_PC2 > 0.7
467 |     #and much higher than PC1?
468 |     much_higher <- corr_rate_PC2 > (corr_rate_PC1 * 2)
469 |     if(really_high & much_higher) {
470 |       PC_rate <- 'PC_2' #definitely PC2
471 |     } else {
472 |       PC_rate <- 'PC_2_maybe' #posibly PC2
473 |     }
474 |   }
475 | } else {
476 |   PC_rate <- 'unknown'
477 | }
478 | 
479 | if(PC_rate != 'unknown') {
480 |   #is rate also usefulness?? 
481 |   #i.e. should we choose the fastest/slowest evolving loci?
482 |   loadings_usefulness <- loadings(PCA)[][,as.numeric(unlist(strsplit(PC_rate, 
483 |                                                                      '_'))[2])]
484 |   biases <- which(names(loadings_usefulness) %in% c('root_tip_var', 'saturation', 
485 |                                                     'av_patristic', 'RCFV'))
486 |   signal <- which(names(loadings_usefulness) %in% c('average_BS_support', 
487 |                                                     'robinson_sim'))
488 |   if(all(loadings_usefulness[signal] < 0)) {
489 |     if(length(which(loadings_usefulness[biases] > 0)) >= 2) {
490 |       PC_usefulness <- as.numeric(unlist(strsplit(PC_rate, '_'))[2])
491 |       direction <- 'clear'
492 |       descending <- F
493 |     } else {
494 |       direction <- 'unclear'
495 |     }
496 |   } else {
497 |     if(all(loadings_usefulness[signal] > 0)) {
498 |       if(length(which(loadings_usefulness[biases] < 0)) >= 2) {
499 |         PC_usefulness <- as.numeric(unlist(strsplit(PC_rate, '_'))[2])
500 |         direction <- 'clear'
501 |         descending <- T
502 |       } else {
503 |         direction <- 'unclear'
504 |       }
505 |     } else {
506 |       direction <- 'unclear'
507 |     }
508 |   }
509 |   
510 |   #rate != usefulness, can we find usefulness??
511 |   if(direction == 'unclear') {
512 |     if(as.numeric(unlist(strsplit(PC_rate, '_'))[2]) == 1) {
513 |       PC_usefulness <- 2
514 |     } else {
515 |       PC_usefulness <- 1
516 |     }
517 |     
518 |     loadings_usefulness <- loadings(PCA)[][,PC_usefulness]
519 |     if(all(loadings_usefulness[signal] < 0)) {
520 |       if(length(which(loadings_usefulness[biases] > 0)) >= 2) {
521 |         direction <- 'clear'
522 |         descending <- F
523 |         cat('A usefulness axis has been found!\n')
524 |       } else {
525 |         direction <- 'unclear'
526 |       }
527 |     } else {
528 |       if(all(loadings_usefulness[signal] > 0)) {
529 |         if(length(which(loadings_usefulness[biases] < 0)) >= 2) {
530 |           direction <- 'clear'
531 |           descending <- T
532 |           cat('A usefulness axis has been found!\n')
533 |         }
534 |         direction <- 'unclear'
535 |       } else {
536 |         direction <- 'unclear'
537 |       }
538 |     }
539 |   } else {
540 |     cat('Rate == usefulness. Proceed and loci will be sorted by rate.\n')
541 |   }
542 | } else {
543 |   cat('The dataset does not seem to conform to expectations.\n')
544 |   cat('Multivariate subsampling is not appropriate. Sorry!\n')
545 |   direction <- 'unclear'
546 | }
547 | 
548 | #D) Sort & Subsample------------------------------------------------------------------------------
549 | if(grepl('maybe', PC_rate)) {
550 |   cat('There seems to be some ambiguity as to the identity of the axes.\n') 
551 |   cat('Proceed with caution and check PCA loadings to see 
552 |       if sorting is appropriate.\n')
553 | }
554 | 
555 | if(direction == 'unclear') {
556 |   cat('It is unclear how to sort the data.\n') 
557 |   cat('You can the check loadings and decide manually how to proceed.\n') 
558 |   cat('In the absense of a clear usefulness axis 
559 |       my best guess is to sort by rates.\n')
560 |   cat('(i.e., choose the slowest evolving genes)\n')
561 |   
562 |   variables_sorted <- variables[order(variables[,'rate'], decreasing = F),]
563 | }
564 | 
565 | if(direction == 'clear') {
566 |   cat('Usefulness axis explains', round((PCA$sdev[PC_usefulness]^2 / 
567 |                                            sum(PCA$sdev^2)) * 100, digits = 2), 
568 |       'percentage of variance\n')
569 |   
570 |   usefulness_col <- grep(PC_usefulness, colnames(variables))
571 |   
572 |   #sort by usefulness
573 |   variables_sorted <- variables[order(variables[,usefulness_col], 
574 |                                       decreasing = descending),]
575 | }
576 | 
577 | #output properties of the dataset
578 | variables_sorted_tosave <- cbind(data.frame(names = names[variables_sorted$genes]), 
579 |                                  variables_sorted)
580 | if(topological_similarity) {
581 |   colnames(variables_sorted_tosave) <- c(column_names, 'PC1', 'PC2')
582 | } else {
583 |   colnames(variables_sorted_tosave) <- c(column_names[-length(column_names)], 
584 |                                          'PC1', 'PC2')
585 | }
586 | 
587 | write.csv(variables_sorted_tosave, 
588 |           file = paste0(getwd(), '/properties_sorted_dataset.csv'), 
589 |           row.names = F)
590 | 
591 | ###WARNING: uncomment and modify the following lines if you would like to sort
592 | ###and subsample by a different property, for example occupancy or RF similarity
593 | #variables_sorted <- variables[order(variables[,'occupancy'], decreasing = T),]
594 | #variables_sorted <- variables[order(variables[,'robinson_sim'], decreasing = T),]
595 | 
596 | #sort entire dataset according to the sorting order imposed
597 | positions <- c()
598 | for(j in 1:nrow(variables_sorted)) {
599 |   positions <- c(positions, partitions$Start[variables_sorted$genes[j]]:partitions$End[variables_sorted$genes[j]])
600 | }
601 | sorted_data <- data[,positions]
602 | 
603 | #sort partitions
604 | sorted_partitions <- partitions[variables_sorted$genes,]
605 | for(j in 1:nrow(sorted_partitions)) {
606 |   dif <- sorted_partitions$End[j] - sorted_partitions$Start[j]
607 |   if(j == 1) {
608 |     sorted_partitions$Start[j] <- 1
609 |   } else {
610 |     sorted_partitions$Start[j] <- sorted_partitions$End[j-1] + 1
611 |   }
612 |   sorted_partitions$End[j] <- sorted_partitions$Start[j]+dif
613 | }
614 | 
615 | #sort gene names
616 | sorted_names <- names[variables_sorted$genes]
617 | 
618 | #sort genes
619 | sorted_trees <- gene_trees[variables_sorted$genes]
620 | 
621 | #subsample (or not if n_genes was left == 0)
622 | sorted_names <- sorted_names[1:n_genes]
623 | sorted_partitions <- sorted_partitions[1:n_genes,]
624 | sorted_data <- sorted_data[,1:sorted_partitions$End[n_genes]]
625 | sorted_trees <- sorted_trees[1:n_genes]
626 | 
627 | write.phyDat(phyDat(sorted_data, type = type), 
628 |              file = paste0(getwd(), '/sorted_alignment_', 
629 |                            n_genes, 'genes.fa'), format = 'fasta')
630 | partitions_tosave <- paste0(sorted_names, ' = ', sorted_partitions$Start, '-', 
631 |                             sorted_partitions$End)
632 | write(partitions_tosave, file = paste0(getwd(), '/sorted_alignment_', 
633 |                                        n_genes, 'genes.txt'))
634 | write.tree(sorted_trees, file = paste0(getwd(), '/sorted_trees_', 
635 |                                        n_genes, 'genes.tre'))
636 | 
637 | #E) Optional: visualize some sorting results------------------------------------------------------------------------------------
638 | variables_to_plot <- data.frame(gene = rep(variables_sorted$genes, ncol(PCA$loadings)),
639 |                                value = c(as.matrix(variables_sorted[,variables_to_use])),
640 |                                property = rep(colnames(variables_sorted[,variables_to_use]), 
641 |                                               each = nrow(variables_sorted)),
642 |                                pos = rep(1:nrow(variables_sorted), ncol(PCA$loadings)))
643 | 
644 | order_properties <- as.character(unique(variables_to_plot$property))
645 | 
646 | if(type == 'DNA') {
647 |   if(topological_similarity) {
648 |     labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 
649 |               'Prop. of variable sites', 'Average bootstrap', 'RF similarity')
650 |     colors <- c('#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881', '#586160')
651 |   } else {
652 |     labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 
653 |               'Prop. of variable sites', 'Average bootstrap')
654 |     colors <- c('#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881')
655 |   }
656 | } else {
657 |   if(topological_similarity) {
658 |     labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 'Comp. heterogeneity', 
659 |               'Prop. of variable sites', 'Average bootstrap', 'RF similarity')
660 |     colors <- c('#5F1202', '#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881', '#586160')
661 |   } else {
662 |     labs <- c('Root-to-tip variance', 'Level of saturation', 'Av. patristic distance', 'Comp. heterogeneity', 
663 |               'Prop. of variable sites', 'Average bootstrap')
664 |     colors <- c('#5F1202', '#8A2B0E', '#C75E24', '#C69E57', '#868568', '#5F7881')
665 |   }
666 | }
667 | 
668 | for(i in 1:length(unique(variables_to_plot$property))) {
669 |   main_plot <- ggplot(subset(variables_to_plot, property == 
670 |                                as.character(unique(variables_to_plot$property)[i])), 
671 |                       aes(x = pos, y = value, color = property)) + 
672 |     geom_point(alpha = 0.2, shape = 16) + geom_smooth(method = 'gam', se = F) +
673 |     theme_bw() + theme(legend.position = "none") +
674 |     xlab('Sorted position') + ylab('Value') + 
675 |     scale_color_manual(values = colors[i]) + ggtitle(labs[i]) + 
676 |     theme(plot.title = element_text(hjust = 0.5))
677 |   
678 |   inset_plot <- ggplot(subset(variables_to_plot, property == 
679 |                                 as.character(unique(variables_to_plot$property)[i])), 
680 |                        aes(x = pos, y = value, color = property)) + 
681 |     geom_smooth(method = 'gam', se = T) +
682 |     theme_bw() + theme(legend.position = "none") + 
683 |     theme(axis.title.x = element_blank(), axis.title.y = element_blank(), 
684 |           axis.text.x = element_blank(), panel.grid.major = element_blank(), 
685 |           panel.grid.minor = element_blank(), axis.ticks.x = element_blank()) + 
686 |     scale_color_manual(values = colors[i])
687 |   
688 |   if(cut) inset_plot <- inset_plot + geom_vline(xintercept = n_genes, linetype = 'dashed')
689 |   
690 |   if(length(unique(variables_to_plot$property)) == 7) second_row = 5
691 |   if(length(unique(variables_to_plot$property)) <= 6) second_row = 4
692 |   
693 |   if(i < second_row) {
694 |     plot_with_inset <- ggdraw() + suppressMessages(draw_plot(main_plot)) + 
695 |       suppressMessages(draw_plot(inset_plot, x = 0.15, y = 0.67, 
696 |                                  width = .3, height = .25))
697 |   } else {
698 |     plot_with_inset <- ggdraw() + suppressMessages(draw_plot(main_plot)) + 
699 |       suppressMessages(draw_plot(inset_plot, x = 0.15, y = 0.10, 
700 |                                  width = .3, height = .25))
701 |   }
702 |   
703 |   assign(paste0('plot', letters[i]), plot_with_inset)
704 | }
705 | 
706 | if(type == 'AA') {
707 |   if(topological_similarity) {
708 |     final_plot <- plot_grid(plota, plotb, plotc, plotd, 
709 |                             plote, plotf, plotg, nrow = 2)
710 |   } else {
711 |     final_plot <- plot_grid(plota, plotb, plotc, plotd, 
712 |                             plote, plotf, nrow = 2)
713 |   }
714 | } else {
715 |   if(topological_similarity) {
716 |     final_plot <- plot_grid(plota, plotb, plotc, plotd, 
717 |                             plote, plotf, nrow = 2)
718 |   } else {
719 |     final_plot <- plot_grid(plota, plotb, plotc, plotd, 
720 |                             plote, nrow = 2)
721 |   }
722 | }
723 | 
724 | #plot and save
725 | plot(final_plot)
726 | ggsave(paste0('sorted_figure_', n_genes, 'genes.pdf'), plot = final_plot, 
727 |        width = 16, height = 9, units = 'in')
728 | 


--------------------------------------------------------------------------------