├── .gitignore ├── LICENSE ├── README.md ├── conda-requirements.txt ├── examples └── descriptor-huH1N1-wgs.csv ├── prepare_dataset.sh ├── setup.py ├── treesort.py ├── treesort ├── __init__.py ├── cli.py ├── helpers.py ├── jc_reassortment_test.py ├── lm_outlier_detector.py ├── options.py ├── parsimony.py ├── reassortment_inference.py ├── reassortment_utils.py ├── tree_indexer.py └── version.py ├── treetime-root.py └── tutorial ├── figures ├── TreeSort-illustration.png ├── TreeSort-logo-150.png ├── TreeSort-logo-300.png ├── swH1-reassortment-ex1.png ├── swH1-reassortment-ex2.png ├── swH1-reassortment-ex3.png └── swH1-reassortment-ex4.png ├── swH1-dataset └── swine_H1_HANA.fasta └── swH1-parsed ├── HA-swine_H1_HANA.fasta.aln ├── HA-swine_H1_HANA.fasta.aln.dates.csv ├── HA-swine_H1_HANA.fasta.aln.rooted.tre ├── HA-swine_H1_HANA.fasta.aln.treetime ├── outliers.tsv ├── root_to_tip_regression.pdf └── rtt.csv ├── HA-swine_H1_HANA.fasta.tre ├── NA-swine_H1_HANA.fasta.aln ├── NA-swine_H1_HANA.fasta.aln.dates.csv ├── NA-swine_H1_HANA.fasta.aln.rooted.tre ├── NA-swine_H1_HANA.fasta.aln.treetime ├── outliers.tsv ├── root_to_tip_regression.pdf └── rtt.csv ├── NA-swine_H1_HANA.fasta.tre └── descriptor.csv /.gitignore: -------------------------------------------------------------------------------- 1 | # Phylogenetic/fasta test files. 
2 | *.tre 3 | *.nexus 4 | *.aln 5 | *.fasta 6 | *.fna 7 | 8 | # TreeTime output 9 | treetime* 10 | 11 | # General 12 | *.pdf 13 | *.ppt 14 | *.pptx 15 | *.zip 16 | *.csv 17 | 18 | 19 | # Compiled Python bytecode and related files 20 | *.py[cod] 21 | dist/ 22 | build/ 23 | *.egg-info/ 24 | __pycache__/ 25 | 26 | # Log files 27 | *.log 28 | 29 | # JetBrains IDE 30 | .idea/ 31 | 32 | # Unit test reports 33 | TEST*.xml 34 | .pytest* 35 | 36 | # Generated by MacOS 37 | .DS_Store 38 | 39 | # Python virtual environment 40 | venv/ 41 | 42 | # Simulation files 43 | sims/ 44 | 45 | # Other 46 | testfiles/ 47 | misc/ 48 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Flu-crew at the National Animal Disease Center 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TreeSort # 2 | 3 | ![TreeSort logo](tutorial/figures/TreeSort-logo-150.png) 4 | 5 | TreeSort infers both recent and ancestral reassortment events along the branches of a phylogenetic tree of a fixed genomic segment. 6 | It uses a statistical hypothesis testing framework to identify branches where reassortment with other segments has occurred and reports these events. 7 | 8 | 11 | 12 | Below is an example of two reassortment events inferred by TreeSort on a swine H1 dataset. The reference phylogeny is the hemagglutinin (HA) segment tree, and the branch annotations indicate reassortment relative to HA's evolutionary history. The annotations list the acquired gene segments and how distant these segments were (# of nucleotide differences) from the original segments. For example, `PB2(136)` indicates that a new PB2 was acquired, approximately 136 nucleotides different from the pre-reassortment PB2. 13 | 14 | 15 |
16 | 17 |
18 | 19 | ### Citation ### 20 | **If you use TreeSort, please cite it as**
21 | *Markin, A., Macken, C.A., Baker, A.L., and Anderson, T.K. Revealing reassortment in influenza A viruses with TreeSort. bioRxiv 2024.11.15.623781; [doi: https://doi.org/10.1101/2024.11.15.623781](https://doi.org/10.1101/2024.11.15.623781).* 22 | 23 | N.B. TreeSort uses TreeTime in a subroutine to infer substitution rates for segments - please also cite *Sagulenko et al. 2018 [doi: 10.1093/ve/vex042](https://doi.org/10.1093/ve/vex042).* 24 | 25 | ### Installation ### 26 | For a default installation, run `pip install treesort`. Alternatively, you can download this repository and run `pip install .` from within the downloaded directory. TreeSort requires **Python 3** to run and depends on SciPy, BioPython, DendroPy, and TreeTime (these dependencies will be installed automatically). 27 | 28 | For a broader installation of the bioinformatics suite required to align sequences and build phylogenetic trees via the [prepare_dataset.sh](prepare_dataset.sh) script that we provide, we recommend using a conda environment that can be set up as follows. 29 | 30 | If you haven't already, configure bioconda. 31 | ``` 32 | conda config --add channels bioconda 33 | conda config --add channels conda-forge 34 | conda config --set channel_priority strict 35 | ``` 36 | Then create a new environment with required dependencies and install TreeSort inside that environment. 37 | ``` 38 | git clone https://github.com/flu-crew/TreeSort.git 39 | cd TreeSort 40 | conda create -n treesort-env --file conda-requirements.txt 41 | conda activate treesort-env 42 | pip install . 43 | 44 | conda deactivate 45 | ``` 46 | 47 | 48 | ## Tutorial ## 49 | We use a swine H1 influenza A virus dataset for this tutorial. We include only HA and NA gene segments in this analysis for simplicity, but it can be expanded to all 8 segments. 50 | **Please note** that all sequences should have the dates of collection included in the deflines, and all metadata fields should be separated by "|". 
E.g., "A/swine/Iowa/A02934932/2017|1A.3.3.2|2017-05-12". 51 | 52 | To start, we will install TreeSort using the conda method described above: 53 | ``` 54 | git clone https://github.com/flu-crew/TreeSort.git # Download this repo 55 | cd TreeSort 56 | conda create -n treesort-env --file conda-requirements.txt # Create a new conda env and install dependencies 57 | conda activate treesort-env 58 | pip install . # Install TreeSort 59 | ``` 60 | 61 | ### Creating a descriptor file ### 62 | 63 | The input to TreeSort is a **descriptor** file: a comma-separated (csv) file that describes where the alignments and trees for the individual segments can be found. Here is an [example descriptor file](examples/descriptor-huH1N1-wgs.csv). For our case, the descriptor file could look as follows (the column headings should not be included): 64 | 65 | | segment name | path to the fasta alignment | path to the newick-formatted tree | 66 | | --- | --- | --- | 67 | | *HA | HA-swine_H1_HANA.fasta.aln | HA-swine_H1_HANA.fasta.aln.rooted.tre 68 | | NA | NA-swine_H1_HANA.fasta.aln | NA-swine_H1_HANA.fasta.aln.rooted.tre 69 | 70 | The star symbol (\*) indicates the segment that will be used as the reference phylogeny; reassortment events will be inferred relative to this phylogeny (HA in this case). Note that the reference phylogeny should be **rooted**, whereas the trees for other segments can be unrooted. 71 | 72 | We will use the [prepare_dataset.sh](prepare_dataset.sh) bash script to automatically build alignments and trees for the two segments in our swine dataset and compile a descriptor file. The script relies on the fact that every sequence has a segment name in the middle of the defline (e.g., |HA| or |4|). 73 | 74 | 75 | 76 | 77 | ``` 78 | ./prepare_dataset.sh --fast --segments "HA,NA" tutorial/swH1-dataset/swine_H1_HANA.fasta HA tutorial/swH1-parsed 79 | ``` 80 | To make things faster, we use the `--fast` flag here so that all trees are built using FastTree. 
However, we do not recommend using this flag for high-precision analyses. When this flag is not used, the script will build the reference phylogeny using IQ-Tree, which will be slower but will likely result in a better-quality tree, and therefore more accurate reassortment inference. 81 | 82 | The required arguments to the script are the path to the main fasta file, the name of the reference segment, and the path to the output directory. If `--segments` is not specified, the script assumes that the 8 standard IAV segment names should be used (PB2, PB1, PA, HA, NP, NA, MP, NS). 83 | 84 | Running the above command will save the descriptor file, all trees, and alignments to the `tutorial/swH1-parsed` directory. Note that if you already have trees built for your data, you can manually create the descriptor file without using the script. 85 | 86 | 87 | ### Running TreeSort ### 88 | First, make sure to familiarize yourself with the options available in the tool by looking through the help message. 89 | ``` 90 | treesort -h 91 | ``` 92 | 93 | With the descriptor file from above, TreeSort can be run as follows 94 | ``` 95 | cd tutorial/swH1-parsed/ 96 | treesort -i descriptor.csv -o swH1-HA.annotated.tre 97 | ``` 98 | To run the newest mincut algorithm for reassortment inference (see details [here](https://github.com/flu-crew/TreeSort/releases/tag/0.3.0)), please use 99 | ``` 100 | treesort -i descriptor.csv -o swH1-HA.annotated.tre -m mincut 101 | ``` 102 | 103 | TreeSort will first estimate molecular clock rates for each segment and then infer reassortment and annotate the backbone tree. The output tree in nexus format (`swH1-HA.annotated.tre`) can be visualized in FigTree or [icytree.org](https://icytree.org/). You can view the inferred reassortment events by displaying the **'rea'** annotations on tree edges, as shown in the Figure above. 
104 | 105 | In this example, TreeSort identifies a total of 93 HA-NA reassortment events: 106 | ``` 107 | Inferred reassortment events with NA: 93. 108 | Identified exact branches for 79/93 of them 109 | ``` 110 | 111 | Additionally, the method outputs the estimated reassortment rate per ancestral lineage per year. The rate translates to the probability that a single strain undergoes a reassortment event over the course of a year. In our case, this probability of reassortment with NA is approximately 4%. 112 | 113 | Below is a part of the TreeSort output, where we see two consecutive NA reassortment events. The NA clade classifications were added to the strain names to make these reassortment events easier to interpret. Here we had a 2002 NA -> 1998A NA switch, followed by a 1998A -> 2002B NA switch. 114 |
115 | 116 |
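The rate-to-probability translation above can be sketched with a short calculation, under the assumption (ours, not stated by TreeSort) that reassortment events along a lineage behave like a Poisson process; for small rates, the probability of at least one event per year is close to the rate itself:

```python
import math

# Assumption: reassortment along an ancestral lineage is a Poisson process
# with the reported per-lineage-per-year rate.
rate = 0.04  # roughly the HA-NA reassortment rate reported for this dataset
p_at_least_one = 1 - math.exp(-rate)  # P(at least one event within a year)
print(round(p_at_least_one, 4))  # 0.0392
```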
117 | 118 | ### Uncertain reassortment placement (the '?' tag) ### 119 | Note that this section only applies to the `-m local` inference method (the default method for TreeSort). The `-m mincut` method always infers certain reassortment placements. 120 | 121 | Sometimes TreeSort does not have enough information to confidently place a reassortment event on a specific branch of the tree. TreeSort always narrows down the reassortment event to a particular ancestral node on a tree, but may not distinguish which of the child branches was affected by reassortment. In those cases, TreeSort will annotate both child branches with a `?` tag. For example, `?PB2(26)` below indicates that the reassortment with PB2 might have happened on either of the child branches. 122 | 123 |
124 | 125 |
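When post-processing these annotations downstream, it can help to split each entry into its components. Below is a small sketch based on the annotation format shown in this tutorial (`SEG(distance)`, optionally prefixed with `?`); the `parse_rea_entry` helper is ours, not part of TreeSort:

```python
import re

def parse_rea_entry(entry):
    # "PB2(136)" -> certain placement; "?PB2(26)" -> uncertain placement.
    m = re.fullmatch(r"(\?)?(\w+)\((\d+)\)", entry)
    if not m:
        raise ValueError(f"Unrecognized annotation entry: {entry}")
    uncertain, segment, distance = m.groups()
    return {"segment": segment, "distance": int(distance),
            "uncertain": uncertain is not None}

print(parse_rea_entry("?PB2(26)"))
# {'segment': 'PB2', 'distance': 26, 'uncertain': True}
```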
126 | 127 | Typically, this happens when the sampling density is low. Therefore, increasing the sampling density by including more strains in the analysis may resolve such instances. 128 | -------------------------------------------------------------------------------- /conda-requirements.txt: -------------------------------------------------------------------------------- 1 | fasttree 2 | iqtree 3 | mafft 4 | pip 5 | smof 6 | -------------------------------------------------------------------------------- /examples/descriptor-huH1N1-wgs.csv: -------------------------------------------------------------------------------- 1 | PB2, ../huH1N1/USA/PB2.final.aln, ../huH1N1/USA/PB2.fasttree.tre 2 | PB1, ../huH1N1/USA/PB1.final.aln, ../huH1N1/USA/PB1.fasttree.tre 3 | PA, ../huH1N1/USA/PA.final.aln, ../huH1N1/USA/PA.fasttree.tre 4 | *HA, ../huH1N1/USA/HA.final.aln, ../huH1N1/USA/HA.final.aln.rooted.tre 5 | NP, ../huH1N1/USA/NP.final.aln, ../huH1N1/USA/NP.fasttree.tre 6 | NA, ../huH1N1/USA/NA.final.aln, ../huH1N1/USA/NA.fasttree.tre 7 | MP, ../huH1N1/USA/MP.final.aln, ../huH1N1/USA/MP.fasttree.tre 8 | NS, ../huH1N1/USA/NS.final.aln, ../huH1N1/USA/NS.fasttree.tre 9 | -------------------------------------------------------------------------------- /prepare_dataset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Usage: ./prepare_dataset.sh [--segments "..." --fast] fasta_path reference_segment outdir 4 | # Using --fast causes all trees to be inferred with FastTree. 5 | # By default (without --fast) the reference tree is inferred with IQ-Tree, which is recommended for better accuracy. 
6 | # Example usage: ./prepare_dataset.sh --segments "HA,NA" segments.fasta HA myoutdir 7 | # Example with default segments: ./prepare_dataset.sh segments.fasta HA myoutdir 8 | 9 | # These are the default segment names 10 | declare -a segments=("PB2" "PB1" "PA" "HA" "NP" "NA" "MP" "NS") 11 | FAST=0 12 | 13 | POSITIONAL_ARGS=() 14 | 15 | while [[ $# -gt 0 ]]; do 16 | case $1 in 17 | --segments) 18 | SEGMENTS_STR="$2" 19 | segments=(${SEGMENTS_STR//,/ }) 20 | shift # past argument 21 | shift # past value 22 | ;; 23 | --fast) 24 | FAST=1 25 | shift 26 | ;; 27 | -*|--*) 28 | echo "Unrecognized option $1" 29 | exit 1 30 | ;; 31 | *) 32 | POSITIONAL_ARGS+=("$1") # save positional arg 33 | shift # past argument 34 | ;; 35 | esac 36 | done 37 | 38 | set -- "${POSITIONAL_ARGS[@]}" 39 | 40 | # Required arguments: 41 | main_fasta="$1" # Provide a path to a fasta file with all segments 42 | ref_seg="$2" # Name of the segment to use as the reference (typically - HA) 43 | outdir="$3" # Path to the directory to store the results 44 | 45 | rm -r $outdir # Clear out the directory 46 | mkdir $outdir # Re-create the directory 47 | 48 | name=${main_fasta##*/} 49 | 50 | # Split out the segments and align them 51 | for seg in "${segments[@]}" 52 | do 53 | cat $main_fasta | smof grep "|${seg}|" > ${outdir}/${seg}-${name} 54 | echo "Aligning ${seg}..." 55 | mafft --thread 6 ${outdir}/${seg}-${name} | sed "s/|${seg}|/|/g"> ${outdir}/${seg}-${name}.aln 56 | rm ${outdir}/${seg}-${name} 57 | done 58 | 59 | if [ $FAST -eq 0 ]; then 60 | # Build fasttree trees in parallel for non-reference segments 61 | echo "Building non-reference trees in parallel with FastTree..." 62 | for seg in "${segments[@]}" 63 | do 64 | if [ $seg != $ref_seg ]; then 65 | fasttree -nt -gtr -gamma ${outdir}/${seg}-${name}.aln > ${outdir}/${seg}-${name}.tre & 66 | fi 67 | done 68 | wait # Wait to finish. 69 | 70 | # Build an IQ-Tree tree for the reference segment. 
We use the GTR+F+R5 model by default, which can be changed if needed. 71 | echo "Building the reference tree with IQ-Tree..." 72 | iqtree2 -s ${outdir}/${ref_seg}-${name}.aln -T 6 --prefix "${outdir}/${ref_seg}-${name}" -m GTR+F+R5 73 | mv ${outdir}/${ref_seg}-${name}.treefile ${outdir}/${ref_seg}-${name}.tre 74 | else 75 | # Build all trees with FastTree in parallel. 76 | echo "Building trees in parallel with FastTree..." 77 | for seg in "${segments[@]}" 78 | do 79 | fasttree -nt -gtr -gamma ${outdir}/${seg}-${name}.aln > ${outdir}/${seg}-${name}.tre & 80 | done 81 | wait # Wait to finish. 82 | fi 83 | 84 | # Root the trees with a custom rooting script (in parallel) 85 | echo "Rooting trees with TreeTime..." 86 | for seg in "${segments[@]}" 87 | do 88 | python treetime-root.py ${outdir}/${seg}-${name}.tre ${outdir}/${seg}-${name}.aln & 89 | done 90 | wait 91 | 92 | # Create a descriptor file 93 | descriptor=${outdir}/descriptor.csv 94 | for seg in "${segments[@]}" 95 | do 96 | if [ $seg == $ref_seg ]; then 97 | echo -n "*" >> $descriptor 98 | fi 99 | echo "${seg},${seg}-${name}.aln,${seg}-${name}.aln.rooted.tre" >> $descriptor 100 | done 101 | echo "The descriptor file was written to ${descriptor}" 102 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | from treesort.version import __version__ 4 | 5 | with open("README.md", "r") as fh: 6 | long_description = fh.read() 7 | 8 | setup( 9 | install_requires=[ 10 | 'scipy>=1.7.0', 11 | 'biopython>=1.67', 12 | 'dendropy>=4.5.0', 13 | 'phylo-treetime>=0.9.4', 14 | 'matplotlib' 15 | ], 16 | name="TreeSort", 17 | version=__version__, 18 | author="Alexey Markin", 19 | author_email="alex.markin57@gmail.com", 20 | license='MIT', 21 | description="Virus reassortment inference software. " 
22 | "Infers both recent and ancestral reassortment and uses flexible molecular clock constraints.", 23 | long_description=long_description, 24 | long_description_content_type="text/markdown", 25 | url="https://github.com/flu-crew/TreeSort", 26 | packages=["treesort"], 27 | classifiers=[ 28 | "Programming Language :: Python :: 3", 29 | "Programming Language :: Python :: 3.6", 30 | "Programming Language :: Python :: 3.7", 31 | "Programming Language :: Python :: 3.8", 32 | "Programming Language :: Python :: 3.9", 33 | "Programming Language :: Python :: 3.10", 34 | "Programming Language :: Python :: 3.11", 35 | "Programming Language :: Python :: 3.12", 36 | "Topic :: Scientific/Engineering :: Bio-Informatics", 37 | "License :: OSI Approved :: MIT License", 38 | "Operating System :: OS Independent", 39 | ], 40 | entry_points={"console_scripts": ["treesort=treesort.cli:run_treesort_cli"]}, 41 | py_modules=["treesort"], 42 | ) 43 | -------------------------------------------------------------------------------- /treesort.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | from treesort.cli import run_treesort_cli 4 | 5 | 6 | if __name__ == '__main__': 7 | run_treesort_cli() 8 | -------------------------------------------------------------------------------- /treesort/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | -------------------------------------------------------------------------------- /treesort/cli.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import math 3 | import sys 4 | from typing import List, Optional, Dict, Tuple, Set 5 | import re 6 | import os 7 | 8 | from Bio import SeqIO 9 | from Bio.SeqRecord import SeqRecord 10 | from dendropy import Tree, Node, Edge, Taxon 11 | 12 | from treesort import options, helpers 13 | 
from treesort.helpers import binarize_tree, collapse_zero_branches 14 | from treesort.tree_indexer import TreeIndexer 15 | from treesort.reassortment_utils import compute_rea_rate_simple, compute_rea_rate_binary_mle 16 | from treesort.reassortment_inference import REA_FIELD, ReassortmentDetector 17 | from treesort.jc_reassortment_test import JCReassortmentTester 18 | 19 | ADD_UNCERTAIN = True # For local method only. 20 | RESOLVE_GREEDY = True # Whether to use the greedy multifurcation resolution algorithm 21 | MIN_TAXA_THRESHOLD = 10 22 | 23 | 24 | def extract_join_regex(label: str, join_on_regex: str, print_error=True) -> Optional[str]: 25 | """ 26 | Extracts the portion of the strain label captured by 'join_on_regex'. 27 | """ 28 | re_search = re.search(join_on_regex, label) 29 | if re_search and re_search.group(0): 30 | return re_search.group(0) 31 | else: 32 | if print_error: 33 | print(f'Cannot match pattern {join_on_regex} to {label}. Skipping this strain.') 34 | return None 35 | 36 | 37 | def is_taxon_in_tree(label: str, tree: Tree, join_on_regex=None) -> bool: 38 | if join_on_regex: 39 | label = extract_join_regex(label, join_on_regex) 40 | tree_labels = {extract_join_regex(leaf.taxon.label, join_on_regex) for leaf in tree.leaf_nodes()} 41 | else: 42 | tree_labels = {leaf.taxon.label for leaf in tree.leaf_nodes()} 43 | if label: 44 | return label in tree_labels 45 | else: 46 | return False 47 | 48 | 49 | def get_aln_labels(aln: Dict[str, SeqRecord], join_on_regex=None) -> Set[str]: 50 | """ 51 | Get a set of strain labels from the alignment 52 | (returns the label portion matched by 'join_on_regex' only, if specified). 
53 | """ 54 | if join_on_regex: 55 | return {extract_join_regex(strain, join_on_regex) for strain in aln.keys()} 56 | else: 57 | return set(aln.keys()) 58 | 59 | 60 | def is_taxon_in_aln(label: str, aln_labels: Set[str], join_on_regex=None) -> bool: 61 | if join_on_regex: 62 | label = extract_join_regex(label, join_on_regex) 63 | if label: 64 | return label in aln_labels 65 | else: 66 | return False 67 | 68 | 69 | def find_common_taxa(aln_by_seg: List[Dict[str, SeqRecord]], ref_segment_i: int, join_on_regex=None) -> List[str]: 70 | # Find taxa in common. 71 | aln_labels_by_seg = [get_aln_labels(aln, join_on_regex) for aln in aln_by_seg] 72 | common_taxa = [extract_join_regex(strain, join_on_regex) if join_on_regex else strain 73 | for strain in aln_by_seg[ref_segment_i] if 74 | all([is_taxon_in_aln(strain, aln_labels, join_on_regex) for aln_labels in aln_labels_by_seg])] 75 | return common_taxa 76 | 77 | 78 | def prune_tree_to_taxa(tree: Tree, common_taxa: List[str], join_on_regex=None) -> Optional[Dict[str, str]]: 79 | """ 80 | Prune the tree and rename the taxa according to 'join_on_regex' regex, if provided. 81 | Returns a dictionary that maps the new names to the old names (if subs were made). 82 | """ 83 | name_map: Optional[Dict[str, str]] = None 84 | if join_on_regex: 85 | # Need to rename all taxa first. 86 | name_map = {} 87 | new_taxa = set() 88 | taxon: Taxon 89 | for taxon in tree.taxon_namespace: 90 | new_label = extract_join_regex(taxon.label, join_on_regex, print_error=False) 91 | if new_label: 92 | if new_label not in new_taxa: 93 | name_map[new_label] = taxon.label 94 | taxon.label = new_label 95 | new_taxa.add(new_label) 96 | else: 97 | print(f'REPEATED strain {new_label} - discarding the copy') 98 | 99 | # Prune the tree. 
100 | tree.retain_taxa_with_labels(common_taxa) 101 | return name_map 102 | 103 | 104 | def prune_and_update_alignments(aln_by_seg: List[Dict[str, SeqRecord]], segments: List[Tuple[str, str, str, float]], 105 | common_taxa: List[str], outdir: str, join_on_regex=None) -> List[Dict[str, SeqRecord]]: 106 | if not join_on_regex: 107 | # Don't need to do anything. 108 | return aln_by_seg 109 | else: 110 | upd_aln_by_seg: List[Dict[str, SeqRecord]] = [] 111 | for i, seg in enumerate(segments): 112 | aln_map = aln_by_seg[i] 113 | new_aln: List[SeqRecord] = [] 114 | upd_aln_map: Dict[str, SeqRecord] = {} 115 | added_labels_upper = set() 116 | upd_aln_by_seg.append(upd_aln_map) 117 | new_aln_path = os.path.join(outdir, f'{seg[0]}_unified.aln') 118 | for label in aln_map: 119 | new_label = extract_join_regex(label, join_on_regex, print_error=False) 120 | if new_label and new_label in common_taxa: 121 | if new_label.upper() in added_labels_upper: 122 | # print(f'REPEATED strain {new_label} in segment {seg[0]} - discarding') 123 | continue # do not add to the alignment 124 | else: 125 | record = aln_map[label] 126 | record.id = record.name = new_label 127 | record.description = '' 128 | new_aln.append(record) 129 | upd_aln_map[new_label] = record 130 | added_labels_upper.add(new_label.upper()) 131 | SeqIO.write(new_aln, new_aln_path, 'fasta') 132 | segments[i] = (seg[0], new_aln_path, seg[2], seg[3]) 133 | return upd_aln_by_seg 134 | 135 | 136 | def run_treesort_cli(): 137 | # Each segment has format (name, aln_path, tree_path, rate) 138 | sys.setrecursionlimit(100000) 139 | segments: List[Tuple[str, str, str, float]] # name, aln path, tree path, rate. 
140 | descriptor_name, outdir, segments, ref_segment_i, output_path, clades_out_path, pval_threshold, allowed_deviation, \ 141 | method, collapse_branches, join_on_regex, args = options.parse_args() 142 | ref_tree_path = segments[ref_segment_i][2] 143 | tree: Tree = Tree.get(path=ref_tree_path, schema='newick', preserve_underscores=True) 144 | ref_seg = segments[ref_segment_i] 145 | 146 | if args.timetree: 147 | # reduce the deviation rate. 148 | if allowed_deviation > 1: 149 | allowed_deviation = math.sqrt(allowed_deviation) 150 | 151 | if collapse_branches: 152 | collapse_zero_branches(tree, 1e-7) 153 | 154 | # Parse the alignments into a list of dictionaries. 155 | aln_by_seg: List[Dict[str, SeqRecord]] = [] 156 | for i, seg in enumerate(segments): 157 | seg_list = list(SeqIO.parse(seg[1], format='fasta')) 158 | seg_aln = {seq.id: seq for seq in seg_list} 159 | aln_by_seg.append(seg_aln) 160 | 161 | # Find taxa in common and prune trees/alignments if needed. 162 | common_taxa = find_common_taxa(aln_by_seg, ref_segment_i, join_on_regex) 163 | name_map: Optional[Dict[str, str]] = None # New to old label names map (if subs were made) 164 | if len(common_taxa) >= MIN_TAXA_THRESHOLD: 165 | print(f'Found {len(common_taxa)} strains in common across the alignments.') 166 | name_map = prune_tree_to_taxa(tree, common_taxa, join_on_regex) 167 | aln_by_seg = prune_and_update_alignments(aln_by_seg, segments, common_taxa, outdir, join_on_regex) 168 | else: 169 | # Print an error and exit. 170 | print(f'Found {len(common_taxa)} strains in common across the segment alignments - insufficient for a reassortment analysis.') 171 | exit(-1) 172 | 173 | if RESOLVE_GREEDY: 174 | print('Optimally resolving the multifurcations to minimize reassortment...') 175 | tree.suppress_unifurcations() # remove the unifurcations. 176 | # Compute the averaged sub rate. 
177 | total_rate = 0 178 | total_sites = 0 179 | for i, seg in enumerate(segments): 180 | if i == ref_segment_i: 181 | continue 182 | aln_len = len(next(iter(aln_by_seg[i].values())).seq) 183 | total_rate += seg[3] * aln_len 184 | total_sites += aln_len 185 | overall_rate = total_rate / total_sites 186 | # print('Concatenated rate:', overall_rate) 187 | 188 | # Concatenate non-ref alignments. 189 | concatenated_seqs = [] 190 | taxa = [leaf.taxon.label for leaf in tree.leaf_nodes()] 191 | for taxon in taxa: 192 | # Find this taxon across all (non-reference) segments and concatenate aligned sequences. 193 | concat_seq = '' 194 | for i, seg in enumerate(segments): 195 | if i == ref_segment_i: 196 | continue 197 | concat_seq += aln_by_seg[i][taxon].seq 198 | concatenated_seqs.append(SeqRecord(concat_seq, id=taxon, name=taxon, description='')) 199 | concat_path = os.path.join(outdir, descriptor_name + '.concatenated.fasta') 200 | SeqIO.write(concatenated_seqs, concat_path, 'fasta') 201 | 202 | # Binarize the tree 203 | reassortment_tester = JCReassortmentTester(total_sites, overall_rate / ref_seg[3], pval_threshold, allowed_deviation) 204 | rea_detector = ReassortmentDetector(tree, concat_path, 'concatenated', reassortment_tester) 205 | rea_detector.binarize_tree_greedy() 206 | # tree = rea_detector.tree # use the binarized tree 207 | else: 208 | binarize_tree(tree) # simple binarization, where polytomies are resolved as caterpillars. 
209 | 210 | tree_indexer = TreeIndexer(tree.taxon_namespace) 211 | tree_indexer.index_tree(tree) 212 | 213 | for i, seg in enumerate(segments): 214 | if i == ref_segment_i: 215 | continue 216 | 217 | print(f'Inferring reassortment with the {seg[0]} segment...') 218 | # seg2_aln = list(SeqIO.parse(seg[1], format='fasta')) 219 | seg2_len = len(next(iter(aln_by_seg[i].values())).seq) 220 | rate_ratio = seg[3] / ref_seg[3] 221 | reassortment_tester = JCReassortmentTester(seg2_len, rate_ratio, pval_threshold, allowed_deviation) 222 | rea_detector = ReassortmentDetector(tree, seg[1], seg[0], reassortment_tester) 223 | 224 | if method == 'MINCUT': 225 | rea_detector.infer_reassortment_mincut() 226 | else: 227 | rea_detector.infer_reassortment_local(pval_threshold, add_uncertain=ADD_UNCERTAIN) 228 | 229 | clades_out = None 230 | reported_rea = set() 231 | if clades_out_path: 232 | clades_out = open(clades_out_path, 'w') 233 | 234 | node: Node 235 | for node in tree.postorder_node_iter(): 236 | if node.is_internal(): 237 | node.label = f'TS_NODE_{node.index}' # Add node labels to the output tree. 238 | 239 | annotation = ','.join(getattr(node.edge, REA_FIELD, [])) 240 | if annotation: 241 | node.edge.annotations.add_new('rea', f'"{annotation}"') 242 | node.edge.annotations.add_new('is_reassorted', '1') 243 | 244 | if clades_out: 245 | # Report reassortment associated with the clade. 246 | clade = ';'.join(sorted([leaf.taxon.label for leaf in node.leaf_nodes()])) 247 | sister_clade = ';'.join(sorted([leaf.taxon.label for leaf in node.sister_nodes()[0].leaf_nodes()])) 248 | reassorted_genes = [g_str[:g_str.find('(')] for g_str in annotation.split(',')] 249 | # Drop out already reported ?-genes: we want to report ?-genes only once. 
250 | # TODO: ideally report ?-genes with the more likely clade 251 | report_genes = [gene for gene in reassorted_genes if (sister_clade, gene) not in reported_rea] 252 | for rea_gene in report_genes: 253 | if rea_gene.startswith('?'): 254 | reported_rea.add((clade, rea_gene)) # mark the ?-genes that we report here 255 | report_genes_str = ';'.join(report_genes) 256 | if report_genes_str: 257 | clades_out.write(f'{clade},{report_genes_str}' 258 | # if there are ?-genes - add alternative clade (sister clade) 259 | f'{"," + sister_clade if report_genes_str.count("?") > 0 else ","}\n') 260 | else: 261 | node.edge.annotations.add_new('is_reassorted', '0') 262 | 263 | if clades_out: 264 | clades_out.close() 265 | 266 | if ref_seg[3] < 1 or args.timetree: # Do not estimate the reassortment rate if --equal-rates was given. 267 | rea_rate_1 = compute_rea_rate_binary_mle(tree, ref_seg[3], 268 | ref_seg_len=len(next(iter(aln_by_seg[ref_segment_i].values())).seq)) 269 | rea_rate_2 = compute_rea_rate_simple(tree, ref_seg[3], ignore_top_edges=1) 270 | print(f'\nEstimated reassortment rate per ancestral lineage per year.') 271 | if rea_rate_1: 272 | print(f'\tBernoulli estimate: {round(rea_rate_1, 6)}') 273 | print(f'\tPoisson estimate (for dense datasets): {round(rea_rate_2, 6)}') 274 | 275 | if name_map: 276 | # Substitute the labels back. 
277 | for leaf in tree.leaf_node_iter(): 278 | if leaf.taxon: 279 | leaf.taxon.label = name_map[leaf.taxon.label] 280 | 281 | tree.write_to_path(output_path, schema='nexus') 282 | # tree.write_to_path(output_path + 'phylo.xml', schema='phyloxml') 283 | print(f'Saved the annotated tree file to {output_path}') 284 | -------------------------------------------------------------------------------- /treesort/helpers.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from datetime import datetime 3 | import re 4 | 5 | from Bio import SeqIO 6 | from dendropy import Tree, Node 7 | from typing import Union, List 8 | 9 | 10 | def get_median(l: List[float]) -> float: 11 | l = sorted(l) 12 | l_size = len(l) 13 | if l_size % 2 == 1: 14 | return l[l_size // 2] 15 | else: 16 | return (l[l_size // 2 - 1] + l[l_size // 2]) / 2 17 | 18 | 19 | def compute_sampling_density(tree: Union[str, Tree]) -> float: 20 | """ 21 | Reports the median edge length. 
22 | """ 23 | if isinstance(tree, str): 24 | tree: Tree = Tree.get(path=tree, schema='newick', preserve_underscores=True) 25 | elif not isinstance(tree, Tree): 26 | raise ValueError('"tree" should be either a path to a newick tree or a dendropy Tree object.') 27 | edge_lengths = [node.edge_length for node in tree.postorder_node_iter() if node.edge_length is not None]  # the seed (root) edge has no length 28 | return get_median(edge_lengths) 29 | 30 | 31 | def collapse_zero_branches(tree: Tree, threshold=1e-7): 32 | tree.collapse_unweighted_edges(threshold) 33 | 34 | 35 | def parse_dates(aln_path: str): 36 | records = SeqIO.parse(aln_path, 'fasta') 37 | dates = {} 38 | for record in records: 39 | name = record.name 40 | # date_str = name.split('|')[-1] 41 | date = None 42 | for token in name.split('|'): 43 | if re.fullmatch(r'[\d\-/]{4,}', token) and not re.fullmatch(r'\d{5,}', token): 44 | if token.count('/') == 2: 45 | date = datetime.strptime(token, '%m/%d/%Y') 46 | elif token.count('/') == 1: 47 | date = datetime.strptime(token, '%m/%Y') 48 | elif token.count('-') == 2: 49 | date = datetime.strptime(token, '%Y-%m-%d') 50 | elif token.count('-') == 1: 51 | date = datetime.strptime(token, '%Y-%m') 52 | else: 53 | date = datetime.strptime(token, '%Y') 54 | if not date: 55 | # print(f'No date for {record.id}') 56 | # TODO: log with low priority level 57 | pass 58 | else: 59 | dec_date = date.year + ((date.month - 1) * 30 + date.day) / 365.0 60 | dates[name] = dec_date 61 | return dates 62 | 63 | 64 | # For a binary tree only 65 | def sibling_distance(parent_node: Node) -> float: 66 | return parent_node.child_nodes()[0].edge_length + parent_node.child_nodes()[1].edge_length 67 | 68 | 69 | # Siblings specified. 
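The decimal-date formula used in `parse_dates` above (year plus elapsed days over 365, with every month approximated as 30 days) can be sketched in isolation. A minimal standalone illustration — `to_decimal_date` is a hypothetical helper that restates the `dec_date` expression and is not a function defined in helpers.py:

```python
from datetime import datetime


def to_decimal_date(date: datetime) -> float:
    # Same expression as in parse_dates: months are approximated as 30 days,
    # so the result can drift a few days from the exact day-of-year fraction.
    return date.year + ((date.month - 1) * 30 + date.day) / 365.0


d = datetime.strptime('2021-07-01', '%Y-%m-%d')
print(round(to_decimal_date(d), 3))  # -> 2021.496
```

The 30-day-month approximation is accurate enough for molecular-clock regression, where dating noise of a few days is negligible.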
70 | def sibling_distance_n2(sib1: Node, sib2: Node) -> float: 71 | assert sib1.parent_node is sib2.parent_node 72 | return sib1.edge_length + sib2.edge_length 73 | 74 | 75 | def aunt_distance(node: Node) -> float: 76 | # We assume that the tree is binary 77 | assert node.parent_node and node.parent_node.parent_node 78 | parent: Node = node.parent_node 79 | aunt: Node = parent.sibling_nodes()[0] 80 | return node.edge_length + parent.edge_length + aunt.edge_length 81 | 82 | 83 | def node_distance(node1: Node, node2: Node) -> float: 84 | """ 85 | Linear-time algorithm to find a distance between two nodes on the same tree. 86 | Note: with constant-time LCA computation, one can compute distance in constant time. 87 | """ 88 | node1_depth = get_node_depth(node1) 89 | node2_depth = get_node_depth(node2) 90 | distance = 0 91 | p1, p2 = node1, node2 92 | if node1_depth > node2_depth: 93 | for step in range(node1_depth - node2_depth): 94 | distance += p1.edge_length 95 | p1 = p1.parent_node 96 | elif node2_depth > node1_depth: 97 | for step in range(node2_depth - node1_depth): 98 | distance += p2.edge_length 99 | p2 = p2.parent_node 100 | 101 | while p1 != p2: 102 | distance += p1.edge_length 103 | distance += p2.edge_length 104 | p1 = p1.parent_node 105 | p2 = p2.parent_node 106 | return distance 107 | 108 | 109 | def node_distance_w_lca(node1: Node, node2: Node, lca: Node) -> float: 110 | distance = 0 111 | while node1 is not None and node1 is not lca: 112 | distance += node1.edge_length 113 | node1 = node1.parent_node 114 | 115 | while node2 is not None and node2 is not lca: 116 | distance += node2.edge_length 117 | node2 = node2.parent_node 118 | 119 | assert node1 and node2 120 | return distance 121 | 122 | 123 | def get_node_depth(node: Node) -> int: 124 | depth = 0 125 | p: Node = node.parent_node 126 | while p: 127 | depth += 1 128 | p = p.parent_node 129 | return depth 130 | 131 | 132 | def binarize_tree(tree: Tree, edge_length=0): 133 | """ 134 | Adds/removes 
nodes from the tree to make it fully binary (added edges will have length 'edge_length') 135 | :param tree: Dendropy tree to be made bifurcating. 136 | """ 137 | 138 | # First suppress unifurcations. 139 | tree.suppress_unifurcations() 140 | 141 | # Now binarize multifurcations. 142 | node: Node 143 | for node in tree.postorder_node_iter(): 144 | if node.child_nodes() and len(node.child_nodes()) > 2: 145 | num_children = len(node.child_nodes()) 146 | children = node.child_nodes() 147 | interim_node = node 148 | # Creates a caterpillar structure with children on the left of the trunk: 149 | for child_ind in range(len(children) - 2): 150 | new_node = Node(edge_length=edge_length) 151 | interim_node.set_child_nodes([children[child_ind], new_node]) 152 | interim_node = new_node 153 | interim_node.set_child_nodes(children[num_children - 2:]) 154 | -------------------------------------------------------------------------------- /treesort/jc_reassortment_test.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from scipy.stats import binomtest 3 | import math 4 | from typing import Tuple 5 | 6 | 7 | def jc_pvalue(subs: int, sites: int, ml_distance: float, rate_ratio=1.0, allowed_deviation=1.5): 8 | """ 9 | We assume the Jukes-Cantor substitution model and test whether the observed number of substitutions was likely 10 | to come from the observed time interval (ml_distance). 
The method assumes a strict molecular clock (up to the allowed deviation). 11 | :param subs: Number of observed substitutions in the second gene segment 12 | :param sites: Number of sites in the second gene segment 13 | :param ml_distance: Expected number of substitutions per site in the first gene segment 14 | :param rate_ratio: Ratio in global substitution rates between the second and first segments 15 | :param allowed_deviation: Should be >=1: allowed deviation from the strict molecular clock in each segment 16 | :return: the p-value of observing the number of subs over the ml_distance edge. 17 | """ 18 | if ml_distance < 1 / sites: 19 | ml_distance = 1 / sites 20 | max_deviation = allowed_deviation * allowed_deviation 21 | sub_probability = 0.75 - 0.75 * (math.exp(-(4 * ml_distance * rate_ratio * max_deviation) / 3)) 22 | pvalue = binomtest(subs, sites, p=sub_probability, alternative='greater').pvalue 23 | # if pvalue < 0.001: 24 | #     print(subs, sites, ml_distance, sub_probability, pvalue) 25 | return pvalue 26 | 27 | 28 | class JCReassortmentTester(object): 29 | 30 | def __init__(self, sites: int, rate_ratio: float, pvalue_threshold: float, allowed_deviation: float): 31 | self.sites = sites 32 | self.rate_ratio = rate_ratio 33 | self.pvalue_threshold = pvalue_threshold 34 | self.allowed_deviation = allowed_deviation 35 | 36 | def is_reassorted(self, subs: int, ml_distance: float) -> Tuple[bool, float]: 37 | pvalue = jc_pvalue(subs, self.sites, ml_distance, self.rate_ratio, self.allowed_deviation) 38 | return (pvalue < self.pvalue_threshold), pvalue -------------------------------------------------------------------------------- /treesort/lm_outlier_detector.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import numpy as np 3 | from sklearn.linear_model import LinearRegression 4 | 5 | 6 | # This is an earlier
idea of using linear regression outliers for reassortment detection. 7 | class LMOutlierDetector(object): 8 | trained_reg: LinearRegression 9 | iqd: float 10 | q3: float 11 | 12 | def __init__(self, sibling_dists_s1: np.ndarray, sibling_dists_s2: np.ndarray): 13 | assert len(sibling_dists_s1) >= 10 14 | self.sibling_dists_s1 = sibling_dists_s1 15 | self.sibling_dists_s2 = sibling_dists_s2 16 | 17 | reg: LinearRegression = LinearRegression(fit_intercept=True).fit( 18 | sibling_dists_s1.reshape(-1, 1), sibling_dists_s2.reshape(-1, 1)) 19 | residuals: np.ndarray = sibling_dists_s2 - reg.predict(sibling_dists_s1.reshape(-1, 1)).reshape(-1) 20 | residuals.sort() 21 | print(residuals) 22 | q1 = residuals[round(len(residuals) / 4) - 1] 23 | q3 = residuals[round(len(residuals) * 3 / 4) - 1] 24 | self.iqd = q3 - q1 25 | self.trained_reg = reg 26 | self.q3 = q3 27 | print(f'IQD {self.iqd}, Q1 {q1}, Q3 {self.q3}') 28 | 29 | def get_residual(self, x: float, y: float) -> float: 30 | residual = y - self.trained_reg.predict(np.array([[x]], dtype=float)) 31 | return residual[0, 0] 32 | 33 | def is_outlier(self, x: float, y: float, iqd_mult=2) -> bool: 34 | residual = self.get_residual(x, y) 35 | return residual >= self.q3 + self.iqd * iqd_mult 36 | -------------------------------------------------------------------------------- /treesort/options.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import argparse 3 | from argparse import RawDescriptionHelpFormatter 4 | import random 5 | import os 6 | from typing import Tuple 7 | 8 | import matplotlib 9 | 10 | matplotlib.use('Agg') 11 | from matplotlib import pyplot as plt 12 | 13 | from treetime import TreeTime, TreeTimeError, utils 14 | from dendropy import Tree 15 | 16 | from treesort.helpers import parse_dates 17 | 18 | DEFAULT_PVALUE_THRESHOLD = 0.001 19 | DEFAULT_DEVIATION = 2 20 | DEFAULT_METHOD = 'LOCAL' 21 | STRAIN_NAME_REGEX = r'([ABCD](/[^/\|]+){3,5})' 22 | 
EPI_ISL_REGEX = r'(EPI_ISL_\d+)' 23 | 24 | # Program interface: 25 | parser = argparse.ArgumentParser(description='TreeSort: fast and effective reassortment detection in ' 26 | 'segmented RNA viruses (primarily influenza)', 27 | formatter_class=RawDescriptionHelpFormatter) 28 | parser._optionals.title = "Arguments" 29 | parser.add_argument('-i', type=str, action='store', dest='descriptor', 30 | help='Path to the descriptor file. The descriptor file provides paths to the alignments and ' 31 | 'phylogenetic trees for different virus segments (see examples/)', required=True) 32 | parser.add_argument('-o', type=str, action='store', dest='output', 33 | help='Path to the output file (the tree will be saved in nexus format)', required=True) 34 | parser.add_argument('-m', type=str, action='store', dest='method', 35 | help='Methods are "local" (default) or "mincut". The mincut method always determines the most ' 36 | 'parsimonious reassortment placement even in ambiguous circumstances.', required=False) 37 | parser.add_argument('--dev', type=float, action='store', dest='deviation', 38 | help='Maximum deviation from the estimated substitution rate within each segment. ' 39 | 'The default is 2 - the substitution rate on a particular tree branch is allowed to be ' 40 | 'twice as high or twice as low as the estimated rate. ' 41 | 'The default value was estimated from the empirical influenza A data', required=False) 42 | parser.add_argument('--pvalue', type=float, action='store', dest='pvalue', 43 | help='The cutoff p-value for the reassortment tests: the default is 0.001 (0.1 percent). ' 44 | 'You may want to decrease or increase this parameter depending on how stringent you want ' 45 | 'the analysis to be', required=False) 46 | parser.add_argument('--match-on-strain', action='store_true', dest='match_strain', 47 | help='Match the names (deflines in fastas) across the segments based on the strain name. ' 48 | 'E.g., "A/Texas/32/2021" or "A/swine/A0229832/Iowa/2021". 
Works for flu A, B, C, and D. ' 49 | 'This way no preprocessing is needed to standardize the names before the analysis.') 50 | parser.add_argument('--match-on-epi', action='store_true', dest='match_epi', 51 | help='Similar to "--match-on-strain", but here segments are matched based on the "EPI_ISL_XXX" ' 52 | 'field (if present in deflines)') 53 | parser.add_argument('--match-on-regex', action='store', dest='match_regex', 54 | help='Provide your own custom regex to match the segments across the alignments.', 55 | required=False) 56 | parser.add_argument('--no-collapse', action='store_true', dest='no_collapse', 57 | help='Do not collapse near-zero length branches into multifurcations ' 58 | '(by default, TreeSort collapses all branches shorter than 1e-7 and then optimizes ' 59 | 'the multifurcations).') 60 | parser.add_argument('--equal-rates', action='store_true', dest='equal_rates', 61 | help='Do not estimate molecular clock rates for different segments: assume equal rates. ' 62 | 'Ignored if --timetree is specified', 63 | required=False) 64 | parser.add_argument('--clades', type=str, action='store', dest='clades_path', 65 | help='Path to an output file where clades with evidence of reassortment will be saved', 66 | required=False) 67 | parser.add_argument('--timetree', action='store_true', dest='timetree', 68 | help='Indicates that the reference tree is time-scaled (e.g., through TreeTime)') 69 | 70 | 71 | def make_outdir(descriptor_path: str) -> Tuple[str, str]: 72 | descriptor_path = descriptor_path.split(os.path.sep)[-1] 73 | if descriptor_path.count('.') > 0: 74 | descriptor_name = '.'.join(descriptor_path.split('.')[:-1]) 75 | else: 76 | descriptor_name = descriptor_path 77 | i = 1 78 | outdir = f'treesort-{descriptor_name}-{i}' 79 | while os.path.exists(outdir) and i <= 50: 80 | i += 1 81 | outdir = f'treesort-{descriptor_name}-{i}' 82 | if not os.path.exists(outdir): 83 | os.mkdir(outdir) 84 | return outdir, descriptor_name 85 | 86 | 87 | def
estimate_clock_rate(segment: str, tree_path: str, aln_path: str, plot=False, outdir='.') -> (float, float): 88 | # This code was adapted from the 'estimate_clock_model' method in the treetime/wrappers.py. 89 | # print(f"\tExecuting TreeTime on segment {segment}...") 90 | tree: Tree = Tree.get(path=tree_path, schema='newick', preserve_underscores=True) 91 | if len(tree.leaf_nodes()) > 1000: 92 | # TODO: implement Bio.Phylo tree subsampling to avoid creating temporary files 93 | # Downsample the tree for rate estimation 94 | tree_path = os.path.join(outdir, os.path.split(tree_path)[-1] + '.sample1k.tre') 95 | taxa_labels = [t.label for t in tree.taxon_namespace] 96 | random.shuffle(taxa_labels) 97 | subtree: Tree = tree.extract_tree_with_taxa_labels(taxa_labels[:1000]) 98 | subtree.write(path=tree_path, schema='newick') 99 | 100 | dates = parse_dates(aln_path) 101 | try: 102 | timetree = TreeTime(dates=dates, tree=tree_path, aln=aln_path, gtr='JC69', verbose=-1) # TODO: JC->GTR? 103 | except TreeTimeError as e: 104 | parser.error(f"TreeTime exception on the input files {tree_path} and {aln_path}: {e}\n " 105 | f"Please make sure that the specified alignments and trees are correct.") 106 | timetree.clock_filter(n_iqd=3, reroot='least-squares') 107 | timetree.use_covariation = True 108 | timetree.reroot() 109 | timetree.get_clock_model(covariation=True) 110 | r_val = timetree.clock_model['r_val'] 111 | if plot: 112 | timetree.plot_root_to_tip() 113 | plt.savefig(f'{outdir}/{segment}-treetime-clock.pdf') 114 | d2d = utils.DateConversion.from_regression(timetree.clock_model) 115 | clock_rate = round(d2d.clock_rate, 7) 116 | print(f"\t{segment} estimated molecular clock rate: {clock_rate} (R^2 = {round(r_val, 3)})") 117 | return d2d.clock_rate, r_val 118 | 119 | 120 | # Currently requiring a tree for all segments 121 | def parse_descriptor(path: str, outdir: str, estimate_rates=True, timetree=False): 122 | segments = [] 123 | ref_segment = -1 124 | with open(path) as 
descriptor: 125 | for line in descriptor: 126 | line = line.strip('\n').strip() 127 | if line: 128 | tokens = [token.strip() for token in line.split(',')] 129 | if len(tokens) != 3: 130 | parser.error(f'The descriptor file should have 3 columns: {line}') 131 | else: 132 | seg_name, aln_path, tree_path = tokens 133 | if seg_name.startswith('*'): 134 | ref_segment = len(segments) 135 | seg_name = seg_name[1:] 136 | # seg_rate = estimate_clock_rate(seg_name, tree_path, aln_path) 137 | segments.append((seg_name, aln_path, tree_path, 1)) 138 | if len(segments) <= 1: 139 | parser.error('The descriptor should specify at least two segments.') 140 | if ref_segment < 0: 141 | parser.error('The descriptor should specify one of the segments as a reference segment, e.g., like "*HA".') 142 | 143 | print(f'Read {len(segments)} segments: {", ".join([seg[0] for seg in segments])}') 144 | if estimate_rates: 145 | print('Estimating molecular clock rates for each segment (TreeTime)...') 146 | for i, seg in enumerate(segments): 147 | if timetree and i == ref_segment: 148 | # If the reference tree is a timetree = use 1 for the rate. 
149 | print(f"\t{seg[0]} (time-tree), rate 1") 150 | else: 151 | seg_name, aln_path, tree_path, _ = seg 152 | seg_rate, r_val = estimate_clock_rate(seg_name, tree_path, aln_path, plot=True, outdir=outdir) 153 | segments[i] = (seg_name, aln_path, tree_path, seg_rate) 154 | return segments, ref_segment 155 | 156 | 157 | def parse_args(): 158 | args = parser.parse_args() 159 | 160 | # Validate PVALUE 161 | pval = DEFAULT_PVALUE_THRESHOLD 162 | if args.pvalue is not None: 163 | pval = args.pvalue 164 | if pval >= 0.5 or pval <= 1e-9: 165 | parser.error('the PVALUE cutoff must be greater than 1e-9 and less than 0.5 (50%).') 166 | print(f'P-value threshold for significance was set to {pval}') 167 | 168 | # Validate DEVIATION 169 | deviation = DEFAULT_DEVIATION 170 | if args.deviation: 171 | deviation = args.deviation 172 | if deviation > 10 or deviation < 1: 173 | parser.error('DEVIATION has to be in the [1, 10] interval.') 174 | 175 | # Validate METHOD 176 | method = DEFAULT_METHOD 177 | if args.method: 178 | if args.method.lower() in ('local', 'mincut'):  # accept the method name case-insensitively 179 | method = args.method.upper() 180 | else: 181 | parser.error(f'Unknown method "{args.method}". 
Known methods are "local" or "mincut".') 182 | 183 | # Check for a MATCH regex 184 | match_regex = None 185 | if args.match_strain: 186 | match_regex = STRAIN_NAME_REGEX 187 | if args.match_epi: 188 | match_regex = EPI_ISL_REGEX 189 | if args.match_regex: 190 | match_regex = args.match_regex 191 | if match_regex: 192 | print(f'Using the "{match_regex}" REGEX to match segments across trees and alignments.') 193 | 194 | collapse_branches = not args.no_collapse 195 | 196 | outdir, descriptor_name = make_outdir(args.descriptor) 197 | estimate_rates = args.timetree or (not args.equal_rates) 198 | segments, ref_segment = parse_descriptor(args.descriptor, outdir, estimate_rates, args.timetree) 199 | 200 | return descriptor_name, outdir, segments, ref_segment, args.output, args.clades_path, pval, deviation, method, \ 201 | collapse_branches, match_regex, args -------------------------------------------------------------------------------- /treesort/parsimony.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from typing import Tuple 3 | 4 | import numpy as np 5 | from dendropy import Tree, Node, DnaCharacterMatrix 6 | from dendropy.model import parsimony 7 | import random as rnd 8 | 9 | # character_sets_annotation = 'character_sets' 10 | 11 | 12 | def compute_parsimony_edge_lengths(tree: Tree, aln_path: str) -> np.ndarray: 13 | """ 14 | Compute the parsimony score of the tree given an alignment and find associated edge-lengths. 15 | :param tree: Tree topology to be scored by parsimony. Must be BINARY. 16 | :param aln_path: DNA alignment for the tips of the tree 17 | :return: An array, indexed by node index, with the number of parsimony substitutions on each edge (except the root's). 
18 | """ 19 | tree_copy: Tree = tree.clone() 20 | taxon_characters: DnaCharacterMatrix = DnaCharacterMatrix.get_from_path(aln_path, schema='fasta', 21 | taxon_namespace=tree_copy.taxon_namespace) 22 | taxon_states = taxon_characters.taxon_state_sets_map(gaps_as_missing=True) 23 | p_score = parsimony.fitch_down_pass(tree_copy.postorder_node_iter(), taxon_state_sets_map=taxon_states) 24 | print(p_score) 25 | 26 | edge_lengths = np.zeros(len(tree.nodes()), dtype=int) 27 | p_score_2 = 0 28 | node: Node 29 | for node in tree_copy.preorder_node_iter(): 30 | edge_len = 0 31 | parent: Node = node.parent_node 32 | for site in range(taxon_characters.sequence_size): 33 | if parent: 34 | parent_state = parent.state_sets[site] 35 | if parent_state in node.state_sets[site]: 36 | node.state_sets[site] = parent_state 37 | continue 38 | else: 39 | edge_len += 1 40 | p_score_2 += 1 41 | # choose a random state and assign 42 | state_sets = list(node.state_sets[site]) 43 | rnd_state = rnd.choice(state_sets) 44 | node.state_sets[site] = rnd_state 45 | if parent: 46 | cluster = {leaf.taxon.label for leaf in node.leaf_nodes()} 47 | original_node = tree.find_node(filter_fn=lambda tree_node: {leaf.taxon.label for leaf in tree_node.leaf_nodes()} == cluster) 48 | edge_lengths[original_node.index] = edge_len 49 | print(p_score, p_score_2) 50 | return edge_lengths 51 | 52 | 53 | def compute_parsimony_sibling_dist(tree: Tree, aln_path: str, schema='fasta') -> Tuple[np.ndarray, np.ndarray, np.ndarray]: 54 | # tree must be binary! 
55 | tree_copy: Tree = tree.clone() 56 | taxon_characters: DnaCharacterMatrix = DnaCharacterMatrix.get_from_path(aln_path, schema=schema, 57 | taxon_namespace=tree_copy.taxon_namespace) 58 | taxon_states = taxon_characters.taxon_state_sets_map(gaps_as_missing=True) 59 | p_score = parsimony.fitch_down_pass(tree_copy.postorder_node_iter(), taxon_state_sets_map=taxon_states) 60 | # print(p_score) 61 | 62 | children_dists = np.zeros(len(tree.internal_nodes()), dtype=int) 63 | child1_to_sibling_dists, child2_to_sibling_dists = np.zeros(len(tree.internal_nodes()), dtype=int),\ 64 | np.zeros(len(tree.internal_nodes()), dtype=int) 65 | p_score_2 = 0 66 | node: Node 67 | for node in tree_copy.preorder_internal_node_iter(): 68 | children_dist = 0 69 | child1_to_sibling, child2_to_sibling = 0, 0 70 | sibling = node.sibling_nodes()[0] if node.parent_node else None 71 | if not sibling: 72 | child1_to_sibling, child2_to_sibling = -1, -1 73 | child1, child2 = node.child_nodes() 74 | for site in range(taxon_characters.sequence_size): 75 | if len(child1.state_sets[site].intersection(child2.state_sets[site])) == 0: 76 | children_dist += 1 77 | p_score_2 += 1 78 | if sibling and len(child1.state_sets[site].intersection(sibling.state_sets[site])) == 0: 79 | child1_to_sibling += 1 80 | if sibling and len(child2.state_sets[site].intersection(sibling.state_sets[site])) == 0: 81 | child2_to_sibling += 1 82 | # cluster = {leaf.taxon.label for leaf in node.leaf_nodes()} 83 | # original_node = tree.find_node(filter_fn=lambda tree_node: {leaf.taxon.label for leaf in tree_node.leaf_nodes()} == cluster) 84 | children_dists[node.index] = children_dist 85 | child1_to_sibling_dists[node.index] = child1_to_sibling 86 | child2_to_sibling_dists[node.index] = child2_to_sibling 87 | # print(p_score, p_score_2) 88 | # TODO: if p_score != p_score_2: log a warning (debug only) 89 | return children_dists, child1_to_sibling_dists, child2_to_sibling_dists 90 | 91 | 92 | def get_cluster_str(node: Node) -> 
str: 93 | return ';'.join(sorted([leaf.taxon.label for leaf in node.leaf_nodes()])) 94 | 95 | # 96 | # if __name__ == '__main__': 97 | # # seg1_tree_path = '../simulations/segs2/l1500/sim_250_10/sim_1.trueSeg1.tre' 98 | # # schema = 'nexus' 99 | # # seg1_path = '../simulations/segs2/l1500/sim_250_10/sim_1.seg1.alignment.fasta' 100 | # # seg2_path = '../simulations/segs2/l1500/sim_250_10/sim_1.seg2.alignment.fasta' 101 | # # simulated = True 102 | # seg1_tree_path = '../../gammas/HAs.fast.rooted.tre' 103 | # schema = 'newick' 104 | # seg1_path = '../../gammas/HAs_unique.aln' 105 | # seg2_path = '../../gammas/NAs_unique.aln' 106 | # simulated = False 107 | # na_ha_ratio = 1.057 108 | # 109 | # tree: Tree = Tree.get(path=seg1_tree_path, schema=schema, preserve_underscores=True) 110 | # binarize_tree(tree) # Randomly binarize. 111 | # tree_indexer = TreeIndexer(tree.taxon_namespace) 112 | # tree_indexer.index_tree(tree) 113 | # if simulated: 114 | # node: Node 115 | # for node in tree.nodes(): 116 | # if node.edge_length: 117 | # node.edge_length *= 0.00474 118 | # 119 | # # lengths_by_node_s1 = compute_parsimony_edge_lengths(tree, 'testdata/l1000_50_5/sim_1.seg4.alignment.fasta') 120 | # # lengths_by_node_s2 = compute_parsimony_edge_lengths(tree, 'testdata/l1000_50_5/sim_1.seg1.alignment.fasta') 121 | # child_dists_s1, child1_dists_s1, child2_dists_s1 = compute_parsimony_sibling_dist(tree, seg1_path) 122 | # child_dists_s2, child1_dists_s2, child2_dists_s2 = compute_parsimony_sibling_dist(tree, seg2_path) 123 | # seg2_aln = list(SeqIO.parse(seg2_path, format='fasta')) 124 | # seg2_len = len(seg2_aln[0]) 125 | # 126 | # node_by_index = {} 127 | # cluster_by_index = {} 128 | # for node in tree.postorder_node_iter(): 129 | # node_by_index[node.index] = node 130 | # cluster_by_index[node.index] = get_cluster_str(node) 131 | # 132 | # # s1_lengths = np.zeros(len(lengths_by_node_s1)) 133 | # # node: Node 134 | # # for node in tree.postorder_internal_node_iter(): 135 | # # 
cluster = [leaf.taxon.label for leaf in node.leaf_nodes()] 136 | # # # print(cluster, lengths_by_node_s1[node.index], lengths_by_node_s2[node.index]) 137 | # # print(node.index, sorted(cluster), 138 | # # f'c1 {node.child_nodes()[0].index}({child1_dists_s1[node.index]}, {child1_dists_s2[node.index]})', 139 | # # f'c2 {node.child_nodes()[1].index}({child2_dists_s1[node.index]}, {child2_dists_s2[node.index]})') 140 | # 141 | # outlier_detector = LMOutlierDetector(child_dists_s1, child_dists_s2) 142 | # outliers = [(ind, outlier_detector.get_residual(child_dists_s1[ind], child_dists_s2[ind])) 143 | # for ind in range(child_dists_s1.size) if 144 | # outlier_detector.is_outlier(child_dists_s1[ind], child_dists_s2[ind], iqd_mult=3)] 145 | # jc_outliers = [(node.index, -1) for node in tree.internal_nodes() if 146 | # is_jc_outlier(child_dists_s2[node.index], seg2_len, helpers.sibling_distance(node), 147 | # rate_ratio=na_ha_ratio, pvalue_threshold=0.001)] 148 | # jc_outlier_indices = [x[0] for x in jc_outliers] 149 | # outliers = sorted(outliers, key=lambda x: x[1], reverse=True) 150 | # print(len(outliers)) 151 | # print(len(jc_outliers)) 152 | # for outlier_ind, residual in jc_outliers: 153 | # outlier_node = node_by_index[outlier_ind] 154 | # is_c1_rea = outlier_detector.is_outlier(child1_dists_s1[outlier_ind], child1_dists_s2[outlier_ind], iqd_mult=1.7) 155 | # is_c2_rea = outlier_detector.is_outlier(child2_dists_s1[outlier_ind], child2_dists_s2[outlier_ind], iqd_mult=1.7) 156 | # print(outlier_ind, residual, is_c1_rea, cluster_by_index[outlier_node.child_nodes()[0].index], 157 | # is_c2_rea, cluster_by_index[outlier_node.child_nodes()[1].index]) 158 | # 159 | # jc_colors = ['red' if i in jc_outlier_indices else 'blue' for i in range(len(tree.internal_nodes()))] 160 | # plt.scatter(child_dists_s1, child_dists_s2, c=jc_colors) 161 | # for ind in range(len(child_dists_s1)): 162 | # plt.annotate(str(ind), (child_dists_s1[ind], child_dists_s2[ind] + 0.2)) 163 | # 
plt.show() 164 | -------------------------------------------------------------------------------- /treesort/reassortment_inference.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import math 3 | from dendropy import Tree, Node, Edge, DnaCharacterMatrix 4 | from dendropy.model import parsimony 5 | import random as rnd 6 | from typing import List, Set, Optional, Union, Tuple 7 | 8 | from treesort import helpers 9 | from treesort.jc_reassortment_test import JCReassortmentTester, jc_pvalue 10 | from treesort.helpers import node_distance_w_lca, sibling_distance_n2, binarize_tree 11 | from treesort.parsimony import compute_parsimony_sibling_dist 12 | from treesort.tree_indexer import TreeIndexer 13 | 14 | 15 | REA_FIELD = 'rea_events' # This field of Edge will store the list of inferred reassortment events. 16 | # PROPAGATED_FROM = 'propagated_from' # The field of Node that indicates whether the state sets were copied from a lower node. 17 | MINCUT_STATES = 'mincut_states' # This field of Node will store the list of 'MinCutState' 18 | 19 | 20 | class MinCutState(object): 21 | node: Node 22 | state_sets: List[Set] 23 | child_states: Optional[List['MinCutState']] 24 | pvalue: float 25 | 26 | def __init__(self, node: Node, state_sets: List[Set], child_states=None, pvalue=-1): 27 | self.node = node 28 | self.state_sets = state_sets 29 | self.child_states = child_states 30 | self.pvalue = pvalue 31 | 32 | @classmethod 33 | def merge_states(cls, node: Node, child_state1: 'MinCutState', child_state2: 'MinCutState', pvalue: float): 34 | state_sets = [] # Fitch parsimony step up from the two children state sets. 35 | for site in range(len(child_state1.state_sets)): # Perform Fitch-up. 
36 | intersection = child_state1.state_sets[site].intersection(child_state2.state_sets[site]) 37 | if len(intersection) == 0: 38 | union = child_state1.state_sets[site].union(child_state2.state_sets[site]) 39 | state_sets.append(union) 40 | else: 41 | state_sets.append(intersection) 42 | return cls(node, state_sets, [child_state1, child_state2], pvalue) 43 | 44 | 45 | class ReassortmentDetector(object): 46 | 47 | def __init__(self, tree: Tree, aln_path: str, segment: str, rea_tester: JCReassortmentTester, schema='fasta'): 48 | # tree_copy: Tree = tree.clone() # Process the copy of the original tree. 49 | for node in tree.postorder_node_iter(): # Remove previous dendropy Fitch-parsimony annotations 50 | if hasattr(node, 'state_sets'): 51 | delattr(node, 'state_sets') 52 | self.taxon_characters: DnaCharacterMatrix = DnaCharacterMatrix.get_from_path(aln_path, schema=schema, 53 | taxon_namespace=tree.taxon_namespace) 54 | taxon_states = self.taxon_characters.taxon_state_sets_map(gaps_as_missing=True) 55 | parsimony.fitch_down_pass(tree.postorder_node_iter(), 56 | taxon_state_sets_map=taxon_states) # Do the first Fitch parsimony pass (annotates the tree). 57 | self.tree = tree 58 | self.segment = segment 59 | self.aln_path = aln_path 60 | self.rea_tester = rea_tester 61 | 62 | def parsimony_distance(self, node1: Union[Node, MinCutState], node2: Union[Node, MinCutState]) -> int: 63 | parsimony_dist = 0 64 | for site in range(self.taxon_characters.sequence_size): 65 | if len(node1.state_sets[site].intersection(node2.state_sets[site])) == 0: 66 | parsimony_dist += 1 67 | return parsimony_dist 68 | 69 | def are_reassorted(self, node1: Union[Node, MinCutState], node2: Union[Node, MinCutState], lca: Node) -> Tuple[bool, float]: 70 | """ 71 | Returns whether the results of a statistical test for reassortment between the two nodes + the pvalue. 72 | """ 73 | # Compute the parsimony score between the two nodes. 
74 | parsimony_dist = self.parsimony_distance(node1, node2) 75 | # Compute the ML distance on the reference tree and test for reassortment. 76 | if isinstance(node1, MinCutState) and isinstance(node2, MinCutState): 77 | ml_distance = node_distance_w_lca(node1.node, node2.node, lca) 78 | else: 79 | ml_distance = node_distance_w_lca(node1, node2, lca) 80 | return self.rea_tester.is_reassorted(parsimony_dist, ml_distance) 81 | 82 | def merge_siblings(self, sib1: Node, sib2: Node) -> Node: 83 | parent: Node = sib1.parent_node 84 | parent.remove_child(sib1) 85 | parent.remove_child(sib2) 86 | 87 | new_node = Node(edge_length=0) 88 | new_node.set_child_nodes([sib1, sib2]) 89 | parent.add_child(new_node) 90 | return new_node 91 | 92 | def propagate_parsimony(self, node1: Node, node2: Node, lca: Node): 93 | lca_state_sets = [] 94 | for site in range(self.taxon_characters.sequence_size): 95 | intersection = node1.state_sets[site].intersection(node2.state_sets[site]) 96 | if len(intersection) == 0: 97 | union = node1.state_sets[site].union(node2.state_sets[site]) 98 | lca_state_sets.append(union) 99 | else: 100 | lca_state_sets.append(intersection) 101 | lca.state_sets = lca_state_sets 102 | 103 | def add_rea_annotation(self, edge: Edge, parsimony_dist: int, is_uncertain: bool): 104 | annotation = f'{"?" if is_uncertain else ""}{self.segment}({parsimony_dist})' 105 | rea_events = getattr(edge, REA_FIELD, []) 106 | rea_events.append(annotation) 107 | setattr(edge, REA_FIELD, rea_events) 108 | 109 | def binarize_tree_greedy(self): 110 | """ 111 | Greedily resolve multifurcations by grouping siblings into non-reassortant groups. 112 | """ 113 | multifurcations = 0 114 | node: Node 115 | for node in self.tree.postorder_node_iter(): 116 | if node.child_nodes() and len(node.child_nodes()) > 2: 117 | # Found a multifurcation. 
118 | multifurcations += 1 119 | # print('Multifurcation size:', len(node.child_nodes())) 120 | siblings = node.child_nodes() 121 | rnd.shuffle(siblings) # Randomly shuffle all the siblings. 122 | 123 | new_siblings = [siblings[0]] # We will group all the siblings into reassortment-free blocks greedily. 124 | for sibling in siblings[1:]: 125 | placed = False 126 | for i, new_sibling in enumerate(new_siblings): 127 | reassorted, pvalue = self.are_reassorted(new_sibling, sibling, node) 128 | if not reassorted: 129 | merged_node = self.merge_siblings(new_sibling, sibling) 130 | self.propagate_parsimony(new_sibling, sibling, merged_node) 131 | new_siblings[i] = merged_node 132 | placed = True 133 | break 134 | if not placed: 135 | new_siblings.append(sibling) 136 | 137 | # for sibling in node.child_nodes(): # Remove all the children from "node". 138 | # node.remove_child(sibling) 139 | 140 | if len(new_siblings) == 1: 141 | # No reassortment: split out the top and add as children to the node. 142 | merged_node = new_siblings[0] 143 | node.state_sets = merged_node.state_sets 144 | node.set_child_nodes(merged_node.child_nodes()) 145 | else: 146 | # Merge reassortant blocks in a caterpillar structure in the reverse order of their size 147 | new_siblings.sort(key=lambda new_sib: len(new_sib.leaf_nodes()), reverse=True) 148 | while len(new_siblings) > 2: 149 | new1, new2 = new_siblings.pop(), new_siblings.pop() 150 | merged_node = self.merge_siblings(new1, new2) 151 | self.propagate_parsimony(new1, new2, merged_node) 152 | new_siblings.append(merged_node) 153 | node.set_child_nodes(new_siblings) 154 | self.propagate_parsimony(new_siblings[0], new_siblings[1], node) 155 | print('Multifurcations resolved:', multifurcations) 156 | 157 | def infer_reassortment_mincut(self) -> int: 158 | """ 159 | MinCut algorithm that annotates the branches of the tree with the inferred reassortment events. 
160 | This algorithm cuts the tree into the smallest number of reassortment-free subtrees possible. 161 | :return: The number of inferred reassortment events. 162 | """ 163 | if not TreeIndexer.is_indexed(self.tree): 164 | tree_indexer = TreeIndexer(self.tree.taxon_namespace) 165 | tree_indexer.index_tree(self.tree) 166 | node: Node # every node will get a 'mincut_states' list. 167 | for node in self.tree.leaf_nodes(): 168 | setattr(node, MINCUT_STATES, [MinCutState(node, node.state_sets)]) # Initialize a new mincutstate for the leaf. 169 | for node in self.tree.postorder_internal_node_iter(): 170 | child1, child2 = node.child_nodes() 171 | compatible_pairs: List[Tuple[MinCutState, MinCutState, float]] = [] # (left_state, right_state, pvalue) 172 | for left_state in getattr(child1, MINCUT_STATES): 173 | best_match: Tuple[MinCutState, float] = (None, math.inf) # right_state and pvalue. 174 | for right_state in getattr(child2, MINCUT_STATES): 175 | reassorted, pvalue = self.are_reassorted(left_state, right_state, node) 176 | if not reassorted and abs(0.5 - pvalue) < abs(0.5 - best_match[1]): # check if pvalue is closer to median. 177 | best_match = (right_state, pvalue) 178 | if best_match[0]: 179 | compatible_pairs.append((left_state, best_match[0], best_match[1])) 180 | if compatible_pairs: 181 | mincut_states = [] 182 | for left_state, right_state, pvalue in compatible_pairs: 183 | # Merge the pairs. 184 | mincut_states.append(MinCutState.merge_states(node, left_state, right_state, pvalue)) 185 | else: 186 | # Get the union of mincut_states of children. 
187 | mincut_states: List = getattr(child1, MINCUT_STATES).copy() 188 | mincut_states.extend(getattr(child2, MINCUT_STATES)) 189 | setattr(node, MINCUT_STATES, mincut_states) 190 | rea_events = self._backtrack_mincut() 191 | print(f'\tInferred reassortment events with {self.segment}: {rea_events}.') 192 | return rea_events 193 | 194 | def _backtrack_mincut(self) -> int: 195 | """ 196 | A backtracking subroutine for 'infer_reassortment_mincut'. 197 | Works top-down and identifies branches with reassortment. 198 | :return: The number of identified reassortment events. 199 | """ 200 | rea_events = 0 201 | root: Node = self.tree.seed_node 202 | node: Node 203 | for node in self.tree.preorder_node_iter(): 204 | if node == root: 205 | # Choose the option with the largest number of leaf nodes (breaking ties by pvalue closeness to 0.5) as the most likely ancestral state. 206 | root_states: List[MinCutState] = getattr(node, MINCUT_STATES) 207 | sorted_states = sorted(root_states, key=lambda state: (len(state.node.leaf_nodes()), -abs(0.5 - state.pvalue)), reverse=True) 208 | major_state = sorted_states[0] 209 | setattr(node, MINCUT_STATES, major_state) 210 | else: 211 | parent_state: MinCutState = getattr(node.parent_node, MINCUT_STATES) 212 | node_states: List[MinCutState] = getattr(node, MINCUT_STATES) 213 | if node_states[0].node is node: 214 | node_states.sort(key=lambda state: abs(0.5 - state.pvalue)) # Sort the states by pvalue closeness to 0.5 215 | # Find the node's states that agree with the parent assignment: 216 | relevant_states = [state for state in node_states if state in parent_state.child_states] 217 | if relevant_states: 218 | # Assign the "relevant" state whose pvalue is closest to 0.5. 219 | best_state = relevant_states[0] 220 | else: 221 | # Cut off the edge (add a rea annotation) and assign the state whose pvalue is closest to 0.5.
222 | rea_events += 1 223 | best_state = node_states[0] 224 | pars_dist = self.parsimony_distance(parent_state, best_state) 225 | self.add_rea_annotation(node.edge, pars_dist, is_uncertain=False) 226 | setattr(node, MINCUT_STATES, best_state) 227 | else: 228 | # Just propagate the parent state below. 229 | setattr(node, MINCUT_STATES, parent_state) 230 | return rea_events 231 | 232 | def infer_reassortment_local(self, pval_threshold: float, add_uncertain=True): 233 | """ 234 | The first (local) implementation, where the reassortment placement is determined by the aunt node. 235 | If the reassortment placement is unclear, both branches get marked as potential reassortment events. 236 | """ 237 | if not TreeIndexer.is_indexed(self.tree): 238 | tree_indexer = TreeIndexer(self.tree.taxon_namespace) 239 | tree_indexer.index_tree(self.tree) 240 | child_dists_s2, child1_dists_s2, child2_dists_s2 = compute_parsimony_sibling_dist(self.tree, self.aln_path) 241 | node_by_index = {} 242 | node: Node 243 | for node in self.tree.postorder_node_iter(): 244 | node_by_index[node.index] = node 245 | 246 | pvalues = [(node.index, self.rea_tester.is_reassorted(child_dists_s2[node.index], helpers.sibling_distance(node))[1]) 247 | for node in self.tree.internal_nodes()] 248 | jc_outliers = [(index, pvalue) for index, pvalue in pvalues if pvalue < pval_threshold] 249 | jc_outlier_indices = [x[0] for x in jc_outliers] 250 | # print(len(jc_outliers)) 251 | total_rea, certain_rea = 0, 0 252 | for outlier_ind, pvalue in jc_outliers: 253 | outlier_node: Node = node_by_index[outlier_ind] 254 | specific_edge: Node = None 255 | c1, c2 = outlier_node.child_nodes() 256 | annotation = f'{self.segment}({child_dists_s2[outlier_ind]})' 257 | if outlier_node != self.tree.seed_node: 258 | c1_pvalue = self.rea_tester.is_reassorted(child1_dists_s2[outlier_ind], helpers.aunt_distance(c1))[1] 259 | c2_pvalue = self.rea_tester.is_reassorted(child2_dists_s2[outlier_ind], helpers.aunt_distance(c2))[1] 260 | 
c1_outlier = c1_pvalue < pval_threshold 261 | c2_outlier = c2_pvalue < pval_threshold 262 | if (not c1_outlier) and (not c2_outlier): 263 | # print('Neither', child_dists_s2[outlier_ind], helpers.sibling_distance(outlier_node)) 264 | continue 265 | if c1_outlier and c2_outlier: 266 | # print('Both', child_dists_s2[outlier_ind], helpers.sibling_distance(outlier_node)) 267 | pass 268 | if c1_outlier ^ c2_outlier: 269 | specific_edge = c1 if c1_outlier else c2 270 | total_rea += 1 271 | if specific_edge: 272 | certain_rea += 1 273 | edge: Edge = specific_edge.edge 274 | self.add_rea_annotation(edge, child_dists_s2[outlier_ind], is_uncertain=False) 275 | elif add_uncertain: 276 | for edge in [c1.edge, c2.edge]: 277 | self.add_rea_annotation(edge, child_dists_s2[outlier_ind], is_uncertain=True) 278 | print(f'\tInferred reassortment events with {self.segment}: {total_rea}.\n' 279 | f'\tIdentified exact branches for {certain_rea}/{total_rea} of them') 280 | -------------------------------------------------------------------------------- /treesort/reassortment_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import math 3 | import random as rnd 4 | from typing import List, Optional 5 | from dendropy import Tree, Node, Edge 6 | import numpy as np 7 | from scipy.optimize import minimize, LinearConstraint 8 | import warnings 9 | 10 | from treesort.tree_indexer import TreeIndexer 11 | from treesort.helpers import sibling_distance 12 | 13 | 14 | def compute_rea_rate_simple(annotated_tree: Tree, evol_rate: float, ignore_top_edges=1) -> float: 15 | """ 16 | A simpler way to compute the reassortment rate: the number of detected events divided by the total length of the tree (in years). 17 | :param ignore_top_edges: the longest x percent of edges will not be counted.
18 | """ 19 | edge_cutoff = math.inf 20 | if ignore_top_edges > 0: 21 | edge_lengths = sorted([node.edge_length for node in annotated_tree.postorder_node_iter() if node.edge_length]) 22 | top_percentile = int(round(len(edge_lengths) * (1.0 - ignore_top_edges / 100))) 23 | edge_cutoff = edge_lengths[top_percentile] 24 | 25 | # Compute the number of reassortment events detected (a ?-only edge counts as 0.5). 26 | rea_events = 0 27 | for node in annotated_tree.postorder_node_iter(): 28 | if node is annotated_tree.seed_node: 29 | continue # Skip the root edge 30 | edge: Edge = node.edge 31 | if node.edge_length and node.edge_length >= edge_cutoff: 32 | continue # Skip the edge if it's in the top percentile. 33 | if edge.annotations.get_value('is_reassorted', '0') == '1': 34 | rea_annotation = edge.annotations.get_value('rea').strip('"') 35 | is_uncertain = all([g_str.startswith('?') for g_str in rea_annotation.split(',')]) # Is this 100% uncertain reassortment? 36 | if not is_uncertain: 37 | rea_events += 1 38 | else: 39 | rea_events += 0.5 40 | 41 | # Compute the total tree length (phylogenetic diversity) 42 | tree_length = 0 43 | for node in annotated_tree.postorder_node_iter(): 44 | if node is not annotated_tree.seed_node: 45 | if node.edge_length and node.edge_length >= edge_cutoff: 46 | continue # Skip the edge if it's in the top percentile.
47 | tree_length += node.edge_length 48 | 49 | # print(f'{rea_events}, {tree_length}, {evol_rate}') 50 | rea_rate_per_lineage_per_year = (rea_events / tree_length * evol_rate) if tree_length > 0 else 0.0 51 | return rea_rate_per_lineage_per_year 52 | 53 | 54 | def likelihood_binary(x, rea_events, edge_lengths): 55 | func = 0 56 | if x < 1e-10: 57 | return np.inf 58 | for i in range(len(rea_events)): 59 | if edge_lengths[i] > 0: 60 | if rea_events[i] > 0: 61 | with warnings.catch_warnings(): 62 | warnings.filterwarnings('error') 63 | try: 64 | func -= np.log(1 - np.exp(-1 * x * edge_lengths[i])) 65 | except Warning: 66 | # print(x) 67 | func += np.inf 68 | else: 69 | func -= (-1 * x * edge_lengths[i]) 70 | elif rea_events[i] > 0: 71 | # print('+1') 72 | pass 73 | return func 74 | 75 | 76 | def compute_rea_rate_binary_mle(annotated_tree: Tree, evol_rate: float, ref_seg_len=1700) -> Optional[float]: 77 | rea_events = [] # reassortment events per branches (1 - at least one event, 0 - no events). 78 | edge_lengths = [] # Corresponding branch lengths (the two arrays are coupled). 79 | processed_uncertain = set() 80 | node: Node 81 | for node in annotated_tree.postorder_node_iter(): 82 | if node is annotated_tree.seed_node: 83 | continue # Skip the root edge 84 | is_uncertain = False 85 | edge: Edge = node.edge 86 | if edge.annotations.get_value('is_reassorted', '0') == '1': 87 | rea_annotation = edge.annotations.get_value('rea').strip('"') 88 | is_uncertain = all( 89 | [g_str.startswith('?') for g_str in rea_annotation.split(',')]) # Is this 100% uncertain reassortment? 90 | if not is_uncertain: # Uncertain branches are handled below. 91 | rea_events.append(1) 92 | else: 93 | rea_events.append(0) 94 | 95 | edge_length = node.edge_length 96 | if is_uncertain: 97 | # check if the sister edge was already processed. 
98 | siblings = node.parent_node.child_nodes() 99 | sibling = siblings[0] if siblings[0] is not node else siblings[1] 100 | if sibling not in processed_uncertain: 101 | # log the event over the two sister branches 102 | rea_events.append(1) 103 | edge_length = sibling_distance(node.parent_node) 104 | processed_uncertain.add(node) 105 | else: 106 | continue # Skip if already processed. 107 | 108 | if edge_length > 1e-7: 109 | edge_lengths.append(edge_length / evol_rate) 110 | elif rea_events[-1] > 0: 111 | # If reassortment happened on too short of an edge, this can mess up the likelihood function. 112 | # Replace branch length with (1 / ref_seg_len), e.g., 1 / 1700 for HA (1 substitution). 113 | edge_lengths.append((1 / ref_seg_len) / evol_rate) 114 | else: 115 | edge_lengths.append(0) 116 | 117 | # print(len(rea_events), len(edge_lengths)) 118 | est = compute_rea_rate_simple(annotated_tree, evol_rate, ignore_top_edges=1) 119 | np_est = np.array([est]) 120 | linear_constraint = LinearConstraint([[1]], [0]) 121 | num_est = minimize(likelihood_binary, np_est, args=(rea_events, edge_lengths), tol=1e-9, 122 | constraints=[linear_constraint]) 123 | if num_est.success: 124 | return num_est.x[0] 125 | else: 126 | return None 127 | -------------------------------------------------------------------------------- /treesort/tree_indexer.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from dendropy import TaxonNamespace, Tree 3 | 4 | 5 | class InvalidArgumentError(Exception): 6 | def __init__(self, name: str, value: str, message=''): 7 | super(InvalidArgumentError, self).__init__('Invalid argument %s: %s; %s' % (name, value, message)) 8 | self.name = name 9 | self.value = value 10 | 11 | 12 | class TreeIndexer: 13 | """ 14 | Adds 'index' field to all nodes and taxa on the passed trees. 15 | It ensures that the taxa are indexed consistently across trees.
16 | """ 17 | 18 | def __init__(self, taxon_namespace: TaxonNamespace): 19 | self.label_mapping = {} 20 | index = 0 21 | for taxon in taxon_namespace: 22 | self.label_mapping[taxon.label] = index 23 | index += 1 24 | 25 | def index_tree(self, tree: Tree): 26 | for leaf_node in tree.leaf_nodes(): 27 | if leaf_node.taxon.label in self.label_mapping: 28 | leaf_node.taxon.index = self.label_mapping[leaf_node.taxon.label] 29 | else: 30 | print(leaf_node.taxon.label) 31 | print(self.label_mapping) 32 | raise InvalidArgumentError('tree', '', 'Input tree should be over the initially specified taxon set') 33 | 34 | node_id = 0 35 | for node in tree.postorder_internal_node_iter(): 36 | node.index = node_id 37 | # node.annotations.add_new('id', node_id) 38 | node_id += 1 39 | for node in tree.leaf_node_iter(): 40 | node.index = node_id 41 | node_id += 1 42 | tree.annotations.add_new('indexed', str(True)) 43 | 44 | @staticmethod 45 | def is_indexed(tree: Tree) -> bool: 46 | return tree.annotations.get_value('indexed', str(False)) == str(True) 47 | -------------------------------------------------------------------------------- /treesort/version.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.3.1' 2 | -------------------------------------------------------------------------------- /treetime-root.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | @author: Alexey Markin 4 | """ 5 | import sys 6 | import os 7 | import subprocess 8 | from Bio import SeqIO 9 | from datetime import datetime 10 | import re 11 | 12 | 13 | def extract_dates(path: str, format='%Y-%m-%d') -> str: 14 | records = SeqIO.parse(path, 'fasta') 15 | file_name = path + '.dates.csv' 16 | dates_file = open(file_name, 'w+') 17 | dates_file.write('name, date\n') 18 | for record in records: 19 | name = record.name 20 | # date_str = name.split('|')[-1] 21 | for token in name.split('|'): 22 | 
if re.fullmatch(r'[\d\-/]{4,}', token) and not re.fullmatch(r'\d{5,}', token): 23 | try: 24 | if token.count('/') == 2: 25 | try: 26 | date = datetime.strptime(token, '%m/%d/%Y') 27 | except ValueError as e: 28 | date = datetime.strptime(token, '%Y/%m/%d') 29 | elif token.count('/') == 1: 30 | date = datetime.strptime(token, '%m/%Y') 31 | elif token.count('-') == 2: 32 | date = datetime.strptime(token, '%Y-%m-%d') 33 | elif token.count('-') == 1: 34 | date = datetime.strptime(token, '%Y-%m') 35 | else: 36 | date = datetime.strptime(token, '%Y') 37 | except ValueError as e: 38 | print(f"Can't parse date {token} -- skipping") 39 | continue 40 | dec_date = date.year + ((date.month-1)*30 + date.day)/365.0 41 | dates_file.write('%s, %.2f\n' % (name, dec_date)) 42 | dates_file.close() 43 | return file_name 44 | 45 | 46 | def root_tree(tree_path: str, alignment_path: str) -> str: 47 | dates_file = extract_dates(alignment_path) 48 | rooted_tree = alignment_path + '.rooted.tre' 49 | treetime_dir = alignment_path + '.treetime' 50 | print(' '.join(['treetime', 'clock', '--tree', tree_path, 51 | '--aln', alignment_path, '--dates', dates_file, 52 | '--outdir', treetime_dir])) 53 | subprocess.call(['treetime', 'clock', '--tree', tree_path, 54 | '--aln', alignment_path, '--dates', dates_file, 55 | '--outdir', treetime_dir], stderr=subprocess.STDOUT) 56 | os.replace(treetime_dir + '/rerooted.newick', rooted_tree) 57 | return rooted_tree 58 | 59 | if __name__ == '__main__': 60 | args = sys.argv[1:] 61 | root_tree(args[0], args[1]) 62 | -------------------------------------------------------------------------------- /tutorial/figures/TreeSort-illustration.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/TreeSort-illustration.png -------------------------------------------------------------------------------- /tutorial/figures/TreeSort-logo-150.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/TreeSort-logo-150.png -------------------------------------------------------------------------------- /tutorial/figures/TreeSort-logo-300.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/TreeSort-logo-300.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/swH1-reassortment-ex1.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/swH1-reassortment-ex2.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/swH1-reassortment-ex3.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/figures/swH1-reassortment-ex4.png 
-------------------------------------------------------------------------------- /tutorial/swH1-parsed/HA-swine_H1_HANA.fasta.aln.treetime/outliers.tsv: -------------------------------------------------------------------------------- 1 | given_date apparent_date residual 2 | A/swine/Chile/YA026/2014|1B.2.2|LAIV-98|PPPPPP|2014-01-01 2014.0 1998.1559081042533 -6.609097869366056 3 | A/swine/Oklahoma/A02245006/2019|1A.2-3-like|LAIV-98|TTTPPT|2019-03-07 2019.18 2000.8638006608433 -7.640296132074614 4 | A/swine/Nebraska/A02479104/2020|1A.2-3-like|LAIV-98|LLLLPT|2020-02-25 2020.15 2000.7591344688262 -8.08857515536438 5 | A/swine/Oklahoma/A01785571/2018|1A.2-3-like|LAIV-Classical|LTLLPT|2018-06-12 2018.44 2000.594878879735 -7.443793734002908 6 | A/swine/Oklahoma/A01785573/2018|1A.2-3-like|LAIV-Classical|LTLLPT|2018-06-14 2018.45 2000.594878879735 -7.447965066707222 7 | A/swine/Minnesota/A02245535/2020|1A.2-3-like|LAIV-Classical|LLLPPT|2020-03-19 2020.22 2001.0826973297421 -7.98280564993242 8 | A/swine/Illinois/A02157797/2018|1A.2-3-like|LAIV-Classical|TTLTPT|2018-04-11 2018.28 2000.7569594747958 -7.309443202045201 9 | A/swine/Minnesota/A01785575/2018|1A.2-3-like|LAIV-98|TLLTPT|2018-06-19 2018.46 2000.5948654061 -7.452142019712851 10 | A/swine/Minnesota/A01785613/2018|1A.2-3-like|LAIV-98|TLLTPT|2018-09-10 2018.68 2001.2455868419358 -7.272473778540949 11 | A/swine/Minnesota/A01785608/2018|1A.2-3-like|LAIV-98|TLLTPT|2018-08-30 2018.66 2001.2463119231168 -7.263828657648307 12 | A/swine/Nebraska/A02245333/2019|1A.2-3-like|2002A|TLLPPT|2019-11-12 2019.85 2002.2278834603933 -7.350771103953527 13 | A/swine/Nebraska/A01378044/2019|1A.2-3-like|2002B|TLLPPT|2019-05-23 2019.39 2001.4092014353105 -7.50038931011847 14 | A/swine/Nebraska/A02245334/2019|1A.2-3-like|2002B|TLLPPT|2019-11-12 2019.85 2002.0736866884797 -7.415091707710629 15 | A/swine/Iowa/A02432384/2019|1A.2-3-like|LAIV-Classical|TTTPPT|2019-04-17 2019.29 2000.9194220723643 -7.662979250527693 16 | 
A/swine/Iowa/A02478477/2019|1A.2-3-like|LAIV-Classical|TTTPPT|2019-05-07 2019.35 2001.0811593392405 -7.620541251671756 17 | A/swine/Texas/A01104132/2019|1A.2-3-like|LAIV-Classical|LTTLPT|2019-05-30 2019.41 2000.9216749140646 -7.712095507752426 18 | A/swine/California/A02478680/2019|1A.2-3-like|LAIV-Classical|LLLLPT|2019-09-17 2019.7 2001.5726720603977 -7.561511587488909 19 | A/swine/Nebraska/A02157974/2018|1A.2-3-like|LAIV-Classical|TLLLLT|2018-04-18 2018.3 2000.9213206073623 -7.2492253706956555 20 | A/swine/Iowa/A02271349/2018|1A.2-3-like|pdm|LLPLPP|2018-12-04 2018.92 2000.5947231843954 -7.644082649512495 21 | A/swine/South_Dakota/A02156993/2018|1A.2-3-like|LAIV-Classical|TLLLPT|2018-03-22 2018.22 2000.5949088211464 -7.352011924950756 22 | A/swine/Illinois/A02431144/2019|1A.2-3-like|LAIV-Classical|TLLTPT|2019-02-22 2019.14 2000.7567446451687 -7.668267427194577 23 | A/swine/Texas/A01785906/2019|1A.2-3-like|LAIV-Classical|LLLLPP|2019-01-23 2019.06 2000.5947903030597 -7.70245330994396 24 | A/swine/North_Carolina/A02245875/2021|1A.2-3-like|LAIV-Classical|TTTTPT|2021-01-21 2021.06 2000.5947785760068 -8.536724742535252 25 | A/swine/North_Carolina/A02479173/2020|1A.2-3-like|pdm|PPPLPP|2020-03-25 2020.23 2000.4329637109724 -8.258002491938507 26 | A/swine/Nebraska/A02257618/2018|1A.2-3-like|LAIV-Classical|LLLLPT|2018-09-21 2018.72 2000.9193604429593 -7.425238994061569 27 | A/swine/Iowa/A02254795/2018|1A.2-3-like|LAIV-Classical|LLLLPT|2018-07-30 2018.58 2001.0815498251486 -7.299185748781743 28 | A/swine/Oklahoma/A02246973/2021|1A.2-3-like|LAIV-Classical|LLLLPT|2021-10-06 2021.76 2001.8977946193393 -8.285186688262133 29 | A/swine/Kansas/A02636184/2021|1A.2-3-like|LAIV-Classical|LLLLPT|2021-09-21 2021.72 2001.7358068426536 -8.336071848502439 30 | A/swine/Oklahoma/A02248197/2021|1A.2-3-like|LAIV-Classical|LLLLPT|2021-09-15 2021.7 2001.4074790556162 -8.46468562667179 31 | A/swine/Oklahoma/A02246915/2022|1A.2-3-like|LAIV-Classical|LLLLPT|2022-01-07 2022.02 2002.0565265170408 
-8.327428982963466 32 | A/swine/Colorado/A02636469/2022|1A.2-3-like|LAIV-Classical|LLLLPT|2022-02-04 2022.09 2001.731696136373 -8.492125870914018 33 | A/swine/Missouri/A01932424/2017|1A.3.2|classicalSwine|PPPPPP|2017-02-22 2017.14 2027.5609154662764 4.346910549256599 34 | -------------------------------------------------------------------------------- /tutorial/swH1-parsed/HA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/swH1-parsed/HA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf -------------------------------------------------------------------------------- /tutorial/swH1-parsed/NA-swine_H1_HANA.fasta.aln.treetime/outliers.tsv: -------------------------------------------------------------------------------- 1 | given_date apparent_date residual 2 | A/swine/Oklahoma/A02245577/2020|1A.3.3.2|LAIV-98|TLLPPT|2020-03-24 2020.23 1994.2189896759946 -4.114862948315013 3 | A/swine/Nebraska/A01378047/2021|1B.2.2.2|LAIV-98|LLLLPT|2021-01-28 2021.08 1994.4302255610737 -4.215913494082321 4 | A/swine/Iowa/A02478443/2019|1A.3.3.2|LAIV-98|LLLLPP|2019-04-26 2019.32 1993.7862012468288 -4.039369525073382 5 | A/swine/Nebraska/A02479104/2020|1A.2-3-like|LAIV-98|LLLLPT|2020-02-25 2020.15 1994.4343152730062 -4.068143334517119 6 | A/swine/Iowa/A02525361/2021|1B.2.2.1|LAIV-98|TLLTPT|2021-04-09 2021.27 1995.965872599665 -4.003036213590898 7 | A/swine/Indiana/A02636016/2021|1A.1.1|LAIV-98|TTTPPT|2021-08-02 2021.58 1993.7861821241984 -4.396897696196047 8 | A/swine/Iowa/A02524534/2020|1B.2.2.2|LAIV-98|LLLPPT|2020-08-12 2020.61 1994.8706124236035 -4.071893053407719 9 | A/swine/Iowa/A02271349/2018|1A.2-3-like|pdm|LLPLPP|2018-12-04 2018.92 1990.468299250318 -4.500972771648383 10 | -------------------------------------------------------------------------------- 
/tutorial/swH1-parsed/NA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/78db9e0d514c136ccdd95b1889d17628b03d5103/tutorial/swH1-parsed/NA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf -------------------------------------------------------------------------------- /tutorial/swH1-parsed/descriptor.csv: -------------------------------------------------------------------------------- 1 | *HA,HA-swine_H1_HANA.fasta.aln,HA-swine_H1_HANA.fasta.aln.rooted.tre 2 | NA,NA-swine_H1_HANA.fasta.aln,NA-swine_H1_HANA.fasta.aln.rooted.tre 3 | --------------------------------------------------------------------------------
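The descriptor.csv above lists one segment per row (segment name, alignment path, rooted-tree path), with a leading `*` marking the reference segment (HA here). As a rough illustration of how such a file can be read — the `SegmentSpec` and `parse_descriptor` names below are hypothetical and not part of TreeSort's actual loader:

```python
from typing import List, NamedTuple


class SegmentSpec(NamedTuple):
    name: str        # segment name, e.g. 'HA'
    aln_path: str    # path to the segment alignment
    tree_path: str   # path to the rooted segment tree
    is_reference: bool  # True if the row was marked with '*'


def parse_descriptor(lines: List[str]) -> List[SegmentSpec]:
    """Parse descriptor rows of the form [*]segment,alignment,tree."""
    specs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        name, aln, tree = [field.strip() for field in line.split(',')]
        is_ref = name.startswith('*')
        specs.append(SegmentSpec(name.lstrip('*'), aln, tree, is_ref))
    return specs


rows = [
    '*HA,HA-swine_H1_HANA.fasta.aln,HA-swine_H1_HANA.fasta.aln.rooted.tre',
    'NA,NA-swine_H1_HANA.fasta.aln,NA-swine_H1_HANA.fasta.aln.rooted.tre',
]
specs = parse_descriptor(rows)
print([(s.name, s.is_reference) for s in specs])  # [('HA', True), ('NA', False)]
```

The reference segment provides the backbone tree onto which reassortment with the other segments is annotated, which is why exactly one row carries the `*` marker.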