├── treesort ├── __init__.py ├── version.py ├── lm_outlier_detector.py ├── tree_indexer.py ├── jc_reassortment_test.py ├── helpers.py ├── reassortment_utils.py ├── parsimony.py ├── options.py ├── cli.py └── reassortment_inference.py ├── conda-requirements.txt ├── tutorial ├── figures │ ├── TreeSort-logo-150.png │ ├── TreeSort-logo-300.png │ ├── TreeSort-illustration.png │ ├── swH1-reassortment-ex1.png │ ├── swH1-reassortment-ex2.png │ ├── swH1-reassortment-ex3.png │ └── swH1-reassortment-ex4.png └── swH1-parsed │ ├── descriptor.csv │ ├── HA-swine_H1_HANA.fasta.aln.treetime │ ├── root_to_tip_regression.pdf │ └── outliers.tsv │ └── NA-swine_H1_HANA.fasta.aln.treetime │ ├── root_to_tip_regression.pdf │ └── outliers.tsv ├── treesort.py ├── examples └── descriptor-huH1N1-wgs.csv ├── .gitignore ├── LICENSE ├── setup.py ├── treetime-root.py ├── prepare_dataset.sh └── README.md /treesort/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | -------------------------------------------------------------------------------- /treesort/version.py: -------------------------------------------------------------------------------- 1 | __version__ = '0.3.1' 2 | -------------------------------------------------------------------------------- /conda-requirements.txt: -------------------------------------------------------------------------------- 1 | fasttree 2 | iqtree 3 | mafft 4 | pip 5 | smof 6 | -------------------------------------------------------------------------------- /tutorial/figures/TreeSort-logo-150.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/TreeSort-logo-150.png -------------------------------------------------------------------------------- /tutorial/figures/TreeSort-logo-300.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/TreeSort-logo-300.png -------------------------------------------------------------------------------- /tutorial/figures/TreeSort-illustration.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/TreeSort-illustration.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/swH1-reassortment-ex1.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/swH1-reassortment-ex2.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/swH1-reassortment-ex3.png -------------------------------------------------------------------------------- /tutorial/figures/swH1-reassortment-ex4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/figures/swH1-reassortment-ex4.png -------------------------------------------------------------------------------- /treesort.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | from treesort.cli import run_treesort_cli 4 | 5 | 6 | if __name__ == '__main__': 7 | run_treesort_cli() 8 | 
-------------------------------------------------------------------------------- /tutorial/swH1-parsed/descriptor.csv: -------------------------------------------------------------------------------- 1 | *HA,HA-swine_H1_HANA.fasta.aln,HA-swine_H1_HANA.fasta.aln.rooted.tre 2 | NA,NA-swine_H1_HANA.fasta.aln,NA-swine_H1_HANA.fasta.aln.rooted.tre 3 | -------------------------------------------------------------------------------- /tutorial/swH1-parsed/HA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/swH1-parsed/HA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf -------------------------------------------------------------------------------- /tutorial/swH1-parsed/NA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flu-crew/TreeSort/HEAD/tutorial/swH1-parsed/NA-swine_H1_HANA.fasta.aln.treetime/root_to_tip_regression.pdf -------------------------------------------------------------------------------- /examples/descriptor-huH1N1-wgs.csv: -------------------------------------------------------------------------------- 1 | PB2, ../huH1N1/USA/PB2.final.aln, ../huH1N1/USA/PB2.fasttree.tre 2 | PB1, ../huH1N1/USA/PB1.final.aln, ../huH1N1/USA/PB1.fasttree.tre 3 | PA, ../huH1N1/USA/PA.final.aln, ../huH1N1/USA/PA.fasttree.tre 4 | *HA, ../huH1N1/USA/HA.final.aln, ../huH1N1/USA/HA.final.aln.rooted.tre 5 | NP, ../huH1N1/USA/NP.final.aln, ../huH1N1/USA/NP.fasttree.tre 6 | NA, ../huH1N1/USA/NA.final.aln, ../huH1N1/USA/NA.fasttree.tre 7 | MP, ../huH1N1/USA/MP.final.aln, ../huH1N1/USA/MP.fasttree.tre 8 | NS, ../huH1N1/USA/NS.final.aln, ../huH1N1/USA/NS.fasttree.tre 9 | -------------------------------------------------------------------------------- /.gitignore: 
-------------------------------------------------------------------------------- 1 | # Phylogenetic/fasta test files. 2 | *.tre 3 | *.nexus 4 | *.aln 5 | *.fasta 6 | *.fna 7 | 8 | # TreeTime output 9 | treetime* 10 | 11 | # General 12 | *.pdf 13 | *.ppt 14 | *.pptx 15 | *.zip 16 | *.csv 17 | 18 | 19 | # Compiled Python bytecode and related files 20 | *.py[cod] 21 | dist/ 22 | build/ 23 | *.egg-info/ 24 | __pycache__/ 25 | 26 | # Log files 27 | *.log 28 | 29 | # JetBrains IDE 30 | .idea/ 31 | 32 | # Unit test reports 33 | TEST*.xml 34 | .pytest* 35 | 36 | # Generated by MacOS 37 | .DS_Store 38 | 39 | # Python virtual environment 40 | venv/ 41 | 42 | # Simulation files 43 | sims/ 44 | 45 | # Other 46 | testfiles/ 47 | misc/ 48 | -------------------------------------------------------------------------------- /tutorial/swH1-parsed/NA-swine_H1_HANA.fasta.aln.treetime/outliers.tsv: -------------------------------------------------------------------------------- 1 | given_date apparent_date residual 2 | A/swine/Oklahoma/A02245577/2020|1A.3.3.2|LAIV-98|TLLPPT|2020-03-24 2020.23 1994.2189896759946 -4.114862948315013 3 | A/swine/Nebraska/A01378047/2021|1B.2.2.2|LAIV-98|LLLLPT|2021-01-28 2021.08 1994.4302255610737 -4.215913494082321 4 | A/swine/Iowa/A02478443/2019|1A.3.3.2|LAIV-98|LLLLPP|2019-04-26 2019.32 1993.7862012468288 -4.039369525073382 5 | A/swine/Nebraska/A02479104/2020|1A.2-3-like|LAIV-98|LLLLPT|2020-02-25 2020.15 1994.4343152730062 -4.068143334517119 6 | A/swine/Iowa/A02525361/2021|1B.2.2.1|LAIV-98|TLLTPT|2021-04-09 2021.27 1995.965872599665 -4.003036213590898 7 | A/swine/Indiana/A02636016/2021|1A.1.1|LAIV-98|TTTPPT|2021-08-02 2021.58 1993.7861821241984 -4.396897696196047 8 | A/swine/Iowa/A02524534/2020|1B.2.2.2|LAIV-98|LLLPPT|2020-08-12 2020.61 1994.8706124236035 -4.071893053407719 9 | A/swine/Iowa/A02271349/2018|1A.2-3-like|pdm|LLPLPP|2018-12-04 2018.92 1990.468299250318 -4.500972771648383 10 | 
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Flu-crew at the National Animal Disease Center 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /treesort/lm_outlier_detector.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import numpy as np 3 | from sklearn.linear_model import LinearRegression 4 | 5 | 6 | # This is an earlier idea of using linear regression outliers for reassortment detection. 
class LMOutlierDetector(object):
    """
    Flags reassortment candidates as outliers of a linear regression fitted between
    sibling distances in two segment trees (legacy approach; see module comment).
    """
    trained_reg: LinearRegression  # regression fitted from segment-1 to segment-2 distances
    iqd: float  # inter-quartile distance of the training residuals
    q3: float  # third quartile of the training residuals

    def __init__(self, sibling_dists_s1: np.ndarray, sibling_dists_s2: np.ndarray):
        # Require a minimal sample to make the quartile estimates meaningful.
        assert len(sibling_dists_s1) >= 10
        self.sibling_dists_s1 = sibling_dists_s1
        self.sibling_dists_s2 = sibling_dists_s2

        xs = sibling_dists_s1.reshape(-1, 1)
        regressor: LinearRegression = LinearRegression(fit_intercept=True).fit(
            xs, sibling_dists_s2.reshape(-1, 1))
        residuals: np.ndarray = sibling_dists_s2 - regressor.predict(xs).reshape(-1)
        residuals.sort()
        print(residuals)
        # Quartiles are taken directly from the sorted residuals (no interpolation).
        q1 = residuals[round(len(residuals) / 4) - 1]
        q3 = residuals[round(len(residuals) * 3 / 4) - 1]
        self.iqd = q3 - q1
        self.trained_reg = regressor
        self.q3 = q3
        print(f'IQD {self.iqd}, Q1 {q1}, Q3 {self.q3}')

    def get_residual(self, x: float, y: float) -> float:
        """Residual of the point (x, y) relative to the trained regression line."""
        predicted = self.trained_reg.predict(np.array([[x]], dtype=float))
        return y - predicted[0, 0]

    def is_outlier(self, x: float, y: float, iqd_mult=2) -> bool:
        """True iff the point lies more than iqd_mult inter-quartile distances above Q3."""
        threshold = self.q3 + self.iqd * iqd_mult
        return self.get_residual(x, y) >= threshold
# -*- coding: utf-8 -*-
from dendropy import TaxonNamespace, Tree


class InvalidArgumentError(Exception):
    """Raised when a function receives an argument value it cannot work with."""

    def __init__(self, name: str, value: str, message=''):
        super(InvalidArgumentError, self).__init__('Invalid argument %s: %s; %s' % (name, value, message))
        self.name = name
        self.value = value


class TreeIndexer:
    """
    Adds 'index' field to all nodes and taxa on the passed trees.
    It ensures that the taxa are indexed consistently across trees.
    """

    def __init__(self, taxon_namespace: TaxonNamespace):
        # Fix one global taxon ordering up front so that every indexed tree agrees on it.
        self.label_mapping = {taxon.label: index for index, taxon in enumerate(taxon_namespace)}

    def index_tree(self, tree: Tree):
        """Assigns 'index' to every taxon and node of the tree (internal nodes first, then leaves)."""
        for leaf_node in tree.leaf_nodes():
            label = leaf_node.taxon.label
            if label not in self.label_mapping:
                print(leaf_node.taxon.label)
                print(self.label_mapping)
                raise InvalidArgumentError('tree', '', 'Input tree should be over the initially specified taxon set')
            leaf_node.taxon.index = self.label_mapping[label]

        # Internal nodes get the smaller ids (in postorder); the leaves follow.
        node_id = 0
        for node in tree.postorder_internal_node_iter():
            node.index = node_id
            node_id += 1
        for node in tree.leaf_node_iter():
            node.index = node_id
            node_id += 1
        tree.annotations.add_new('indexed', str(True))

    @staticmethod
    def is_indexed(tree: Tree) -> bool:
        """True iff index_tree() was previously run on this tree (checked via a tree annotation)."""
        return tree.annotations.get_value('indexed', str(False)) == str(True)
The method assumes strict molecular clock (but with deviation) 11 | :param subs: Number of observed substitutions in the second gene segment 12 | :param sites: Number of sites in the second gene segment 13 | :param ml_distance: Expected number of substitutions per site in the first gene segment 14 | :param rate_ratio: Ratio in global substitution rates between the second and first segments 15 | :param pvalue_threshold: p-values below this threshold will be inferred as reassortments 16 | :param allowed_deviation: Should be >=1: allowed deviation from the strict molecular clock in each segment 17 | :return: the pvalue of observing the number of subs over the ml_distance edge. 18 | """ 19 | if ml_distance < 1 / sites: 20 | ml_distance = 1 / sites 21 | max_deviation = allowed_deviation * allowed_deviation 22 | sub_probability = 0.75 - 0.75 * (math.exp(-(4 * ml_distance * rate_ratio * max_deviation) / 3)) 23 | pvalue = binomtest(subs, sites, p=sub_probability, alternative='greater').pvalue 24 | # if pvalue < 0.001: 25 | # print(subs, sites, ml_distance, sub_probability, pvalue) 26 | return pvalue 27 | 28 | 29 | class JCReassortmentTester(object): 30 | 31 | def __init__(self, sites: int, rate_ratio: float, pvalue_threshold: float, allowed_deviation: float): 32 | self.sites = sites 33 | self.rate_ratio = rate_ratio 34 | self.pvalue_threshold = pvalue_threshold 35 | self.allowed_deviation = allowed_deviation 36 | 37 | def is_reassorted(self, subs: int, ml_distance: float) -> Tuple[bool, float]: 38 | pvalue = jc_pvalue(subs, self.sites, ml_distance, self.rate_ratio, self.allowed_deviation) 39 | return (pvalue < self.pvalue_threshold), pvalue 40 | -------------------------------------------------------------------------------- /treetime-root.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | @author: Alexey Markin 4 | """ 5 | import sys 6 | import os 7 | import subprocess 8 | from Bio import SeqIO 9 | 
#!/usr/bin/env python
"""
@author: Alexey Markin
"""
import sys
import os
import subprocess
from Bio import SeqIO
from datetime import datetime
import re


def extract_dates(path: str, format='%Y-%m-%d') -> str:
    """
    Scans '|'-delimited FASTA record names for a date-looking token, converts it to a
    decimal year, and writes a TreeTime-compatible dates CSV next to the input file.
    :param path: path to the FASTA file.
    :param format: unused; kept for backward compatibility of the signature.
    :return: path to the generated dates CSV file.
    """
    records = SeqIO.parse(path, 'fasta')
    file_name = path + '.dates.csv'
    # 'with' guarantees the file is closed even if parsing raises midway.
    with open(file_name, 'w+') as dates_file:
        dates_file.write('name, date\n')
        for record in records:
            name = record.name
            for token in name.split('|'):
                # A date token: >=4 chars of digits/dashes/slashes, but not a plain long
                # number (which is likely an accession id, not a date).
                if re.fullmatch(r'[\d\-/]{4,}', token) and not re.fullmatch(r'\d{5,}', token):
                    try:
                        if token.count('/') == 2:
                            try:
                                date = datetime.strptime(token, '%m/%d/%Y')
                            except ValueError:
                                date = datetime.strptime(token, '%Y/%m/%d')
                        elif token.count('/') == 1:
                            date = datetime.strptime(token, '%m/%Y')
                        elif token.count('-') == 2:
                            date = datetime.strptime(token, '%Y-%m-%d')
                        elif token.count('-') == 1:
                            date = datetime.strptime(token, '%Y-%m')
                        else:
                            date = datetime.strptime(token, '%Y')
                    except ValueError:
                        print(f"Can't parse date {token} -- skipping")
                        continue
                    # Approximate decimal year (30-day months), matching treesort.helpers.parse_dates.
                    dec_date = date.year + ((date.month - 1) * 30 + date.day) / 365.0
                    dates_file.write('%s, %.2f\n' % (name, dec_date))
    return file_name


def root_tree(tree_path: str, alignment_path: str) -> str:
    """
    Roots the tree with 'treetime clock' using dates parsed from the alignment's record names.
    :param tree_path: path to the input (unrooted) newick tree.
    :param alignment_path: path to the matching FASTA alignment.
    :return: path to the rooted tree file.
    """
    dates_file = extract_dates(alignment_path)
    rooted_tree = alignment_path + '.rooted.tre'
    treetime_dir = alignment_path + '.treetime'
    print(' '.join(['treetime', 'clock', '--tree', tree_path,
                    '--aln', alignment_path, '--dates', dates_file,
                    '--outdir', treetime_dir]))
    subprocess.call(['treetime', 'clock', '--tree', tree_path,
                     '--aln', alignment_path, '--dates', dates_file,
                     '--outdir', treetime_dir], stderr=subprocess.STDOUT)
    os.replace(treetime_dir + '/rerooted.newick', rooted_tree)
    # Bug fix: the declared '-> str' return value was previously missing (function returned None).
    return rooted_tree


if __name__ == '__main__':
    args = sys.argv[1:]
    root_tree(args[0], args[1])
#!/bin/bash

# Usage: ./prepare_dataset.sh [--segments "..." --fast] fasta_path reference_segment outdir
# Using --fast will make all trees to be inferred with FastTree.
# By default (without --fast) the reference tree is inferred with IQ-Tree, which is recommended for better accuracy.
# Example usage: ./prepare_dataset.sh --segments "HA,NA" segments.fasta HA myoutdir
# Example with default segments: ./prepare_dataset.sh segments.fasta HA myoutdir

# These are the default segment names
declare -a segments=("PB2" "PB1" "PA" "HA" "NP" "NA" "MP" "NS")
FAST=0

POSITIONAL_ARGS=()

while [[ $# -gt 0 ]]; do
  case $1 in
    --segments)
      SEGMENTS_STR="$2"
      segments=(${SEGMENTS_STR//,/ })  # intentional word-splitting: "HA,NA" -> (HA NA)
      shift # past argument
      shift # past value
      ;;
    --fast)
      FAST=1
      shift
      ;;
    -*|--*)
      echo "Unrecognized option $1"
      exit 1
      ;;
    *)
      POSITIONAL_ARGS+=("$1") # save positional arg
      shift # past argument
      ;;
  esac
done

set -- "${POSITIONAL_ARGS[@]}"

# Required arguments:
main_fasta="$1" # Provide a path to a fasta file with all segments
ref_seg="$2"    # Name of the segment to use as the reference (typically - HA)
outdir="$3"     # Path to the directory to store the results

# -f: do not fail on the first run, when the directory does not exist yet.
rm -rf "$outdir" # Clear out the directory
mkdir "$outdir"  # Re-create the directory

name=${main_fasta##*/}

# Split out the segments and align them.
# Note: all expansions are quoted so that paths with spaces do not break word-splitting.
for seg in "${segments[@]}"
do
    cat "$main_fasta" | smof grep "|${seg}|" > "${outdir}/${seg}-${name}"
    echo "Aligning ${seg}..."
    mafft --thread 6 "${outdir}/${seg}-${name}" | sed "s/|${seg}|/|/g" > "${outdir}/${seg}-${name}.aln"
    rm "${outdir}/${seg}-${name}"
done

if [ "$FAST" -eq 0 ]; then
    # Build fasttree trees in parallel for non-reference segments
    echo "Building non-reference trees in parallel with FastTree..."
    for seg in "${segments[@]}"
    do
        if [ "$seg" != "$ref_seg" ]; then
            fasttree -nt -gtr -gamma "${outdir}/${seg}-${name}.aln" > "${outdir}/${seg}-${name}.tre" &
        fi
    done
    wait # Wait to finish.

    # Build an IQ-Tree tree for the reference segment. We use the GTR+F+R5 model by default which can be changed
    echo "Building the reference tree with IQ-Tree..."
    iqtree2 -s "${outdir}/${ref_seg}-${name}.aln" -T 6 --prefix "${outdir}/${ref_seg}-${name}" -m GTR+F+R5
    mv "${outdir}/${ref_seg}-${name}.treefile" "${outdir}/${ref_seg}-${name}.tre"
else
    # Build all trees with FastTree in parallel.
    echo "Building trees in parallel with FastTree..."
    for seg in "${segments[@]}"
    do
        fasttree -nt -gtr -gamma "${outdir}/${seg}-${name}.aln" > "${outdir}/${seg}-${name}.tre" &
    done
    wait # Wait to finish.
fi

# Root the trees with a custom rooting script (in parallel)
echo "Rooting trees with TreeTime..."
86 | for seg in "${segments[@]}" 87 | do 88 | python treetime-root.py ${outdir}/${seg}-${name}.tre ${outdir}/${seg}-${name}.aln & 89 | done 90 | wait 91 | 92 | # Create a descriptor file 93 | descriptor=${outdir}/descriptor.csv 94 | for seg in "${segments[@]}" 95 | do 96 | if [ $seg == $ref_seg ]; then 97 | echo -n "*" >> $descriptor 98 | fi 99 | echo "${seg},${seg}-${name}.aln,${seg}-${name}.aln.rooted.tre" >> $descriptor 100 | done 101 | echo "The descriptor file was written to ${descriptor}" 102 | -------------------------------------------------------------------------------- /tutorial/swH1-parsed/HA-swine_H1_HANA.fasta.aln.treetime/outliers.tsv: -------------------------------------------------------------------------------- 1 | given_date apparent_date residual 2 | A/swine/Chile/YA026/2014|1B.2.2|LAIV-98|PPPPPP|2014-01-01 2014.0 1998.1559081042533 -6.609097869366056 3 | A/swine/Oklahoma/A02245006/2019|1A.2-3-like|LAIV-98|TTTPPT|2019-03-07 2019.18 2000.8638006608433 -7.640296132074614 4 | A/swine/Nebraska/A02479104/2020|1A.2-3-like|LAIV-98|LLLLPT|2020-02-25 2020.15 2000.7591344688262 -8.08857515536438 5 | A/swine/Oklahoma/A01785571/2018|1A.2-3-like|LAIV-Classical|LTLLPT|2018-06-12 2018.44 2000.594878879735 -7.443793734002908 6 | A/swine/Oklahoma/A01785573/2018|1A.2-3-like|LAIV-Classical|LTLLPT|2018-06-14 2018.45 2000.594878879735 -7.447965066707222 7 | A/swine/Minnesota/A02245535/2020|1A.2-3-like|LAIV-Classical|LLLPPT|2020-03-19 2020.22 2001.0826973297421 -7.98280564993242 8 | A/swine/Illinois/A02157797/2018|1A.2-3-like|LAIV-Classical|TTLTPT|2018-04-11 2018.28 2000.7569594747958 -7.309443202045201 9 | A/swine/Minnesota/A01785575/2018|1A.2-3-like|LAIV-98|TLLTPT|2018-06-19 2018.46 2000.5948654061 -7.452142019712851 10 | A/swine/Minnesota/A01785613/2018|1A.2-3-like|LAIV-98|TLLTPT|2018-09-10 2018.68 2001.2455868419358 -7.272473778540949 11 | A/swine/Minnesota/A01785608/2018|1A.2-3-like|LAIV-98|TLLTPT|2018-08-30 2018.66 2001.2463119231168 -7.263828657648307 12 | 
A/swine/Nebraska/A02245333/2019|1A.2-3-like|2002A|TLLPPT|2019-11-12 2019.85 2002.2278834603933 -7.350771103953527 13 | A/swine/Nebraska/A01378044/2019|1A.2-3-like|2002B|TLLPPT|2019-05-23 2019.39 2001.4092014353105 -7.50038931011847 14 | A/swine/Nebraska/A02245334/2019|1A.2-3-like|2002B|TLLPPT|2019-11-12 2019.85 2002.0736866884797 -7.415091707710629 15 | A/swine/Iowa/A02432384/2019|1A.2-3-like|LAIV-Classical|TTTPPT|2019-04-17 2019.29 2000.9194220723643 -7.662979250527693 16 | A/swine/Iowa/A02478477/2019|1A.2-3-like|LAIV-Classical|TTTPPT|2019-05-07 2019.35 2001.0811593392405 -7.620541251671756 17 | A/swine/Texas/A01104132/2019|1A.2-3-like|LAIV-Classical|LTTLPT|2019-05-30 2019.41 2000.9216749140646 -7.712095507752426 18 | A/swine/California/A02478680/2019|1A.2-3-like|LAIV-Classical|LLLLPT|2019-09-17 2019.7 2001.5726720603977 -7.561511587488909 19 | A/swine/Nebraska/A02157974/2018|1A.2-3-like|LAIV-Classical|TLLLLT|2018-04-18 2018.3 2000.9213206073623 -7.2492253706956555 20 | A/swine/Iowa/A02271349/2018|1A.2-3-like|pdm|LLPLPP|2018-12-04 2018.92 2000.5947231843954 -7.644082649512495 21 | A/swine/South_Dakota/A02156993/2018|1A.2-3-like|LAIV-Classical|TLLLPT|2018-03-22 2018.22 2000.5949088211464 -7.352011924950756 22 | A/swine/Illinois/A02431144/2019|1A.2-3-like|LAIV-Classical|TLLTPT|2019-02-22 2019.14 2000.7567446451687 -7.668267427194577 23 | A/swine/Texas/A01785906/2019|1A.2-3-like|LAIV-Classical|LLLLPP|2019-01-23 2019.06 2000.5947903030597 -7.70245330994396 24 | A/swine/North_Carolina/A02245875/2021|1A.2-3-like|LAIV-Classical|TTTTPT|2021-01-21 2021.06 2000.5947785760068 -8.536724742535252 25 | A/swine/North_Carolina/A02479173/2020|1A.2-3-like|pdm|PPPLPP|2020-03-25 2020.23 2000.4329637109724 -8.258002491938507 26 | A/swine/Nebraska/A02257618/2018|1A.2-3-like|LAIV-Classical|LLLLPT|2018-09-21 2018.72 2000.9193604429593 -7.425238994061569 27 | A/swine/Iowa/A02254795/2018|1A.2-3-like|LAIV-Classical|LLLLPT|2018-07-30 2018.58 2001.0815498251486 -7.299185748781743 28 | 
# -*- coding: utf-8 -*-
from datetime import datetime
import re

from Bio import SeqIO
from dendropy import Tree, Node
from typing import Union, List


def get_median(l: List[float]) -> float:
    """Returns the median of a (possibly unsorted) list of numbers."""
    l = sorted(l)
    l_size = len(l)
    if l_size % 2 == 1:
        return l[l_size // 2]
    else:
        return (l[l_size // 2 - 1] + l[l_size // 2]) / 2


def compute_sampling_density(tree: Union[str, Tree]) -> float:
    """
    Reports the median edge length of the tree.
    :param tree: either a path to a newick tree or a dendropy Tree object.
    :raises ValueError: if 'tree' is neither of the above.
    """
    if isinstance(tree, str):
        tree: Tree = Tree.get(path=tree, schema='newick', preserve_underscores=True)
    elif not isinstance(tree, Tree):
        raise ValueError('"tree" should be either a path to a newick tree or a dendropy Tree object.')
    # Bug fix: skip edges without a length (e.g., the root edge has edge_length=None),
    # which previously made the median computation raise a TypeError on sorting.
    edge_lengths = [node.edge_length for node in tree.postorder_node_iter() if node.edge_length is not None]
    return get_median(edge_lengths)


def collapse_zero_branches(tree: Tree, threshold=1e-7):
    """Collapses internal edges shorter than 'threshold' (multifurcates the tree)."""
    tree.collapse_unweighted_edges(threshold)


def parse_dates(aln_path: str):
    """
    Extracts collection dates from the '|'-delimited sequence names in a FASTA file.
    :param aln_path: path to the FASTA alignment.
    :return: dict mapping sequence names to decimal-year dates; names without a
             parseable date token are silently skipped.
    """
    records = SeqIO.parse(aln_path, 'fasta')
    dates = {}
    for record in records:
        name = record.name
        date = None
        for token in name.split('|'):
            # A date token: >=4 chars of digits/dashes/slashes, but not a plain long
            # number (which is likely an accession id, not a date).
            if re.fullmatch(r'[\d\-/]{4,}', token) and not re.fullmatch(r'\d{5,}', token):
                # Robustness fix: a malformed token previously crashed the whole parse
                # with an uncaught ValueError; now it is skipped (consistent with treetime-root.py).
                try:
                    if token.count('/') == 2:
                        try:
                            parsed = datetime.strptime(token, '%m/%d/%Y')
                        except ValueError:
                            parsed = datetime.strptime(token, '%Y/%m/%d')
                    elif token.count('/') == 1:
                        parsed = datetime.strptime(token, '%m/%Y')
                    elif token.count('-') == 2:
                        parsed = datetime.strptime(token, '%Y-%m-%d')
                    elif token.count('-') == 1:
                        parsed = datetime.strptime(token, '%Y-%m')
                    else:
                        parsed = datetime.strptime(token, '%Y')
                except ValueError:
                    continue
                date = parsed  # The last parseable token wins (as before).
        if not date:
            # print(f'No date for {record.id}')
            # TODO: log with low priority level
            pass
        else:
            # Approximate decimal year (30-day months); mirrors treetime-root.py.
            dec_date = date.year + ((date.month - 1) * 30 + date.day) / 365.0
            dates[name] = dec_date
    return dates


# For a binary tree only
def sibling_distance(parent_node: Node) -> float:
    """Sum of the two child edge lengths below 'parent_node' (assumes a binary node)."""
    return parent_node.child_nodes()[0].edge_length + parent_node.child_nodes()[1].edge_length


# Siblings specified.
def sibling_distance_n2(sib1: Node, sib2: Node) -> float:
    """Distance between two nodes that share the same parent."""
    assert sib1.parent_node is sib2.parent_node
    return sib1.edge_length + sib2.edge_length


def aunt_distance(node: Node) -> float:
    """Distance from 'node' to its parent's sibling (assumes a binary tree)."""
    assert node.parent_node and node.parent_node.parent_node
    parent: Node = node.parent_node
    aunt: Node = parent.sibling_nodes()[0]
    return node.edge_length + parent.edge_length + aunt.edge_length


def node_distance(node1: Node, node2: Node) -> float:
    """
    Linear-time algorithm to find a distance between two nodes on the same tree.
    Note: with constant-time LCA computation, one can compute distance in constant time.
    """
    node1_depth = get_node_depth(node1)
    node2_depth = get_node_depth(node2)
    distance = 0
    p1, p2 = node1, node2
    # First walk the deeper node up until both pointers are at the same depth...
    if node1_depth > node2_depth:
        for step in range(node1_depth - node2_depth):
            distance += p1.edge_length
            p1 = p1.parent_node
    elif node2_depth > node1_depth:
        for step in range(node2_depth - node1_depth):
            distance += p2.edge_length
            p2 = p2.parent_node

    # ...then walk both up in lockstep until they meet at the LCA.
    while p1 != p2:
        distance += p1.edge_length
        distance += p2.edge_length
        p1 = p1.parent_node
        p2 = p2.parent_node
    return distance


def node_distance_w_lca(node1: Node, node2: Node, lca: Node) -> float:
    """Distance between two nodes when their lowest common ancestor is already known."""
    distance = 0
    while node1 is not None and node1 is not lca:
        distance += node1.edge_length
        node1 = node1.parent_node

    while node2 is not None and node2 is not lca:
        distance += node2.edge_length
        node2 = node2.parent_node

    # Both walks must have stopped at the lca (not at the root's parent).
    assert node1 and node2
    return distance


def get_node_depth(node: Node) -> int:
    """Number of edges between 'node' and the tree root."""
    depth = 0
    p: Node = node.parent_node
    while p:
        depth += 1
        p = p.parent_node
    return depth


def binarize_tree(tree: Tree, edge_length=0):
    """
    Adds/removes nodes from the tree to make it fully binary (added edges will have length 'edge_length')
    :param tree: Dendropy tree to be made bifurcating.
    """

    # First suppress unifurcations.
    tree.suppress_unifurcations()

    # Now binarize multifurcations.
    node: Node
    for node in tree.postorder_node_iter():
        if node.child_nodes() and len(node.child_nodes()) > 2:
            num_children = len(node.child_nodes())
            children = node.child_nodes()
            interim_node = node
            # Creates a caterpillar structure with children on the left of the trunk:
            for child_ind in range(len(children) - 2):
                new_node = Node(edge_length=edge_length)
                interim_node.set_child_nodes([children[child_ind], new_node])
                interim_node = new_node
            interim_node.set_child_nodes(children[num_children - 2:])
nodes from the tree to make it fully binary (added edges will have length 'edge_length') 135 | :param tree: Dendropy tree to be made bifurcating. 136 | """ 137 | 138 | # First suppress unifurcations. 139 | tree.suppress_unifurcations() 140 | 141 | # Now binarize multifurcations. 142 | node: Node 143 | for node in tree.postorder_node_iter(): 144 | if node.child_nodes() and len(node.child_nodes()) > 2: 145 | num_children = len(node.child_nodes()) 146 | children = node.child_nodes() 147 | interim_node = node 148 | # Creates a caterpillar structure with children on the left of the trunk: 149 | for child_ind in range(len(children) - 2): 150 | new_node = Node(edge_length=edge_length) 151 | interim_node.set_child_nodes([children[child_ind], new_node]) 152 | interim_node = new_node 153 | interim_node.set_child_nodes(children[num_children - 2:]) 154 | -------------------------------------------------------------------------------- /treesort/reassortment_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import math 3 | import random as rnd 4 | from typing import List, Optional 5 | from dendropy import Tree, Node, Edge 6 | import numpy as np 7 | from scipy.optimize import minimize, LinearConstraint 8 | import warnings 9 | 10 | from treesort.tree_indexer import TreeIndexer 11 | from treesort.helpers import sibling_distance 12 | 13 | 14 | def compute_rea_rate_simple(annotated_tree: Tree, evol_rate: float, ignore_top_edges=1) -> float: 15 | """ 16 | A simpler way to compute the reassortment rate: the number of detected events divided by the total size of the tree (in years) 17 | :ignore_top_edges: the longest x percent of edges will not be counted for. 
18 | """ 19 | edge_cutoff = math.inf 20 | if ignore_top_edges > 0: 21 | edge_lengths = sorted([node.edge_length for node in annotated_tree.postorder_node_iter() if node.edge_length]) 22 | top_percentile = min(len(edge_lengths) - 1, int(round(len(edge_lengths) * (1.0 - ignore_top_edges / 100)))) 23 | edge_cutoff = edge_lengths[top_percentile] 24 | 25 | # Compute the number of reassortment events detected (a ?-only edge counts as 0.5). 26 | rea_events = 0 27 | for node in annotated_tree.postorder_node_iter(): 28 | if node is annotated_tree.seed_node: 29 | continue # Skip the root edge 30 | edge: Edge = node.edge 31 | if edge and node.edge_length >= edge_cutoff: 32 | continue # Skip the edge if its in the top percentile. 33 | if edge.annotations.get_value('is_reassorted', '0') == '1': 34 | rea_annotation = edge.annotations.get_value('rea').strip('"') 35 | is_uncertain = all([g_str.startswith('?') for g_str in rea_annotation.split(',')]) # Is this 100% uncertain reassortment? 36 | if not is_uncertain: 37 | rea_events += 1 38 | else: 39 | rea_events += 0.5 40 | 41 | # Compute the total tree length (phylogenetic diversity) 42 | tree_length = 0 43 | for node in annotated_tree.postorder_node_iter(): 44 | if node is not annotated_tree.seed_node: 45 | if node.edge_length and node.edge_length >= edge_cutoff: 46 | continue # Skip the edge if its in the top percentile. 
47 | tree_length += node.edge_length 48 | 49 | # print(f'{rea_events}, {tree_length}, {evol_rate}') 50 | rea_rate_per_lineage_per_year = (rea_events / tree_length * evol_rate) if tree_length > 0 else 0.0 51 | return rea_rate_per_lineage_per_year 52 | 53 | 54 | def likelihood_binary(x, rea_events, edge_lengths): 55 | func = 0 56 | if x < 1e-10: 57 | return np.inf 58 | for i in range(len(rea_events)): 59 | if edge_lengths[i] > 0: 60 | if rea_events[i] > 0: 61 | with warnings.catch_warnings(): 62 | warnings.filterwarnings('error') 63 | try: 64 | func -= np.log(1 - np.exp(-1 * x * edge_lengths[i])) 65 | except Warning: 66 | # print(x) 67 | func += np.inf 68 | else: 69 | func -= (-1 * x * edge_lengths[i]) 70 | elif rea_events[i] > 0: 71 | # print('+1') 72 | pass 73 | return func 74 | 75 | 76 | def compute_rea_rate_binary_mle(annotated_tree: Tree, evol_rate: float, ref_seg_len=1700) -> Optional[float]: 77 | rea_events = [] # reassortment events per branches (1 - at least one event, 0 - no events). 78 | edge_lengths = [] # Corresponding branch lengths (the two arrays are coupled). 79 | processed_uncertain = set() 80 | node: Node 81 | for node in annotated_tree.postorder_node_iter(): 82 | if node is annotated_tree.seed_node: 83 | continue # Skip the root edge 84 | is_uncertain = False 85 | edge: Edge = node.edge 86 | if edge.annotations.get_value('is_reassorted', '0') == '1': 87 | rea_annotation = edge.annotations.get_value('rea').strip('"') 88 | is_uncertain = all( 89 | [g_str.startswith('?') for g_str in rea_annotation.split(',')]) # Is this 100% uncertain reassortment? 90 | if not is_uncertain: # Uncertain branches are handled below. 91 | rea_events.append(1) 92 | else: 93 | rea_events.append(0) 94 | 95 | edge_length = node.edge_length 96 | if is_uncertain: 97 | # check if the sister edge was already processed. 
98 | siblings = node.parent_node.child_nodes() 99 | sibling = siblings[0] if siblings[0] is not Node else siblings[1] 100 | if sibling not in processed_uncertain: 101 | # log the event over the two sister branches 102 | rea_events.append(1) 103 | edge_length = sibling_distance(node.parent_node) 104 | processed_uncertain.add(node) 105 | else: 106 | continue # Skip if already processed. 107 | 108 | if edge_length > 1e-7: 109 | edge_lengths.append(edge_length / evol_rate) 110 | elif rea_events[-1] > 0: 111 | # If reassortment happened on too short of an edge, this can mess up the likelihood function. 112 | # Replace branch length with (1 / ref_seg_len), e.g., 1 / 1700 for HA (1 substitution). 113 | edge_lengths.append((1 / ref_seg_len) / evol_rate) 114 | else: 115 | edge_lengths.append(0) 116 | 117 | # print(len(rea_events), len(edge_lengths)) 118 | est = compute_rea_rate_simple(annotated_tree, evol_rate, ignore_top_edges=1) 119 | np_est = np.array([est]) 120 | linear_constraint = LinearConstraint([[1]], [0]) 121 | num_est = minimize(likelihood_binary, np_est, args=(rea_events, edge_lengths), tol=1e-9, 122 | constraints=[linear_constraint]) 123 | if num_est.success: 124 | return num_est.x[0] 125 | else: 126 | return None 127 | -------------------------------------------------------------------------------- /treesort/parsimony.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from typing import Tuple 3 | 4 | import numpy as np 5 | from dendropy import Tree, Node, DnaCharacterMatrix 6 | from dendropy.model import parsimony 7 | import random as rnd 8 | 9 | # character_sets_annotation = 'character_sets' 10 | 11 | 12 | def compute_parsimony_edge_lengths(tree: Tree, aln_path: str) -> np.ndarray: 13 | """ 14 | Compute the parsimony score of the tree given an alignment and find associated edge-lengths. 15 | :param tree: Tree topology to be scored by parsimony. Must be BINARY. 
16 | :param aln_path: DNA alignment for the tips of the tree 17 | :return: A dictionary that specifies # of parsimony substitutions per node (except the root). 18 | """ 19 | tree_copy: Tree = tree.clone() 20 | taxon_characters: DnaCharacterMatrix = DnaCharacterMatrix.get_from_path(aln_path, schema='fasta', 21 | taxon_namespace=tree_copy.taxon_namespace) 22 | taxon_states = taxon_characters.taxon_state_sets_map(gaps_as_missing=True) 23 | p_score = parsimony.fitch_down_pass(tree_copy.postorder_node_iter(), taxon_state_sets_map=taxon_states) 24 | print(p_score) 25 | 26 | edge_lengths = np.zeros(len(tree.nodes()), dtype=int) 27 | p_score_2 = 0 28 | node: Node 29 | for node in tree_copy.preorder_node_iter(): 30 | edge_len = 0 31 | parent: Node = node.parent_node 32 | for site in range(taxon_characters.sequence_size): 33 | if parent: 34 | parent_state = parent.state_sets[site] 35 | if parent_state in node.state_sets[site]: 36 | node.state_sets[site] = parent_state 37 | continue 38 | else: 39 | edge_len += 1 40 | p_score_2 += 1 41 | # choose a random state and assign 42 | state_sets = list(node.state_sets[site]) 43 | rnd_state = rnd.choice(state_sets) 44 | node.state_sets[site] = rnd_state 45 | if parent: 46 | cluster = {leaf.taxon.label for leaf in node.leaf_nodes()} 47 | original_node = tree.find_node(filter_fn=lambda tree_node: {leaf.taxon.label for leaf in tree_node.leaf_nodes()} == cluster) 48 | edge_lengths[original_node.index] = edge_len 49 | print(p_score, p_score_2) 50 | return edge_lengths 51 | 52 | 53 | def compute_parsimony_sibling_dist(tree: Tree, aln_path: str, schema='fasta') -> Tuple[np.ndarray, np.ndarray, np.ndarray]: 54 | # tree must be binary! 
55 | tree_copy: Tree = tree.clone() 56 | taxon_characters: DnaCharacterMatrix = DnaCharacterMatrix.get_from_path(aln_path, schema=schema, 57 | taxon_namespace=tree_copy.taxon_namespace) 58 | taxon_states = taxon_characters.taxon_state_sets_map(gaps_as_missing=True) 59 | p_score = parsimony.fitch_down_pass(tree_copy.postorder_node_iter(), taxon_state_sets_map=taxon_states) 60 | # print(p_score) 61 | 62 | children_dists = np.zeros(len(tree.internal_nodes()), dtype=int) 63 | child1_to_sibling_dists, child2_to_sibling_dists = np.zeros(len(tree.internal_nodes()), dtype=int),\ 64 | np.zeros(len(tree.internal_nodes()), dtype=int) 65 | p_score_2 = 0 66 | node: Node 67 | for node in tree_copy.preorder_internal_node_iter(): 68 | children_dist = 0 69 | child1_to_sibling, child2_to_sibling = 0, 0 70 | sibling = node.sibling_nodes()[0] if node.parent_node else None 71 | if not sibling: 72 | child1_to_sibling, child2_to_sibling = -1, -1 73 | child1, child2 = node.child_nodes() 74 | for site in range(taxon_characters.sequence_size): 75 | if len(child1.state_sets[site].intersection(child2.state_sets[site])) == 0: 76 | children_dist += 1 77 | p_score_2 += 1 78 | if sibling and len(child1.state_sets[site].intersection(sibling.state_sets[site])) == 0: 79 | child1_to_sibling += 1 80 | if sibling and len(child2.state_sets[site].intersection(sibling.state_sets[site])) == 0: 81 | child2_to_sibling += 1 82 | # cluster = {leaf.taxon.label for leaf in node.leaf_nodes()} 83 | # original_node = tree.find_node(filter_fn=lambda tree_node: {leaf.taxon.label for leaf in tree_node.leaf_nodes()} == cluster) 84 | children_dists[node.index] = children_dist 85 | child1_to_sibling_dists[node.index] = child1_to_sibling 86 | child2_to_sibling_dists[node.index] = child2_to_sibling 87 | # print(p_score, p_score_2) 88 | # TODO: if p_score != p_score_2: log a warning (debug only) 89 | return children_dists, child1_to_sibling_dists, child2_to_sibling_dists 90 | 91 | 92 | def get_cluster_str(node: Node) -> 
str: 93 | return ';'.join(sorted([leaf.taxon.label for leaf in node.leaf_nodes()])) 94 | 95 | # 96 | # if __name__ == '__main__': 97 | # # seg1_tree_path = '../simulations/segs2/l1500/sim_250_10/sim_1.trueSeg1.tre' 98 | # # schema = 'nexus' 99 | # # seg1_path = '../simulations/segs2/l1500/sim_250_10/sim_1.seg1.alignment.fasta' 100 | # # seg2_path = '../simulations/segs2/l1500/sim_250_10/sim_1.seg2.alignment.fasta' 101 | # # simulated = True 102 | # seg1_tree_path = '../../gammas/HAs.fast.rooted.tre' 103 | # schema = 'newick' 104 | # seg1_path = '../../gammas/HAs_unique.aln' 105 | # seg2_path = '../../gammas/NAs_unique.aln' 106 | # simulated = False 107 | # na_ha_ratio = 1.057 108 | # 109 | # tree: Tree = Tree.get(path=seg1_tree_path, schema=schema, preserve_underscores=True) 110 | # binarize_tree(tree) # Randomly binarize. 111 | # tree_indexer = TreeIndexer(tree.taxon_namespace) 112 | # tree_indexer.index_tree(tree) 113 | # if simulated: 114 | # node: Node 115 | # for node in tree.nodes(): 116 | # if node.edge_length: 117 | # node.edge_length *= 0.00474 118 | # 119 | # # lengths_by_node_s1 = compute_parsimony_edge_lengths(tree, 'testdata/l1000_50_5/sim_1.seg4.alignment.fasta') 120 | # # lengths_by_node_s2 = compute_parsimony_edge_lengths(tree, 'testdata/l1000_50_5/sim_1.seg1.alignment.fasta') 121 | # child_dists_s1, child1_dists_s1, child2_dists_s1 = compute_parsimony_sibling_dist(tree, seg1_path) 122 | # child_dists_s2, child1_dists_s2, child2_dists_s2 = compute_parsimony_sibling_dist(tree, seg2_path) 123 | # seg2_aln = list(SeqIO.parse(seg2_path, format='fasta')) 124 | # seg2_len = len(seg2_aln[0]) 125 | # 126 | # node_by_index = {} 127 | # cluster_by_index = {} 128 | # for node in tree.postorder_node_iter(): 129 | # node_by_index[node.index] = node 130 | # cluster_by_index[node.index] = get_cluster_str(node) 131 | # 132 | # # s1_lengths = np.zeros(len(lengths_by_node_s1)) 133 | # # node: Node 134 | # # for node in tree.postorder_internal_node_iter(): 135 | # # 
cluster = [leaf.taxon.label for leaf in node.leaf_nodes()] 136 | # # # print(cluster, lengths_by_node_s1[node.index], lengths_by_node_s2[node.index]) 137 | # # print(node.index, sorted(cluster), 138 | # # f'c1 {node.child_nodes()[0].index}({child1_dists_s1[node.index]}, {child1_dists_s2[node.index]})', 139 | # # f'c2 {node.child_nodes()[1].index}({child2_dists_s1[node.index]}, {child2_dists_s2[node.index]})') 140 | # 141 | # outlier_detector = LMOutlierDetector(child_dists_s1, child_dists_s2) 142 | # outliers = [(ind, outlier_detector.get_residual(child_dists_s1[ind], child_dists_s2[ind])) 143 | # for ind in range(child_dists_s1.size) if 144 | # outlier_detector.is_outlier(child_dists_s1[ind], child_dists_s2[ind], iqd_mult=3)] 145 | # jc_outliers = [(node.index, -1) for node in tree.internal_nodes() if 146 | # is_jc_outlier(child_dists_s2[node.index], seg2_len, helpers.sibling_distance(node), 147 | # rate_ratio=na_ha_ratio, pvalue_threshold=0.001)] 148 | # jc_outlier_indices = [x[0] for x in jc_outliers] 149 | # outliers = sorted(outliers, key=lambda x: x[1], reverse=True) 150 | # print(len(outliers)) 151 | # print(len(jc_outliers)) 152 | # for outlier_ind, residual in jc_outliers: 153 | # outlier_node = node_by_index[outlier_ind] 154 | # is_c1_rea = outlier_detector.is_outlier(child1_dists_s1[outlier_ind], child1_dists_s2[outlier_ind], iqd_mult=1.7) 155 | # is_c2_rea = outlier_detector.is_outlier(child2_dists_s1[outlier_ind], child2_dists_s2[outlier_ind], iqd_mult=1.7) 156 | # print(outlier_ind, residual, is_c1_rea, cluster_by_index[outlier_node.child_nodes()[0].index], 157 | # is_c2_rea, cluster_by_index[outlier_node.child_nodes()[1].index]) 158 | # 159 | # jc_colors = ['red' if i in jc_outlier_indices else 'blue' for i in range(len(tree.internal_nodes()))] 160 | # plt.scatter(child_dists_s1, child_dists_s2, c=jc_colors) 161 | # for ind in range(len(child_dists_s1)): 162 | # plt.annotate(str(ind), (child_dists_s1[ind], child_dists_s2[ind] + 0.2)) 163 | # 
plt.show() 164 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TreeSort # 2 | 3 |  4 | 5 | The tool infers both recent and ancestral reassortment events along the branches of a phylogenetic tree of a fixed genomic segment. 6 | It uses a statistical hypothesis testing framework to identify branches where reassortment with other segments has occurred and reports these events. 7 | 8 | 11 | 12 | Below is an example of 2 reassortment events inferred by TreeSort on a swine H1 dataset. The reference phylogeny is the hemagglutinin (HA) segment tree, and the branch annotations indicate reassortment relative to the HA's evolutionary history. The annotations list the acquired gene segments and how distant these segments were (# of nucleotide differences) from the original segments. For example, `PB2(136)` indicates that a new PB2 was acquired that was approximately 136 nucleotides different from the pre-reassortment PB2. 13 | 14 | 15 |
17 |
116 |
125 |