├── Academic_Use_Licence.pdf
├── setup.py
├── README.md
└── stride
    ├── probroot.py
    └── stride.py

/Academic_Use_Licence.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/davidemms/STRIDE/HEAD/Academic_Use_Licence.pdf

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(name='stride',
4 |       version='1.0.0',
5 |       description='',
6 |       url='https://github.com/davidemms/STRIDE',
7 |       author='David Emms',
8 |       author_email='david_emms@hotmail.com',
9 |       license='Oxford University academic use licence',
10 |       packages=['stride'],
11 |       zip_safe=False,
12 |       entry_points = {
13 |           'console_scripts': ['stride=stride.stride:main']})
14 | 
15 | 

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # STRIDE: Species Tree Root Inference from Gene Duplication Events
2 | 
3 | 
4 | The correct interpretation of a phylogenetic tree depends on it being correctly rooted. STRIDE takes an unrooted species tree and a set of unrooted gene trees and identifies well-supported gene duplication events within the gene trees to infer the root of the species tree.
5 | 
6 | A gene duplication event at the base of a clade of species is synapomorphic, and thus excludes the root of the species tree from that clade. STRIDE is a fast, effective, and outgroup-free method for species tree root inference from gene duplication events. On the test datasets, running on a typical 4-core desktop, it analysed 14,454 gene trees covering 47 species in ~25s.
7 | 
8 | Test datasets together with a script to run all the datasets can be downloaded from [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.581360.svg)](https://doi.org/10.5281/zenodo.581360).
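
The exclusion rule above can be illustrated with a toy example. This is only a minimal sketch, not STRIDE's implementation: the species names, the duplication counts and the helper `contradicts` are hypothetical, and the compatibility test (a duplicated clade must lie entirely on one side of the candidate root branch) is assumed from the description above.

```python
from collections import Counter

def contradicts(root_side, dup_clade, all_species):
    """True if a duplication at the base of dup_clade is incompatible with a
    root placed on the branch separating root_side from the other species."""
    other_side = all_species - root_side
    return not (dup_clade <= root_side or dup_clade <= other_side)

all_species = frozenset("ABCD")

# Hypothetical counts of well-supported duplications, keyed by species clade.
duplications = Counter({frozenset("AB"): 3, frozenset("ABC"): 1})

# Branches of the unrooted tree ((A,B),(C,D)), each named by one of its sides.
branches = [frozenset(s) for s in ("A", "B", "C", "D", "AB")]

# For each candidate root branch, count the duplications that contradict it.
contradictions = {
    branch: sum(n for clade, n in duplications.items()
                if contradicts(branch, clade, all_species))
    for branch in branches
}

best = min(branches, key=lambda b: contradictions[b])
print(sorted(best), contradictions[best])
```

In this toy dataset only the terminal branch leading to species D is compatible with every duplication, so it is the least-contradicted root.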
9 | 
10 | ### Usage:
11 | STRIDE requires Python plus NumPy, SciPy and the ete (version 2 or 3) tree library. To run STRIDE:
12 | 
13 | `stride.py -s gene_to_species_conversion -S Species_tree.tre -d gene_trees/`
14 | 
15 | STRIDE needs to be able to map the genes in the gene trees to the species in the species tree. Gene names in the gene trees should start with the name of the species they come from. Use the -s option to tell STRIDE how to do the mapping:
16 | 
17 | - dot: SpeciesName.GeneName -> SpeciesName
18 | 
19 | - dash: SpeciesName_GeneName -> SpeciesName
20 | 
21 | - second_dash: Species_Name_GeneName -> Species_Name
22 | 
23 | - 3rd_dash: Species_X_Name_GeneName -> Species_X_Name
24 | 

--------------------------------------------------------------------------------
/stride/probroot.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Sep 8 11:00:11 2016
4 | 
5 | @author: david
6 | """
7 | import numpy as np
8 | from scipy.special import beta
9 | 
10 | """
11 | T - topology of the tree
12 | B - set of branches in the tree
13 | D - set of all duplications, {d_i | i \in B}
14 | d_i - the duplications on branch i = (x_i, y_i)
15 | O_i(j), where i, j \in B, is the function O_i:B -> {->, r, <-} mapping a branch, j, to the orientation it would have
16 | if i were the root.
17 | 
18 | The probability of a given branch being the root (given the tree topology and the set of duplications) is
19 | the product of the probabilities of the implied/required orientations of all the branches in the tree, each given only the duplications
20 | on that branch individually and the topology of the tree, P_i(o | d_i, T). i is the branch in question, o is its orientation,
21 | d_i are the numbers of duplications on the branch.
22 | 
23 | Pclade = P(clade | D, T)
24 | Pc = P_j(O_i(j) | d_j,T)
25 | 
26 | A.
27 | ==
28 |          clade1
29 |         /
30 |     root
31 |         \
32 |          clade2
33 | 
34 | p =    Pclade1 * Proot * Pclade2
35 |     -----------------------------------
36 |     \sum_j (Pclade1 * Proot * Pclade2)_j
37 | 
38 | 
39 | B.
40 | ==
41 |             clade_i
42 |            /
43 |           i
44 |          /
45 |     clade
46 |          \
47 |           j
48 |            \
49 |             clade_j
50 | 
51 | Pclade = Pclade_i*Pc(O(i)) * Pclade_j*Pc(O(j))
52 | 
53 | C.
54 | ==
55 | 
56 | Pc = P_i(o | d_i, T) = P_i(d_i | o, T) P_i(o | T)
57 |                        --------------------------   ,  X = \sum_{o \in {<-, r, ->}} P_i(d_i | o) P_i(o | T)
58 |                                    X
59 |                        (calculated by Ps_o_G_d, stored in Pc)
60 | 
61 | 
62 | where P_i(d_i | o, T) = P(d_i | o) - independent of the branch, independent of the tree (only depends on orientation)
63 |                                      (could potentially alter it so it is different for, say, terminal branches)
64 | 
65 |       P_i(o | T) - is simply counted on the tree (what proportion of roots would give this orientation)
66 | 
67 | 
68 | ================================ Branch Models ======================================
69 | 
70 | Poisson Model
71 | -------------
72 | 
73 | 
74 | """
75 | 
76 | 
77 | spTreeFormat = 1
78 | 
79 | _lf = [0.000000000000000, 0.000000000000000, 0.693147180559945, 1.791759469228055, 3.178053830347946, 4.787491742782046, 6.579251212010101, 8.525161361065415, 10.604602902745251, 12.801827480081469, 15.104412573075516, 17.502307845873887, 19.987214495661885, 22.552163853123421, 25.191221182738683, 27.899271383840894, 30.671860106080675, 33.505073450136891, 36.395445208033053, 39.339884187199495, 42.335616460753485, 45.380138898476908, 48.471181351835227, 51.606675567764377, 54.784729398112319, 58.003605222980518, 61.261701761002001, 64.557538627006323, 67.889743137181526, 71.257038967168000, 74.658236348830158, 78.092223553315307, 81.557959456115029, 85.054467017581516, 88.580827542197682, 92.136175603687079, 95.719694542143202, 99.330612454787428, 102.968198614513810, 106.631760260643450, 110.320639714757390, 114.034211781461690, 117.771881399745060, 121.533081515438640, 125.317271149356880,
129.123933639127240, 132.952575035616290, 136.802722637326350, 140.673923648234250, 144.565743946344900, 148.477766951773020, 152.409592584497350, 156.360836303078800, 160.331128216630930, 164.320112263195170, 168.327445448427650, 172.352797139162820, 176.395848406997370, 180.456291417543780, 184.533828861449510, 188.628173423671600, 192.739047287844900, 196.866181672889980, 201.009316399281570, 205.168199482641200, 209.342586752536820, 213.532241494563270, 217.736934113954250, 221.956441819130360, 226.190548323727570, 230.439043565776930, 234.701723442818260, 238.978389561834350, 243.268849002982730, 247.572914096186910, 251.890402209723190, 256.221135550009480, 260.564940971863220, 264.921649798552780, 269.291097651019810, 273.673124285693690, 278.067573440366120, 282.474292687630400, 286.893133295426990, 291.323950094270290, 295.766601350760600, 300.220948647014100, 304.686856765668720, 309.164193580146900, 313.652829949878990, 318.152639620209300, 322.663499126726210, 327.185287703775200, 331.717887196928470, 336.261181979198450, 340.815058870798960, 345.379407062266860, 349.954118040770250, 354.539085519440790, 359.134205369575340, 363.739375555563470, 368.354496072404690, 372.979468885689020, 377.614197873918670, 382.258588773060010, 386.912549123217560, 391.575988217329610, 396.248817051791490, 400.930948278915760, 405.622296161144900, 410.322776526937280, 415.032306728249580, 419.750805599544780, 424.478193418257090, 429.214391866651570, 433.959323995014870, 438.712914186121170, 443.475088120918940, 448.245772745384610, 453.024896238496130, 457.812387981278110, 462.608178526874890, 467.412199571608080, 472.224383926980520, 477.044665492585580, 481.872979229887900, 486.709261136839360, 491.553448223298010, 496.405478487217580, 501.265290891579240, 506.132825342034830, 511.008022665236070, 515.890824587822520, 520.781173716044240, 525.679013515995050, 530.584288294433580, 535.496943180169520, 540.416924105997740, 545.344177791154950, 550.278651724285620, 
555.220294146894960, 560.169054037273100, 565.124881094874350, 570.087725725134190, 575.057539024710200, 580.034272767130800, 585.017879388839220, 590.008311975617860, 595.005524249382010, 600.009470555327430, 605.020105849423770, 610.037385686238740, 615.061266207084940, 620.091704128477430, 625.128656730891070, 630.172081847810200, 635.221937855059760, 640.278183660408100, 645.340778693435030, 650.409682895655240, 655.484856710889060, 660.566261075873510, 665.653857411105950, 670.747607611912710, 675.847474039736880, 680.953419513637530, 686.065407301994010, 691.183401114410800, 696.307365093814040, 701.437263808737160, 706.573062245787470, 711.714725802289990, 716.862220279103440, 722.015511873601330, 727.174567172815840, 732.339353146739310, 737.509837141777440, 742.685986874351220, 747.867770424643370, 753.055156230484160, 758.248113081374300, 763.446610112640200, 768.650616799717000, 773.860102952558460, 779.075038710167410, 784.295394535245690, 789.521141208958970, 794.752249825813460, 799.988691788643450, 805.230438803703120, 810.477462875863580, 815.729736303910160, 820.987231675937890, 826.249921864842800, 831.517780023906310, 836.790779582469900, 842.068894241700490, 847.352097970438420, 852.640365001133090, 857.933669825857460, 863.231987192405430, 868.535292100464630, 873.843559797865740, 879.156765776907600, 884.474885770751830, 889.797895749890240, 895.125771918679900, 900.458490711945270, 905.796028791646340, 911.138363043611210, 916.485470574328820, 921.837328707804890, 927.193914982476710, 932.555207148186240, 937.921183163208070, 943.291821191335660, 948.667099599019820, 954.046996952560450, 959.431492015349480, 964.820563745165940, 970.214191291518320, 975.612353993036210, 981.015031374908400, 986.422203146368590, 991.833849198223450, 997.249949600427840, 1002.670484599700300, 1008.095434617181700, 1013.524780246136200, 1018.958502249690200, 1024.396581558613400, 1029.838999269135500, 1035.285736640801600, 1040.736775094367400, 
1046.192096209724900, 1051.651681723869200, 1057.115513528895000, 1062.583573670030100, 1068.055844343701400, 1073.532307895632800, 1079.012946818975000, 1084.497743752465600, 1089.986681478622400, 1095.479742921962700, 1100.976911147256000, 1106.478169357800900, 1111.983500893733000, 1117.492889230361000, 1123.006317976526100, 1128.523770872990800, 1134.045231790853000, 1139.570684729984800, 1145.100113817496100, 1150.633503306223700, 1156.170837573242400]
80 | 
81 | def LogFactorial(n):
82 |     if n < 0: raise NotImplementedError()
83 |     elif n > 254:
84 |         x = float(n + 1)
85 |         return (x - 0.5)*np.log(x) - x + 0.5*np.log(2*np.pi) + 1.0/(12.0*x)
86 |     else:
87 |         return _lf[n]
88 | 
89 | # With thanks to "Numerically Stable Hidden Markov Model Implementation", Tobias P. Mann 2006 for extended logarithm approach
90 | # define None := log(0)
91 | 
92 | def eexp(x):
93 |     if x is None:
94 |         return 0.
95 |     else:
96 |         return np.exp(x)
97 | 
98 | def eln(x):
99 |     if x == 0.:
100 |         return None
101 |     else:
102 |         return np.log(x)
103 | 
104 | def elnsum(elnx, elny):
105 |     if (elnx is None) or (elny is None):
106 |         return elny if elnx is None else elnx
107 |     elif elnx > elny:
108 |         return elnx + eln(1 + np.exp(elny-elnx))
109 |     else:
110 |         return elny + eln(1 + np.exp(elnx-elny))
111 | 
112 | def elnprod(elnx, elny):
113 |     if elnx is None or elny is None:
114 |         return None
115 |     else:
116 |         return elnx + elny
117 | 
118 | def elndiv(elnx, elny):
119 |     if elnx is None:
120 |         return None
121 |     elif elny is None:
122 |         # division by zero is undefined in the extended-log scheme
123 |         print((elnx, elny))
124 |         raise NotImplementedError()
125 |     else:
126 |         return elnx - elny
127 | 
128 | def GetSpeciesName(name):
129 |     if name.count("_") == 2:
130 |         names = name.split("_")[1:]
131 |     elif name.count("_") == 1:
132 |         names = name.split("_")
133 |     elif name.count("_") == 0:
134 |         names = [name[0], name[1:]]
135 |     else:
136 |         names = (name[0], name[1:].lower())
137 |     return names[0][0].upper() + ". 
" + names[1] 137 | 138 | def lnpoisson(k, r): 139 | """ k - events, r - rate""" 140 | return elndiv(k * eln(r) - r, LogFactorial(k)) 141 | 142 | def get_bipartitions(tree): 143 | """ 144 | It returns the set of all possible partitions under a 145 | node. Note that current implementation is quite inefficient 146 | when used in very large trees. 147 | 148 | t = Tree("((a, b), e);") 149 | partitions = t.get_partitions() 150 | 151 | # Will return: 152 | # (a,b,e), () 153 | # (a,e), (b) 154 | # etc 155 | """ 156 | all_leaves = frozenset(tree.get_leaf_names()) 157 | all_partitions = set() 158 | all_partitions.add(frozenset([all_leaves, frozenset()])) 159 | for n in tree.iter_descendants(): 160 | p1 = frozenset(n.get_leaf_names()) 161 | p2 = frozenset(all_leaves - p1) 162 | all_partitions.add(frozenset([p1, p2])) 163 | return all_partitions 164 | 165 | # =============================================================================================================================== 166 | 167 | class BranchProbModel_corrected(object): 168 | """ 169 | Probability of ->, <-, r shouldn't depend on tree 170 | """ 171 | @staticmethod 172 | def P_o_G_T(toA, A, B): 173 | """ 174 | P(orientation | Tree) 175 | 176 | Inputs: 177 | toA - True: o=ToA, False: o=FromA, None: o=root 178 | A - cladeA 179 | B - cladeB 180 | P(o|T) 181 | 182 | Note, will be 0 for terminal branch and time moving away from it 183 | """ 184 | nA = 2.*len(A) - 2 # number of branches in A part of tree 185 | nB = 2.*len(B) - 2 # number of branches in B part of tree 186 | n = nA + nB + 1 # number of branches in tree 187 | if toA == None: 188 | return 1./n 189 | elif (toA and len(B) == 1) or (not toA and len(A) == 1): 190 | return 0. 
191 | else: 192 | return 0.5*(n-1)/n 193 | 194 | # =============================================================================================================================== 195 | 196 | class PoissonModel_IntergrateBranchLenthsSumFP(BranchProbModel_corrected): 197 | """Integrate over the possible position of the root along the length of the root branch and sum over the distributions of 198 | false positives 199 | """ 200 | def __init__(self, rel_FP_rate_nonterm=0.001, rel_FP_rate_term=0.001, qExcludeTerminals=False): 201 | """ 202 | rel_FP_rate - the rate of false positives relative to the rate of true positives, \lambda_{FP} = a \lambda_{TP} 203 | """ 204 | self.a = rel_FP_rate_nonterm 205 | self.a_term = rel_FP_rate_term 206 | self.qExcludeTerminals = qExcludeTerminals 207 | 208 | def P_d_G_o(self, m, n, toM, qTerminal=False): 209 | return eexp(self.lnP_d_G_o(m, n, toM, qTerminal)) 210 | 211 | def lnP_d_G_o(self, m, n, toM, qTerminal=False): 212 | """Pnt(d|o) 213 | m - duplications to M 214 | n - duplications away from N 215 | toM - True: o=ToM, False: o=FromM, None: o=root 216 | """ 217 | N = m+n # total number of events 218 | if N == 0: 219 | return 1. 220 | alpha = self.a_term if qTerminal else self.a 221 | ltp = N/(1+alpha) # lambda_TP 222 | lfp = alpha * ltp # lambda_FP 223 | lltp = eln(ltp) 224 | la = eln(alpha) 225 | if toM == None: 226 | ln_pTot = eln(0.) 
227 | for s in xrange(m+1): 228 | for t in xrange(n+1): 229 | X = m - s + t 230 | Y = n - t + s 231 | integral = beta(X+1, Y+1) 232 | ln_pTot = elnsum(ln_pTot, elnprod(eln(integral), (m+n)*lltp - ltp - lfp + (s+t)*la - LogFactorial(m-s) - LogFactorial(s) - LogFactorial(n-t) - LogFactorial(t))) 233 | return ln_pTot 234 | elif toM: 235 | return elnprod(lnpoisson(m ,ltp), lnpoisson(n, lfp)) 236 | else: 237 | return elnprod(lnpoisson(m ,lfp), lnpoisson(n, ltp)) 238 | 239 | def Ps_o_G_d(self, A, B, countsA, countsB): 240 | """ 241 | *** Resolve Numerical Overflows: 3 Use logarithms *** 242 | See Base class decription for remaining method documentation 243 | """ 244 | if self.qExcludeTerminals: 245 | if len(A) == 1: 246 | countsA = 0 247 | if len(B) == 1: 248 | countsB = 0 249 | isTerminal = len(A) == 1 or len(B) == 1 250 | lnx = elnprod(self.lnP_d_G_o(countsA, countsB, True, isTerminal), eln(self.P_o_G_T(True, A, B))) # P(d|->A, T) 251 | lny = elnprod(self.lnP_d_G_o(countsA, countsB, False, isTerminal), eln(self.P_o_G_T(False, A, B))) # P(d|<-A, T) 252 | lnz = elnprod(self.lnP_d_G_o(countsA, countsB, None, isTerminal), eln(self.P_o_G_T(None, A, B))) # P(d|r, T) 253 | lntot = elnsum(elnsum(lnx, lnz), lny) # both x or y could be small but not x & z (or y & z) 254 | x = eexp(elndiv(lnx, lntot)) 255 | y = eexp(elndiv(lny, lntot)) 256 | z = eexp(elndiv(lnz, lntot)) 257 | return (x, y, z) 258 | 259 | 260 | # =============================================================================================================================== 261 | 262 | class PoissonModel_WithTeminalModel(BranchProbModel_corrected): 263 | """Integrate over the possible position of the root along the length of the root branch and sum over the distributions of 264 | false positives 265 | """ 266 | def __init__(self, rel_FP_rate_nonterm, TP_rate_term, FP_rate_term): 267 | """ 268 | rel_FP_rate - the rate of false positives relative to the rate of true positives, \lambda_{FP} = a \lambda_{TP} 269 | """ 
270 | self.a = rel_FP_rate_nonterm 271 | self.TP_rate_term = TP_rate_term 272 | self.FP_rate_term = FP_rate_term 273 | 274 | def P_d_G_o(self, m, n, toM, qTerminal=False): 275 | if qTerminal: 276 | return eexp(self.lnP_d_G_o_term(m, n)) 277 | else: 278 | return eexp(self.lnP_d_G_o_nonterm(m, n)) 279 | 280 | def lnP_d_G_o_term(self, m_inward, qRoot): 281 | """ 282 | if root: duplications away from species arrive at absolute rate self.TP_rate_term 283 | or 284 | if not root: duplications away from species arrive at absolute rate self.FP_rate_term 285 | """ 286 | if qRoot: 287 | return lnpoisson(m_inward, self.TP_rate_term) 288 | else: 289 | return lnpoisson(m_inward, self.FP_rate_term) 290 | 291 | def lnP_d_G_o(self, m, n, toM): 292 | """Pnt(d|o) 293 | m - duplications to M 294 | n - duplications away from N 295 | toM - True: o=ToM, False: o=FromM, None: o=root 296 | """ 297 | N = m+n # total number of events 298 | if N == 0: 299 | return 1. 300 | alpha = self.a 301 | ltp = N/(1+alpha) # lambda_TP 302 | lfp = alpha * ltp # lambda_FP 303 | lltp = eln(ltp) 304 | la = eln(alpha) 305 | if toM == None: 306 | ln_pTot = eln(0.) 
307 | for s in xrange(m+1): 308 | for t in xrange(n+1): 309 | X = m - s + t 310 | Y = n - t + s 311 | integral = beta(X+1, Y+1) 312 | ln_pTot = elnsum(ln_pTot, elnprod(eln(integral), (m+n)*lltp - ltp - lfp + (s+t)*la - LogFactorial(m-s) - LogFactorial(s) - LogFactorial(n-t) - LogFactorial(t))) 313 | return ln_pTot 314 | elif toM: 315 | return elnprod(lnpoisson(m ,ltp), lnpoisson(n, lfp)) 316 | else: 317 | return elnprod(lnpoisson(m ,lfp), lnpoisson(n, ltp)) 318 | 319 | def Ps_o_G_d(self, A, B, countsA, countsB): 320 | if len(A) == 1 or len(B) == 1: 321 | # Terminal duplications 322 | if len(A) == 1: 323 | lnx = elnprod(self.lnP_d_G_o_term(countsB, False), eln(self.P_o_G_T(True, A, B))) 324 | lny = None 325 | lnz = elnprod(self.lnP_d_G_o_term(countsB, True), eln(self.P_o_G_T(None, A, B))) 326 | else: 327 | lnx = None 328 | lny = elnprod(self.lnP_d_G_o_term(countsA, False), eln(self.P_o_G_T(False, A, B))) 329 | lnz = elnprod(self.lnP_d_G_o_term(countsA, True), eln(self.P_o_G_T(None, A, B))) 330 | else: 331 | lnx = elnprod(self.lnP_d_G_o(countsA, countsB, True), eln(self.P_o_G_T(True, A, B))) # P(d|->A, T) 332 | lny = elnprod(self.lnP_d_G_o(countsA, countsB, False), eln(self.P_o_G_T(False, A, B))) # P(d|<-A, T) 333 | lnz = elnprod(self.lnP_d_G_o(countsA, countsB, None), eln(self.P_o_G_T(None, A, B))) # P(d|r, T) 334 | lntot = elnsum(elnsum(lnx, lnz), lny) # both x or y could be small but not x & z (or y & z) 335 | x = eexp(elndiv(lnx, lntot)) 336 | y = eexp(elndiv(lny, lntot)) 337 | z = eexp(elndiv(lnz, lntot)) 338 | return (x, y, z) 339 | 340 | # =============================================================================================================================== 341 | # Tree-level model/calculation 342 | 343 | def Pc(node, p_cond): 344 | x = frozenset(node.get_leaf_names()) 345 | # print((p_cond[x][0], x, 'edge')) 346 | return p_cond[x][0] 347 | 348 | def Pc_root(node, p_cond): 349 | return p_cond[frozenset(node.get_leaf_names())][2] 350 | 351 | def 
P_clade(node, probs, p_cond): 352 | cluster = frozenset(node.get_leaf_names()) 353 | if len(cluster) == 1: return 1. 354 | if cluster in probs: 355 | return probs[cluster] 356 | else: 357 | l, r = node.children 358 | p = P_clade(l, probs, p_cond)*Pc(l, p_cond) * P_clade(r, probs, p_cond)*Pc(r, p_cond) 359 | probs[cluster] = p 360 | # print((p, cluster)) 361 | return p 362 | 363 | # =============================================================================================================================== 364 | 365 | 366 | def GetBranchProbs(supported_clusters_counter, biparts, probModel): 367 | p_cond = dict() 368 | for A, B in biparts: 369 | if len(A) == 0 or len(B) == 0: continue 370 | x,y,z = probModel.Ps_o_G_d(A, B, supported_clusters_counter[A], supported_clusters_counter[B]) 371 | p_cond[A] = (x,y,z) 372 | p_cond[B] = (y,x,z) 373 | return p_cond 374 | 375 | def GetFinalProbs(biparts, p_cond, tree): 376 | probs_clades = dict() 377 | p_final = dict() # unnormalised until the end 378 | total = 0. 
379 | for A, B in biparts: 380 | if len(A) == 0 or len(B) == 0: continue 381 | # print(A) 382 | # tree for calculation 383 | if len(A) == 1: 384 | n2 = tree & list(A)[0] 385 | else: 386 | n2 = tree.get_common_ancestor(A) 387 | # cluster A could straddle root 388 | if n2 == tree: 389 | if len(B) == 1: 390 | n2 = tree & list(B)[0] 391 | else: 392 | n2 = tree.get_common_ancestor(B) 393 | if n2 != tree: tree.set_outgroup(n2) 394 | l, r = tree.children 395 | p = P_clade(l, probs_clades, p_cond) * Pc_root(l, p_cond) * P_clade(r, probs_clades, p_cond) 396 | total += p 397 | p_final[A] = p 398 | # print(p, A, 'final') 399 | p_final = {k:(v/total) for k, v in p_final.items()} 400 | return p_final 401 | 402 | def GetTerminalRates(allSpecies, clades, supported_clusters_counter): 403 | nAntiTerm = len(allSpecies) - 1 404 | termDups = sorted([v for k,v in supported_clusters_counter.items() if len(k) == nAntiTerm], reverse=True) 405 | nSp = float(len(allSpecies)) 406 | ntp = sum(termDups[:2]) / 2. 407 | nfp = sum(termDups) 408 | rtp = (ntp + 1) 409 | rfp = (nfp + 1)/nSp 410 | # print("Terminal inward rate = %f" % rtp) 411 | # print("Terminal inward false-positive rate = %f" % rfp) 412 | return rtp, rfp, 413 | 414 | def GetAlpha(allSpecies, clades, supported_clusters_counter): 415 | nSp = len(allSpecies) 416 | nTerminal = sum([v for k,v in supported_clusters_counter.items() if (len(k) == 1 or len(k) == nSp-1)]) 417 | nNonTerminal = sum([v for k,v in supported_clusters_counter.items() if not (len(k) == 1 or len(k) == nSp-1)]) 418 | contradictions_term = dict() 419 | contradictions_nonterm = dict() 420 | for clade in clades: 421 | clade_p = allSpecies.difference(clade) 422 | against_term = 0 423 | against_nonterm = 0 424 | for observed, n in supported_clusters_counter.items(): 425 | if (not observed.issubset(clade)) and (not observed.issubset(clade_p)): 426 | if len(observed) == 1 or len(allSpecies.difference(observed)) == 1: 427 | against_term += n 428 | else: 429 | 
against_nonterm += n 430 | contradictions_term[clade] = against_term 431 | contradictions_nonterm[clade] = against_nonterm 432 | m = min([contradictions_nonterm[k] + contradictions_term[k] for k in clades]) 433 | n = sum(supported_clusters_counter.values()) 434 | # nSupport = n-m 435 | nFP_term = [] 436 | nFP_nonterm = [] 437 | for clade in clades: 438 | if contradictions_nonterm[clade] + contradictions_term[clade] == m: 439 | nFP_nonterm.append(contradictions_nonterm[clade]) 440 | nFP_term.append(contradictions_term[clade]) 441 | nFP_mean = np.mean(nFP_nonterm) 442 | alpha_nonTerm = float(nFP_mean + 1.) / float(nNonTerminal - nFP_mean + 1.) 443 | nFP_mean = np.mean(nFP_term) 444 | alpha_term = float(np.mean(nFP_term) + 1.) / float(nTerminal - nFP_mean) 445 | # print("Observed %d duplications. %d support the best root(s) and %d contradict them." % (nTerminal + nNonTerminal, nSupport, nTerminal + nNonTerminal - nSupport)) 446 | # nNonTrivial = sum([v for k,v in supported_clusters_counter.items() if len(k) != 1]) 447 | # print("%d non-trivial duplications" % nNonTrivial) 448 | # print("alpha (non-terminal) = %f" % alpha_nonTerm) 449 | # print("alpha (terminal) = %f" % alpha_term) 450 | return alpha_nonTerm, alpha_term 451 | 452 | 453 | def GetProbabilities(species_tree, allSpecies, clades, supported_clusters_counter): 454 | biparts = get_bipartitions(species_tree) 455 | alpha_nonTerm, alpha_term = GetAlpha(allSpecies, clades, supported_clusters_counter) 456 | tp_term, fp_term = GetTerminalRates(allSpecies, clades, supported_clusters_counter) 457 | probModel = PoissonModel_WithTeminalModel(alpha_nonTerm/10, tp_term, fp_term) 458 | # probModel = PoissonModel_IntergrateBranchLenthsSumFP(alpha_nonTerm/10., alpha_term/10., qExcludeTerminals = True) 459 | p_cond = GetBranchProbs(supported_clusters_counter, biparts, probModel) 460 | return GetFinalProbs(biparts, p_cond, species_tree) 461 | 462 | 
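
The extended-logarithm helpers used throughout probroot.py (`eln`, `elnsum`, `elndiv`, with `None` standing for log(0)) can be tried in isolation. The following is a minimal sketch rather than the module's own code: it uses the standard `math` module instead of NumPy and `math.log1p` in place of `eln(1 + np.exp(...))`, and the two log-probabilities are made-up values chosen so that the plain probabilities underflow to zero.

```python
import math

def eln(x):
    """Extended log: None stands for log(0)."""
    return None if x == 0.0 else math.log(x)

def elnsum(elnx, elny):
    """Return log(x + y) given log(x) and log(y), staying in log space."""
    if elnx is None or elny is None:
        return elny if elnx is None else elnx
    hi, lo = max(elnx, elny), min(elnx, elny)
    return hi + math.log1p(math.exp(lo - hi))

# Two log-probabilities whose plain values underflow to 0.0 as floats.
ln_a, ln_b = -1000.0, -1001.0
ln_total = elnsum(ln_a, ln_b)

# Normalised shares, following the exp(ln_x - ln_total) pattern used in Ps_o_G_d.
share_a = math.exp(ln_a - ln_total)
share_b = math.exp(ln_b - ln_total)
print(share_a, share_b)
```

Working in log space keeps the ratio between the two terms even though `math.exp(-1000.0)` itself evaluates to zero, which is why the orientation probabilities are normalised before being exponentiated.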
--------------------------------------------------------------------------------
/stride/stride.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # Copyright 2017 David Emms
5 | #
6 | # This program (OrthoFinder) is distributed under the terms of the GNU General Public License v3
7 | #
8 | # This program is free software: you can redistribute it and/or modify
9 | # it under the terms of the GNU General Public License as published by
10 | # the Free Software Foundation, either version 3 of the License, or
11 | # (at your option) any later version.
12 | #
13 | # This program is distributed in the hope that it will be useful,
14 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
15 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16 | # GNU General Public License for more details.
17 | #
18 | # You should have received a copy of the GNU General Public License
19 | # along with this program. If not, see <http://www.gnu.org/licenses/>.
20 | #
21 | # When publishing work that uses OrthoFinder please cite:
22 | # Emms, D.M. and Kelly, S. 
(2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically 23 | # improves orthogroup inference accuracy, Genome Biology 16:157 24 | # 25 | # For any enquiries send an email to David Emms 26 | # david_emms@hotmail.com 27 | 28 | import os 29 | import sys 30 | import csv 31 | import glob 32 | #import shelve 33 | import datetime 34 | import argparse 35 | import itertools 36 | import multiprocessing as mp 37 | from collections import Counter, defaultdict 38 | 39 | import probroot 40 | 41 | qete = False 42 | try: 43 | import ete3 as ete 44 | qete = True 45 | except ImportError: 46 | try: 47 | import ete2 as ete 48 | qete = True 49 | except ImportError: 50 | pass 51 | 52 | def compare(exp, act): 53 | """exp - expected set of species 54 | act - actual set of species 55 | """ 56 | return act.issubset(exp) 57 | 58 | criteria = "crit_all" 59 | 60 | nProcs = mp.cpu_count() 61 | spTreeFormat = 1 62 | 63 | tooLarge = frozenset(["Too many genes"]) 64 | 65 | class Node(object): 66 | """ The class allows the user to get the 'child' nodes in any of the three directions 67 | rather than just for the two 'child' nodes in the tree model 68 | """ 69 | def __init__(self, node): 70 | """ 71 | node - tree node 72 | """ 73 | self.node = node 74 | 75 | def get_child_species_clades(self, c): 76 | nodes = [c] if c.is_leaf() else c.get_children() # what if non-binary? 
77 | return [ch.sp_down for ch in nodes] 78 | 79 | def get_grandchild_species_clades(self, c): 80 | if c.is_root(): raise Exception 81 | if c.is_leaf(): 82 | return [[c.sp_down]] 83 | else: 84 | return [self.get_child_species_clades(ch) for ch in c.get_children()] 85 | 86 | def get_up_grand_species_clades(self): 87 | """ 88 | Exceptions: 89 | - if tree is non-binary 90 | """ 91 | if self.node.is_root(): 92 | children = self.node.get_children() 93 | if len(children) != 3: raise Exception("Expected binary tree") 94 | raise Exception("Don't know which of the 3 to return") 95 | # should it be self.get_grandchild_species_clades(children[-1]) 96 | return [self.get_grandchild_species_clades(c) for c in children] 97 | ancestors = self.node.get_ancestors() # ordered list starting with parent and progressing upwards 98 | c = ancestors[0] 99 | # must be at least one ancestor 100 | ch_temp = c.get_children() 101 | if c.is_root() and len(ch_temp) == 3: 102 | # both easy 103 | c1, c2 = ch_temp[:2] if ch_temp[2] == self.node else ch_temp[1:3] if ch_temp[0] == self.node else (ch_temp[0], ch_temp[2]) 104 | c1clades = self.get_child_species_clades(c1) 105 | c2clades = self.get_child_species_clades(c2) 106 | return [c1clades, c2clades] 107 | elif len(ch_temp) == 2: 108 | # easy one 109 | c1 = ch_temp[0] if ch_temp[0] != self.node else ch_temp[1] 110 | c1clades = self.get_child_species_clades(c1) 111 | # difficult one 112 | if len(ancestors) < 2: raise Exception("Expected binary tree") 113 | c2 = ancestors[1] 114 | ch_temp = c2.get_children() 115 | if c2.is_root() and len(ch_temp) == 3: 116 | # easy - both are children 117 | c21, c22 = ch_temp[:2] if ch_temp[2] == c else ch_temp[1:3] if ch_temp[0] == c else (ch_temp[0], ch_temp[2]) 118 | c21clade = c21.sp_down 119 | c22clade = c22.sp_down 120 | elif (not c2.is_root()) and len(ch_temp) == 2: 121 | c21 = ch_temp[0] if ch_temp[0] != c else ch_temp[1] 122 | c21clade = c21.sp_down 123 | c22clade = c2.sp_up # c21clade = everything not in 
the others, work out at the end
124 |             else:
125 |                 raise Exception("Expected binary tree")
126 |             return [c1clades, [c21clade, c22clade]]
127 |         else:
128 |             raise Exception("Expected binary tree")
129 | 
130 |     def get_up_genes(self, nMax):
131 |         if self.node.is_root():
132 |             raise Exception("Error in duplicate gene identification, no 'up' node.")
133 |         nUp = len(self.node.get_tree_root()) - len(self.node)
134 |         if nUp > nMax: return tooLarge
135 |         else:
136 |             return frozenset(self.node.get_tree_root().get_leaf_names()).difference(frozenset(self.node.get_leaf_names()))
137 | 
138 |     def get_gene_sets(self, i, j, nMax=2000):
139 |         """
140 |         Mirrors get_grandrelative_clades_stored.
141 |         i - 0,1,2: set of clades with respect to order in get_grandrelative_clades_stored
142 |         j - 0,1,2: set of clades with respect to order in get_grandrelative_clades_stored
143 | 
144 |         Returns the actual genes in the clades
145 |         """
146 |         if self.node.is_root():
147 |             children = self.node.get_children()
148 |             if len(children) != 3: return None
149 |             return (tooLarge if len(children[i]) > nMax else frozenset(children[i].get_leaf_names()), tooLarge if len(children[j]) > nMax else frozenset(children[j].get_leaf_names()))
150 |         else:
151 |             children = self.node.get_children()
152 |             if len(children) != 2: return frozenset([]), frozenset([])
153 |             if i <= 1:
154 |                 iGenes = frozenset(children[i].get_leaf_names()) if len(children[i]) < nMax else tooLarge
155 |             else:
156 |                 iGenes = frozenset(self.get_up_genes(nMax))
157 |             if j <=1:
158 |                 jGenes = frozenset(children[j].get_leaf_names()) if len(children[j]) < nMax else tooLarge
159 |             else:
160 |                 jGenes = frozenset(self.get_up_genes(nMax))
161 |             return iGenes, jGenes
162 | 
163 |     def get_grandrelative_clades_stored(self):
164 |         """
165 |         returns the hierarchically stored sets of species
166 |         """
167 |         if self.node.is_root():
168 |             children = self.node.get_children()
169 |             if len(children) != 3: return None
170 |             return 
[self.get_grandchild_species_clades(c) for c in children] 171 | else: 172 | children = self.node.get_children() 173 | if len(children) != 2: return None 174 | down_clades = [self.get_grandchild_species_clades(c) for c in children] 175 | return down_clades + [self.get_up_grand_species_clades()] 176 | 177 | def GetSpeciesSets(self, allTaxa, GeneMap): 178 | if self.node.is_root(): 179 | na, nb, nc = self.node.get_children() 180 | a = na.get_leaf_names() 181 | b = nb.get_leaf_names() 182 | c = nc.get_leaf_names() 183 | else: 184 | ch = self.node.get_children() 185 | if len(ch) != 2: return None 186 | na, nb = ch 187 | a = set(na.get_leaf_names()) 188 | b = set(nb.get_leaf_names()) 189 | # now get node with set of species, c, below it 190 | c = allTaxa.difference(a).difference(b) 191 | return [set(map(GeneMap, x)) for x in (a,b,c)] 192 | 193 | @staticmethod 194 | def ToSpecies(clades_of_clades, GeneMap): 195 | return [[set(map(GeneMap, [gene for gene in grandchild])) for grandchild in child] for child in clades_of_clades] 196 | 197 | @staticmethod 198 | def FlatenSpeciesSet(clades_of_clades): 199 | return set([sp for child in clades_of_clades for grandchild in child for sp in grandchild]) 200 | 201 | 202 | def get_partitions(tree): 203 | all_leaves = frozenset(tree.get_leaf_names()) 204 | all_partitions = set([all_leaves]) 205 | for n in tree.iter_descendants(): 206 | p1 = frozenset(n.get_leaf_names()) 207 | p2 = frozenset(all_leaves - p1) 208 | all_partitions.add(p1) 209 | all_partitions.add(p2) 210 | return all_partitions 211 | 212 | def StoreSpeciesSets(t, GeneMap, allTaxa): 213 | for node in t.traverse('postorder'): 214 | if node.is_leaf(): 215 | node.add_feature('sp_down', {GeneMap(node.name)}) 216 | elif node.is_root(): 217 | continue 218 | else: 219 | node.add_feature('sp_down', set.union(*[ch.sp_down for ch in node.get_children()])) 220 | for node in t.traverse('preorder'): 221 | if node.is_root(): 222 | node.add_feature('sp_up', set()) 223 | else: 224 | parent 
= node.up
            if parent.is_root():
                others = [ch for ch in parent.get_children() if ch != node]
                node.add_feature('sp_up', set.union(*[other.sp_down for other in others]))
            else:
                others = [ch for ch in parent.get_children() if ch != node]
                sp_downs = set.union(*[other.sp_down for other in others])
                node.add_feature('sp_up', parent.sp_up.union(sp_downs))

def SaveTree(tree, root_clade, cladeName, treeName, iExample):
    fn = outputDir + "/%s_%s_%d_%d.tre" % (cladeName, os.path.split(treeName)[1].split(".")[0], iExample, len(root_clade))
    t = RootAtClade(tree, root_clade)
    t.write(outfile = fn)
    print(fn)

def StoreGeneSets(t):
    for node in t.traverse('postorder'):
        if node.is_leaf():
            node.add_feature('g_down', [node.name])
        elif node.is_root():
            continue
        else:
            node.add_feature('g_down', [g for ch in node.get_children() for g in ch.g_down])
    for node in t.traverse('preorder'):
        if node.is_root():
            node.add_feature('g_up', [])
        else:
            parent = node.up
            if parent.is_root():
                others = [ch for ch in parent.get_children() if ch != node]
                x = [g for other in others for g in other.g_down]
                node.add_feature('g_up', x)
            else:
                others = [ch for ch in parent.get_children() if ch != node]
                g_downs = [g for other in others for g in other.g_down]
                x = parent.g_up + g_downs
                node.add_feature('g_up', x)

def GetStoredSpeciesSets(node):
    children = node.get_children()
    if node.is_root():
        if len(children) != 3: return None
        return [ch.sp_down for ch in children]
    else:
        if len(children) != 2: return None
        return [ch.sp_down for ch in children] + [node.sp_up]

def GetStoredGeneSets(node):
    children = node.get_children()
    if node.is_root():
        if len(children) != 3: return None
        return [ch.g_down for ch in children]
    else:
        if len(children) != 2: return None
        return [ch.g_down for ch in children] + [node.g_up]

def GeneToSpecies_dash(g):
    return g.split("_", 1)[0]

def GeneToSpecies_secondDash(g):
    return "_".join(g.split("_", 2)[:2])

def GeneToSpecies_3rdDash(g):
    return "_".join(g.split("_", 3)[:3])

def GeneToSpecies_dot(g):
    return g.split(".", 1)[0]

def GeneToSpecies_hyphen(g):
    return g.split("-", 1)[0]

def LocalCheck_clades(clade1, clade2, expClades, GeneToSpecies):
    """Expected clades are now in tree structure going down two levels: [[A,B], [C,D]]
    Observed clades should match up with expected clades in terms of containing no unexpected species
    """
    X = set.union(*expClades[0])
    Y = set.union(*expClades[1])
    x0, x1 = expClades[0] if len(expClades[0]) == 2 else (None, None) if len(expClades[0]) == 1 else (Exception, Exception)
    y0, y1 = expClades[1] if len(expClades[1]) == 2 else (None, None) if len(expClades[1]) == 1 else (Exception, Exception)
    if x0 == Exception or y0 == Exception:
        return False # Can't get a well-supported duplication meeting topology criteria if tree is not fully resolved
    for actClades in [clade1, clade2]:
        iUsed = None
        for clade in actClades:
            if len(clade) == 1:
                c = clade[0]
                if compare(X, c) and iUsed != 0:
                    iUsed = 0
                    continue
                elif compare(Y, c) and iUsed != 1:
                    iUsed = 1
                    continue
                else:
                    return False
            else:
                c0 = clade[0]
                c1 = clade[1]
                if (iUsed != 0 and x0 != None) and ((compare(x0, c0) and compare(x1, c1)) or (compare(x1, c0) and compare(x0, c1))):
                    iUsed = 0
                    continue
                elif iUsed != 0 and (compare(X, c0) and compare(X, c1)):
                    iUsed = 0
                    continue
                elif (iUsed != 1 and y0 != None) and ((compare(y0, c0) and compare(y1, c1)) or (compare(y1, c0) and compare(y0, c1))):
                    iUsed = 1
                    continue
                elif iUsed != 1 and (compare(Y, c0) and compare(Y, c1)):
                    iUsed = 1
                    continue
                else:
                    return False
    return True

def Join(clades):
    return set([s for child in clades for grandchild in child for s in grandchild])

def SupportedHierachies(t, G, S, GeneToSpecies, species, dict_clades, clade_names, treeName, qWriteDupTrees=False):
    """
    Only get the species sets in the first instance, then work out the clades as and when they are needed
    """
    qAncient = False
    supported = defaultdict(int)
    genesPostDup = set()
    # Pre-calculate species sets on tree: traverse tree from leaves inwards
    StoreSpeciesSets(t, GeneToSpecies, G)
    if qWriteDupTrees:
        t_write = None
    StoreGeneSets(t)
    for n in t.traverse():
        if n.is_root(): continue
    iExample = 0
    for counter, n in enumerate(t.traverse()):
        if n.is_leaf(): continue
        # just get the list of putative descendant species at this point without worrying about grandchild clades
        spSets = GetStoredSpeciesSets(n)
        if spSets == None: continue # non-binary
        clades = None
        # check each of three directions to the root
        for i, j in itertools.combinations(range(3), 2):
            s1 = spSets[i]
            s2 = spSets[j]
            # check for terminal duplications
            if len(s1) == 1 and s1 == s2:
                supported[frozenset(s1)] += 1
            if min(len(s1), len(s2)) < 2: continue
            k1 = frozenset(species)
            k2 = frozenset(species)
            for kk in dict_clades:
                if s1.issubset(kk) and len(kk) < len(k1): k1 = kk
                if s2.issubset(kk) and len(kk) < len(k2): k2 = kk
            edges1 = set([kk for kk in dict_clades if s1.issubset(kk) and len(kk) == len(k1)])
            edges2 = set([kk for kk in dict_clades if s2.issubset(kk) and len(kk) == len(k2)])
            for k1 in edges1.intersection(edges2):
                if len(k1) == 1:
                    # terminal duplications dealt with above (since no extra info about species tree
                    # topology is needed)
                    break
                elif k1 == species:
                    # putative ancient duplication
                    if not qAncient:
                        #print(treeName)
                        qAncient = True
                    break
                elif all([not clade.isdisjoint(s1) for clade0 in dict_clades[k1] for clade in clade0]) and all([not clade.isdisjoint(s2) for clade0 in dict_clades[k1] for clade in clade0]):
                    # Passed the check that the required species are present but at this point don't know where in the tree
                    # Get grandchild clades as required (could still avoid the more costly up clade if it's not required)
                    if clades == None:
                        N = Node(n)
                        clades = N.get_grandrelative_clades_stored()
                        if Join(clades[0]) != spSets[0] or Join(clades[1]) != spSets[1] or Join(clades[2]) != spSets[2]:
                            print(clades[0])
                            print(spSets[0])
                            print("")
                            print(clades[1])
                            print(spSets[1])
                            print("")
                            print(clades[2])
                            print(spSets[2])
                            print("")
                            raise Exception("Mismatch")
                    if clades == None: break # locally non-binary in vicinity of node, skip to next node
                    if not LocalCheck_clades(clades[i], clades[j], dict_clades[k1], GeneToSpecies): continue
                    supported[frozenset(k1)] +=1
                    genes = N.get_gene_sets(i, j)
                    genesPostDup.add(genes[0].union(genes[1]))
                    if qWriteDupTrees:
                        if t_write == None:
                            try:
                                t_write = t.copy()
                            except:
                                t_write = ete.Tree(treeName, format=1)
                        ii = 0 if (0!= i and 0!=j) else 1 if (1!=i and 1!=j) else 2
                        gSets = GetStoredGeneSets(n)
                        SaveTree(t_write, gSets[ii], clade_names[k1], treeName, iExample)
                        iExample += 1
    return supported, genesPostDup

"""
Parallelisation wrappers
================================================================================================================================
"""

def SupportedHierachies_wrapper(treeName, GeneToSpecies, species, dict_clades, clade_names, qWriteDupTrees=False):
    if
not os.path.exists(treeName): return [], []
    t = ete.Tree(treeName, format=1)
    G = set(t.get_leaf_names())
    S = set(map(GeneToSpecies, G))
    if not S.issubset(species):
        print("ERROR in %s" % treeName)
        print("Some genes cannot be mapped to species in the species tree")
        print(S.difference(species))
        # Callers unpack two values, so return a pair; a bare None would raise a TypeError on unpacking
        return None, None
    if len(S) < 3:
        return defaultdict(int), []
    result = SupportedHierachies(t, G, S, GeneToSpecies, species, dict_clades, clade_names, treeName, qWriteDupTrees)
    return result

def SupportedHierachies_wrapper2(args):
    return SupportedHierachies_wrapper(*args)

"""
End of Parallelisation wrappers
================================================================================================================================
"""

def AnalyseSpeciesTree(speciesTree):
    species = frozenset(speciesTree.get_leaf_names())
    parts = list(get_partitions(speciesTree))
    nSpecies = len(species)
    dict_clades = dict() # dictionary of clades we require evidence of duplicates from for each partition
    clade_names = dict()
    for p in parts:
        if len(p) == nSpecies: continue
        speciesTree.set_outgroup(list(species.difference(p))[0])
        if len(p) == 1:
            # no use for identifying clades
            continue
        n = speciesTree.get_common_ancestor(p)
        clade_names[p] = n.name + "_" + str(hash("".join(p)))[-8:]
        # go down two steps
        ch = n.get_children()
        ch0 = [ch[0]] if ch[0].is_leaf() else ch[0].get_children()
        ch1 = [ch[1]] if ch[1].is_leaf() else ch[1].get_children()
        dict_clades[p] = [[set(c.get_leaf_names()) for c in ch0], [set(c.get_leaf_names()) for c in ch1]]
    return species, dict_clades, clade_names

def RootAtClade(t, accs_in_clade):
    if len(accs_in_clade) == 1:
        t.set_outgroup(list(accs_in_clade)[0])
        return t
    accs = set(t.get_leaf_names())
    dummy =
list(accs.difference(accs_in_clade))[0]
    t.set_outgroup(dummy)
    node = t.get_common_ancestor(accs_in_clade)
    t.set_outgroup(node)
    return t

def ParsimonyRoot(allSpecies, clades, supported_clusters_counter):
    contradictions = dict()
    for clade in clades:
        clade_p = allSpecies.difference(clade)
        against = 0
        for observed, n in supported_clusters_counter.items():
            if (not observed.issubset(clade)) and (not observed.issubset(clade_p)):
                against += n
        contradictions[clade] = against
    m = min(contradictions.values())
    n = sum(supported_clusters_counter.values())
    nSupport = n-m
    roots = []
    for clade, score in contradictions.items():
        if score == m:
            if len(clade) > len(allSpecies)/2:
                roots.append(allSpecies.difference(clade))
            elif len(clade) == len(allSpecies)/ 2:
                if allSpecies.difference(clade) not in roots: roots.append(clade)
            else:
                roots.append(clade)
    return roots, nSupport

def GetRoot(speciesTreeFN, treesDir, GeneToSpeciesMap, nProcessors, qWriteDupTrees=False, qWriteRootedTree=False):
    """
    ******* The Main method *******
    """
    qHaveBranchSupport = False
    try:
        speciesTree = ete.Tree(speciesTreeFN, format=2)
        qHaveBranchSupport = True
    except:
        speciesTree = ete.Tree(speciesTreeFN, format=1)
    species, dict_clades, clade_names = AnalyseSpeciesTree(speciesTree)
    pool = mp.Pool(nProcessors, maxtasksperchild=1)
    list_of_dicts = pool.map(SupportedHierachies_wrapper2, [(fn, GeneToSpeciesMap, species, dict_clades, clade_names, qWriteDupTrees) for fn in glob.glob(treesDir + "/*")])
    clusters = Counter()
    all_stride_dup_genes = set()
    for l, stride_dup_genes in list_of_dicts:
        if l == None:
            sys.exit()
        clusters.update(l)
        all_stride_dup_genes.update(stride_dup_genes)
    roots, nSupport = ParsimonyRoot(species,
        list(dict_clades.keys()), clusters)
    roots = list(set(roots))
    speciesTrees_rootedFNs =[]
    # Get distance of each from a supported clade
    topDist = []
    branchDist = []
    if len(clusters) > 0 and len(roots) > 1:
        # Evaluate which is the best one
        for r in roots:
            speciesTree = RootAtClade(speciesTree, r)
            topDist.append(min([speciesTree.get_distance(n, topology_only=True) for n in speciesTree.traverse('preorder') if frozenset(n.get_leaf_names()) in clusters]))
            branchDist.append(min([speciesTree.get_distance(n) for n in speciesTree.traverse('preorder') if (frozenset(n.get_leaf_names()) in clusters and speciesTree.get_distance(n, topology_only=True) == topDist[-1])]))
        maxTopDist = max(topDist)
        bestDist = -1
        for i, (t, d) in enumerate(zip(topDist, branchDist)):
            if t == maxTopDist and d > bestDist:
                bestDist = d
                iRootToUse = i
        rootToUse = roots.pop(iRootToUse)
        roots = [rootToUse] + roots

    if qWriteRootedTree:
        for i, r in enumerate(roots):
            speciesTree = RootAtClade(speciesTree, r)
            speciesTree_rootedFN = os.path.splitext(speciesTreeFN)[0] + "_%d_rooted.txt" % i
            # speciesTree = LabelNodes()
            speciesTree.write(outfile=speciesTree_rootedFN, format = 2 if qHaveBranchSupport else 1)
            speciesTrees_rootedFNs.append(speciesTree_rootedFN)
    return roots, clusters, speciesTrees_rootedFNs, nSupport, list(dict_clades.keys()), species, all_stride_dup_genes

def PrintRootingSummary(roots, clusters_counter, nSupport):
    nAll = sum(clusters_counter.values())
    nFP_mp = nAll - nSupport
    n_non_trivial = sum([v for k, v in clusters_counter.items() if len(k) > 1])
    if len(roots) > 1: print("Identified %d non-terminal duplications.\n%d support the best roots and %d contradict them."
        % (n_non_trivial, n_non_trivial-nFP_mp, nFP_mp))
    else: print("Identified %d non-terminal duplications.\n%d support the best root and %d contradict it." % (n_non_trivial, n_non_trivial-nFP_mp, nFP_mp))
    print("Most parsimonious outgroup(s) for species tree:")
    for r in roots[:5]:
        print("{" + ", ".join(r) + "}")
    if len(roots) > 5:
        print("Etc...")
        print("%d possible roots" % len(roots))
    return nFP_mp, n_non_trivial

def GetDirectoryName(baseDirName, i):
    if i == 0:
        return baseDirName + os.sep
    else:
        return baseDirName + ("_%d" % i) + os.sep

def CreateNewWorkingDirectory(baseDirectoryName):
    dateStr = datetime.date.today().strftime("%b%d")
    iAppend = 0
    newDirectoryName = GetDirectoryName(baseDirectoryName + dateStr, iAppend)
    while os.path.exists(newDirectoryName):
        iAppend += 1
        newDirectoryName = GetDirectoryName(baseDirectoryName + dateStr, iAppend)
    os.mkdir(newDirectoryName)
    return newDirectoryName

def GetCluseterName(species_tree, S, cluster):
    if len(cluster) == 1: return list(cluster)[0]
    else:
        n = species_tree.get_common_ancestor(cluster)
        if n == species_tree or (n.up == species_tree and len(S.difference(cluster)) < len(cluster)):
            complement = S.difference(cluster) # complement
            # print(complement)
            if len(complement) == 1:
                # continue
                n = species_tree & list(complement)[0]
            else:
                n = species_tree.get_common_ancestor(complement)
        return n.name + "_" + list(cluster)[0]

def WriteResults(species_tree_fn_or_text, roots, S, clades, clusters_counter, output_dir):
    # for c in clusters_counter:
    #     print((clusters_counter[c], c))
    print("\nResults written to:\n" + os.path.realpath(output_dir))
    # Label species tree nodes
    try:
        species_tree = ete.Tree(species_tree_fn_or_text, format=2)
    except:
        species_tree =
ete.Tree(species_tree_fn_or_text, format=1)
    thisRoot = roots[0]
    species_tree = RootAtClade(species_tree, thisRoot)
    iNode = 0
    for n in species_tree.traverse():
        if not n.is_leaf():
            n.name = "N%d" % iNode
            iNode+=1
    species_tree.write(outfile=output_dir + "Species_tree_labelled.tre", format=1)
    # print(species_tree)
    # species_tree = ete.Tree(output_dir + "Species_tree_labelled.tre", format=1)
    # Calculate probabilities
    qBinary = True
    for n in species_tree.traverse():
        if len(n.get_children()) > 2:
            qBinary = False
    if qBinary:
        p_final = probroot.GetProbabilities(species_tree, S, clades, clusters_counter)
    else:
        print("Probability distribution for root location is not supported for non-binary trees")
        print("To get a probability distribution for the root, please supply a fully resolved input species tree")
    # Write numbers of duplications
    table = dict()
    new_tree = ete.Tree(output_dir + "Species_tree_labelled.tre", format=1)
    for clade in clades + [frozenset([s]) for s in S]:
        qAnti = False
        anticlade = S.difference(clade)
        if len(clade) == 1:
            node = new_tree & list(clade)[0]
        else:
            node = new_tree.get_common_ancestor(clade)
            if node == new_tree:
                node = new_tree.get_common_ancestor(anticlade)
                qAnti = True
        x = anticlade if qAnti else clade
        y = clade if qAnti else anticlade
        X = ("(%d)" % clusters_counter[x]) if len(clade) == 1 else clusters_counter[x]
        if qBinary:
            p = p_final[clade] if clade in p_final else p_final[anticlade]
        else:
            p = 0.
        table[node.name] = [node.name, "X" if (clade in roots or anticlade in roots) else "", "%0.1f%%" % (100.*p) , X, clusters_counter[y]]
    with open(output_dir + "Duplication_counts.csv", 'wb') as outfile:
        writer = csv.writer(outfile, delimiter="\t")
        writer.writerow(["Branch", "MP Root", "Probability", "Duplications supporting clade", "Duplications supporting opposite clade"])
        qSingle = len(thisRoot) == 1
        root_branches = [n.name for n in new_tree.get_children()]
        writer.writerow([root_branches[0] + " (& " + root_branches[1] + ")"] + table[root_branches[0]][1:])
        for i in range(2 if qSingle else 3, iNode):
            name = "N%d" % i
            if name in table:
                writer.writerow(table[name])
            else:
                print("Skipping %s" % name)
        for sp in S:
            if sp in table:
                if qSingle and sp in thisRoot: continue
                writer.writerow(table[sp])

def Main_Full(args):
    text = """
****************************************************************************
*  STRIDE: Species Tree Root Inference from Gene Duplication Events        *
*                                                                          *
****************************************************************************"""
    # text = "STRIDE: Species Tree Root Inference from Gene Duplication Events"
    print(text[1:])
    # print(text + "\n" + "="*len(text))
    GeneToSpecies = GeneToSpecies_dash
    if args.separator and args.separator == "dot":
        GeneToSpecies = GeneToSpecies_dot
    elif args.separator and args.separator == "second_dash":
        GeneToSpecies = GeneToSpecies_secondDash
    elif args.separator and args.separator == "3rd_dash":
        GeneToSpecies = GeneToSpecies_3rdDash
    elif args.separator and args.separator == "hyphen":
        GeneToSpecies = GeneToSpecies_hyphen

    if not args.directory:
        speciesTree = ete.Tree(args.Species_tree, format=spTreeFormat)
        species, dict_clades, clade_names = AnalyseSpeciesTree(speciesTree)
        c,
stride_dup_genes = SupportedHierachies_wrapper(args.gene_trees, GeneToSpecies, species, dict_clades, clade_names)
        for k, v in c.items(): print((k, v))
    elif args.debug:
        speciesTree = ete.Tree(args.Species_tree, format=spTreeFormat)
        species, dict_clades, clade_names = AnalyseSpeciesTree(speciesTree)
        clusters_counter = Counter()
        for fn in glob.glob(args.gene_trees + "/*"):
            c, stride_dup_genes = SupportedHierachies_wrapper(fn, GeneToSpecies, species, dict_clades, clade_names)
            clusters_counter.update(c)
        roots, nSupport = ParsimonyRoot(species, dict_clades.keys(), clusters_counter)
        PrintRootingSummary(roots, clusters_counter, nSupport)
    else:
        nTrees = len(glob.glob(args.gene_trees + "/*"))
        if nTrees == 0:
            print("No trees found in %s\nExiting" % args.gene_trees)
            sys.exit()
        print("Analysing %d gene trees" % nTrees)
        # roots, clusters_counter, _, nSupport, clades, species = GetRoot(args.Species_tree, args.gene_trees, GeneToSpecies, nProcs, treeFmt = 1, qWriteDupTrees=args.output)
        roots, clusters_counter, _, nSupport, clades, species, all_stride_dup_genes = GetRoot(args.Species_tree, args.gene_trees, GeneToSpecies, nProcs)
        PrintRootingSummary(roots, clusters_counter, nSupport)
        outputDir = CreateNewWorkingDirectory(args.gene_trees + "/../STRIDE_Results")
        # shelveFN = outputDir + "STRIDE_data.shv"
        # d = shelve.open(shelveFN)
        # d['roots'] = roots
        # d['clusters_counter'] = clusters_counter
        # d['species'] = species
        # d['nSupport'] = nSupport
        # d['SpeciesTreeFN'] = os.path.abspath(args.Species_tree)
        # with open(args.Species_tree, 'rb') as infile:
        #     tree_text = "".join([l.rstrip() for l in infile.readlines()])
        # d['SpeciesTreeText'] = tree_text
        # d['TreesDir'] = os.path.abspath(args.gene_trees)
        # d['clades'] = clades
        # d.close()
        # outputFigFN = outputFN_base + ".pdf"
        #DrawDuplicationsTree(args.Species_tree, clusters_counter, outputFigFN)
        WriteResults(args.Species_tree, roots, species, clades, clusters_counter, outputDir)
        print("")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("gene_trees", help = "Directory containing gene trees (with -d argument) or filename of a single gene tree to analyse (no -d argument)")
    parser.add_argument("-s", "--separator", choices=("dot", "dash", "second_dash", "3rd_dash", "hyphen"), help="Separator between species name and gene name in gene tree taxa")
    parser.add_argument("-S", "--Species_tree", help="Unrooted species tree in newick format")
    parser.add_argument("-d", "--directory", action="store_true", help="Process all trees in input directory")
    parser.add_argument("--debug", action="store_true", help="Run in serial to enable easier debugging")
    # parser.add_argument("-o", "--output", action="store_true", help="Write out gene trees rooted at duplications")
    parser.set_defaults(Func=Main_Full)
    args = parser.parse_args()
    # if args.output:
    #     x = (args.gene_trees if args.directory else os.path.split(args.gene_trees)[0])
    #     while x[-1] == "/":
    #         x = x[:-1]
    #     outputDir = x + "_rooted_duplications/"
    #     if not os.path.exists(outputDir): os.mkdir(outputDir)
    args.Func(args)


--------------------------------------------------------------------------------
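
The parsimony scoring in `ParsimonyRoot` can be hard to follow from the code alone, so here is a minimal, self-contained sketch of the same counting idea on toy data. It is an illustration under stated assumptions, not the module's own code: the helper name `count_contradictions`, the five-species tree and the duplication counts are all hypothetical, and the tie-breaking that `ParsimonyRoot` and `GetRoot` apply afterwards is left out.

```python
from collections import Counter

def count_contradictions(all_species, candidate_clades, observed_dups):
    """Score each candidate root clade by the duplications that straddle its split.

    A duplication observed at the base of clade `observed` is consistent with a
    root branch separating `clade` from its complement only if `observed` lies
    entirely on one side of that branch.
    (Hypothetical helper, written for illustration only.)
    """
    scores = {}
    for clade in candidate_clades:
        complement = all_species - clade
        scores[clade] = sum(
            n for observed, n in observed_dups.items()
            if not observed <= clade and not observed <= complement
        )
    return scores

# Toy data (hypothetical): unrooted species tree ((A,B),C,(D,E)).
all_species = frozenset("ABCDE")
candidates = [frozenset("AB"), frozenset("CDE"), frozenset("DE"), frozenset("ABC")]
# Three duplications shared by clade {A,B}, two shared by clade {A,B,C}.
observed = Counter({frozenset("AB"): 3, frozenset("ABC"): 2})

scores = count_contradictions(all_species, candidates, observed)
best = min(scores.values())
best_clades = [c for c, s in scores.items() if s == best]
for clade in candidates:
    print(sorted(clade), scores[clade])
```

In this toy case the two sides of the {D,E} | {A,B,C} branch are the only candidates with no contradicting duplications, so that branch would be the most parsimonious root position.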