├── LICENSE
├── Makefile
├── PI
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── align.pyx
│   ├── align.so
│   ├── distance.py
│   ├── input.py
│   ├── input.pyc
│   ├── multialign.py
│   ├── output.py
│   ├── phylogeny.py
│   └── tree.py
├── README
├── docs
│   ├── PI_Toorcon.pdf
│   └── pi.pdf
├── main.py
└── setup.py

/LICENSE:
--------------------------------------------------------------------------------
GNU LESSER GENERAL PUBLIC LICENSE
Version 2.1, February 1999

Copyright (C) 1991, 1999 Free Software Foundation, Inc.
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

[This is the first released version of the Lesser GPL. It also counts
as the successor of the GNU Library Public License, version 2, hence
the version number 2.1.]

Preamble

The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
Licenses are intended to guarantee your freedom to share and change
free software--to make sure the software is free for all its users.

This license, the Lesser General Public License, applies to some
specially designated software packages--typically libraries--of the
Free Software Foundation and other authors who decide to use it. You
can use it too, but we suggest you first think carefully about whether
this license or the ordinary General Public License is the better
strategy to use in any particular case, based on the explanations below.

When we speak of free software, we are referring to freedom of use,
not price. Our General Public Licenses are designed to make sure that
you have the freedom to distribute copies of free software (and charge
for this service if you wish); that you receive source code or can get
it if you want it; that you can change the software and use pieces of
it in new free programs; and that you are informed that you can do
these things.

To protect your rights, we need to make restrictions that forbid
distributors to deny you these rights or to ask you to surrender these
rights. These restrictions translate to certain responsibilities for
you if you distribute copies of the library or if you modify it.

For example, if you distribute copies of the library, whether gratis
or for a fee, you must give the recipients all the rights that we gave
you. You must make sure that they, too, receive or can get the source
code. If you link other code with the library, you must provide
complete object files to the recipients, so that they can relink them
with the library after making changes to the library and recompiling
it. And you must show them these terms so they know their rights.

We protect your rights with a two-step method: (1) we copyright the
library, and (2) we offer you this license, which gives you legal
permission to copy, distribute and/or modify the library.

To protect each distributor, we want to make it very clear that
there is no warranty for the free library. Also, if the library is
modified by someone else and passed on, the recipients should know
that what they have is not the original version, so that the original
author's reputation will not be affected by problems that might be
introduced by others.

Finally, software patents pose a constant threat to the existence of
any free program. We wish to make sure that a company cannot
effectively restrict the users of a free program by obtaining a
restrictive license from a patent holder. Therefore, we insist that
any patent license obtained for a version of the library must be
consistent with the full freedom of use specified in this license.

Most GNU software, including some libraries, is covered by the
ordinary GNU General Public License. This license, the GNU Lesser
General Public License, applies to certain designated libraries, and
is quite different from the ordinary General Public License. We use
this license for certain libraries in order to permit linking those
libraries into non-free programs.

When a program is linked with a library, whether statically or using
a shared library, the combination of the two is legally speaking a
combined work, a derivative of the original library. The ordinary
General Public License therefore permits such linking only if the
entire combination fits its criteria of freedom. The Lesser General
Public License permits more lax criteria for linking other code with
the library.

We call this license the "Lesser" General Public License because it
does Less to protect the user's freedom than the ordinary General
Public License. It also provides other free software developers Less
of an advantage over competing non-free programs. These disadvantages
are the reason we use the ordinary General Public License for many
libraries. However, the Lesser license provides advantages in certain
special circumstances.

For example, on rare occasions, there may be a special need to
encourage the widest possible use of a certain library, so that it becomes
a de-facto standard. To achieve this, non-free programs must be
allowed to use the library. A more frequent case is that a free
library does the same job as widely used non-free libraries. In this
case, there is little to gain by limiting the free library to free
software only, so we use the Lesser General Public License.

In other cases, permission to use a particular library in non-free
programs enables a greater number of people to use a large body of
free software. For example, permission to use the GNU C Library in
non-free programs enables many more people to use the whole GNU
operating system, as well as its variant, the GNU/Linux operating
system.

Although the Lesser General Public License is Less protective of the
users' freedom, it does ensure that the user of a program that is
linked with the Library has the freedom and the wherewithal to run
that program using a modified version of the Library.

The precise terms and conditions for copying, distribution and
modification follow. Pay close attention to the difference between a
"work based on the library" and a "work that uses the library". The
former contains code derived from the library, whereas the latter must
be combined with the library in order to run.

GNU LESSER GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. This License Agreement applies to any software library or other
program which contains a notice placed by the copyright holder or
other authorized party saying it may be distributed under the terms of
this Lesser General Public License (also called "this License").
Each licensee is addressed as "you".

A "library" means a collection of software functions and/or data
prepared so as to be conveniently linked with application programs
(which use some of those functions and data) to form executables.

The "Library", below, refers to any such software library or work
which has been distributed under these terms. A "work based on the
Library" means either the Library or any derivative work under
copyright law: that is to say, a work containing the Library or a
portion of it, either verbatim or with modifications and/or translated
straightforwardly into another language. (Hereinafter, translation is
included without limitation in the term "modification".)

"Source code" for a work means the preferred form of the work for
making modifications to it. For a library, complete source code means
all the source code for all modules it contains, plus any associated
interface definition files, plus the scripts used to control compilation
and installation of the library.

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running a program using the Library is not restricted, and output from
such a program is covered only if its contents constitute a work based
on the Library (independent of the use of the Library in a tool for
writing it). Whether that is true depends on what the Library does
and what the program that uses the Library does.

1. You may copy and distribute verbatim copies of the Library's
complete source code as you receive it, in any medium, provided that
you conspicuously and appropriately publish on each copy an
appropriate copyright notice and disclaimer of warranty; keep intact
all the notices that refer to this License and to the absence of any
warranty; and distribute a copy of this License along with the
Library.

You may charge a fee for the physical act of transferring a copy,
and you may at your option offer warranty protection in exchange for a
fee.

2. You may modify your copy or copies of the Library or any portion
of it, thus forming a work based on the Library, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

a) The modified work must itself be a software library.

b) You must cause the files modified to carry prominent notices
stating that you changed the files and the date of any change.

c) You must cause the whole of the work to be licensed at no
charge to all third parties under the terms of this License.

d) If a facility in the modified Library refers to a function or a
table of data to be supplied by an application program that uses
the facility, other than as an argument passed when the facility
is invoked, then you must make a good faith effort to ensure that,
in the event an application does not supply such function or
table, the facility still operates, and performs whatever part of
its purpose remains meaningful.

(For example, a function in a library to compute square roots has
a purpose that is entirely well-defined independent of the
application. Therefore, Subsection 2d requires that any
application-supplied function or table used by this function must
be optional: if the application does not supply it, the square
root function must still compute square roots.)

These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Library,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Library, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote
it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Library.

In addition, mere aggregation of another work not based on the Library
with the Library (or with a work based on the Library) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

3. You may opt to apply the terms of the ordinary GNU General Public
License instead of this License to a given copy of the Library. To do
this, you must alter all the notices that refer to this License, so
that they refer to the ordinary GNU General Public License, version 2,
instead of to this License. (If a newer version than version 2 of the
ordinary GNU General Public License has appeared, then you can specify
that version instead if you wish.) Do not make any other change in
these notices.

Once this change is made in a given copy, it is irreversible for
that copy, so the ordinary GNU General Public License applies to all
subsequent copies and derivative works made from that copy.

This option is useful when you wish to copy part of the code of
the Library into a program that is not a library.

4. You may copy and distribute the Library (or a portion or
derivative of it, under Section 2) in object code or executable form
under the terms of Sections 1 and 2 above provided that you accompany
it with the complete corresponding machine-readable source code, which
must be distributed under the terms of Sections 1 and 2 above on a
medium customarily used for software interchange.

If distribution of object code is made by offering access to copy
from a designated place, then offering equivalent access to copy the
source code from the same place satisfies the requirement to
distribute the source code, even though third parties are not
compelled to copy the source along with the object code.

5. A program that contains no derivative of any portion of the
Library, but is designed to work with the Library by being compiled or
linked with it, is called a "work that uses the Library". Such a
work, in isolation, is not a derivative work of the Library, and
therefore falls outside the scope of this License.

However, linking a "work that uses the Library" with the Library
creates an executable that is a derivative of the Library (because it
contains portions of the Library), rather than a "work that uses the
library". The executable is therefore covered by this License.
Section 6 states terms for distribution of such executables.

When a "work that uses the Library" uses material from a header file
that is part of the Library, the object code for the work may be a
derivative work of the Library even though the source code is not.
Whether this is true is especially significant if the work can be
linked without the Library, or if the work is itself a library. The
threshold for this to be true is not precisely defined by law.

If such an object file uses only numerical parameters, data
structure layouts and accessors, and small macros and small inline
functions (ten lines or less in length), then the use of the object
file is unrestricted, regardless of whether it is legally a derivative
work. (Executables containing this object code plus portions of the
Library will still fall under Section 6.)

Otherwise, if the work is a derivative of the Library, you may
distribute the object code for the work under the terms of Section 6.
Any executables containing that work also fall under Section 6,
whether or not they are linked directly with the Library itself.

6. As an exception to the Sections above, you may also combine or
link a "work that uses the Library" with the Library to produce a
work containing portions of the Library, and distribute that work
under terms of your choice, provided that the terms permit
modification of the work for the customer's own use and reverse
engineering for debugging such modifications.

You must give prominent notice with each copy of the work that the
Library is used in it and that the Library and its use are covered by
this License. You must supply a copy of this License. If the work
during execution displays copyright notices, you must include the
copyright notice for the Library among them, as well as a reference
directing the user to the copy of this License. Also, you must do one
of these things:

a) Accompany the work with the complete corresponding
machine-readable source code for the Library including whatever
changes were used in the work (which must be distributed under
Sections 1 and 2 above); and, if the work is an executable linked
with the Library, with the complete machine-readable "work that
uses the Library", as object code and/or source code, so that the
user can modify the Library and then relink to produce a modified
executable containing the modified Library. (It is understood
that the user who changes the contents of definitions files in the
Library will not necessarily be able to recompile the application
to use the modified definitions.)

b) Use a suitable shared library mechanism for linking with the
Library. A suitable mechanism is one that (1) uses at run time a
copy of the library already present on the user's computer system,
rather than copying library functions into the executable, and (2)
will operate properly with a modified version of the library, if
the user installs one, as long as the modified version is
interface-compatible with the version that the work was made with.

c) Accompany the work with a written offer, valid for at
least three years, to give the same user the materials
specified in Subsection 6a, above, for a charge no more
than the cost of performing this distribution.

d) If distribution of the work is made by offering access to copy
from a designated place, offer equivalent access to copy the above
specified materials from the same place.

e) Verify that the user has already received a copy of these
materials or that you have already sent this user a copy.

For an executable, the required form of the "work that uses the
Library" must include any data and utility programs needed for
reproducing the executable from it. However, as a special exception,
the materials to be distributed need not include anything that is
normally distributed (in either source or binary form) with the major
components (compiler, kernel, and so on) of the operating system on
which the executable runs, unless that component itself accompanies
the executable.

It may happen that this requirement contradicts the license
restrictions of other proprietary libraries that do not normally
accompany the operating system. Such a contradiction means you cannot
use both them and the Library together in an executable that you
distribute.

7. You may place library facilities that are a work based on the
Library side-by-side in a single library together with other library
facilities not covered by this License, and distribute such a combined
library, provided that the separate distribution of the work based on
the Library and of the other library facilities is otherwise
permitted, and provided that you do these two things:

a) Accompany the combined library with a copy of the same work
based on the Library, uncombined with any other library
facilities. This must be distributed under the terms of the
Sections above.

b) Give prominent notice with the combined library of the fact
that part of it is a work based on the Library, and explaining
where to find the accompanying uncombined form of the same work.

8. You may not copy, modify, sublicense, link with, or distribute
the Library except as expressly provided under this License. Any
attempt otherwise to copy, modify, sublicense, link with, or
distribute the Library is void, and will automatically terminate your
rights under this License. However, parties who have received copies,
or rights, from you under this License will not have their licenses
terminated so long as such parties remain in full compliance.

9. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Library or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Library (or any work based on the
Library), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Library or works based on it.

10. Each time you redistribute the Library (or any work based on the
Library), the recipient automatically receives a license from the
original licensor to copy, distribute, link with or modify the Library
subject to these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties with
this License.

11. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Library at all. For example, if a patent
license would not permit royalty-free redistribution of the Library by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Library.

If any portion of this section is held invalid or unenforceable under any
particular circumstance, the balance of the section is intended to apply,
and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

12. If the distribution and/or use of the Library is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Library under this License may add
an explicit geographical distribution limitation excluding those countries,
so that distribution is permitted only in or among countries not thus
excluded. In such case, this License incorporates the limitation as if
written in the body of this License.

13. The Free Software Foundation may publish revised and/or new
versions of the Lesser General Public License from time to time.
Such new versions will be similar in spirit to the present version,
but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Library
specifies a version number of this License which applies to it and
"any later version", you have the option of following the terms and
conditions either of that version or of any later version published by
the Free Software Foundation. If the Library does not specify a
license version number, you may choose any version ever published by
the Free Software Foundation.

14. If you wish to incorporate parts of the Library into other free
programs whose distribution conditions are incompatible with these,
write to the author to ask for permission. For software which is
copyrighted by the Free Software Foundation, write to the Free
Software Foundation; we sometimes make exceptions for this. Our
decision will be guided by the two goals of preserving the free status
of all derivatives of our free software and of promoting the sharing
and reuse of software generally.

NO WARRANTY

15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Libraries

If you develop a new library, and you want it to be of the greatest
possible use to the public, we recommend making it free software that
everyone can redistribute and change. You can do so by permitting
redistribution under these terms (or, alternatively, under the terms of the
ordinary General Public License).

To apply these terms, attach the following notices to the library. It is
safest to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least the
"copyright" line and a pointer to where the full notice is found.


Copyright (C)

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU 484 | Lesser General Public License for more details. 485 | 486 | You should have received a copy of the GNU Lesser General Public 487 | License along with this library; if not, write to the Free Software 488 | Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 489 | 490 | Also add information on how to contact you by electronic and paper mail. 491 | 492 | You should also get your employer (if you work as a programmer) or your 493 | school, if any, to sign a "copyright disclaimer" for the library, if 494 | necessary. Here is a sample; alter the names: 495 | 496 | Yoyodyne, Inc., hereby disclaims all copyright interest in the 497 | library `Frob' (a library for tweaking knobs) written by James Random Hacker. 498 | 499 | , 1 April 1990 500 | Ty Coon, President of Vice 501 | 502 | That's all there is to it! 503 | 504 | 505 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: 2 | python setup.py build 3 | find build -name align.so | xargs -J % cp % PI 4 | 5 | install: 6 | python setup.py install 7 | 8 | clean: 9 | find . 
-name "*.pyc" | xargs rm -f 10 | rm -f PI/align.c 11 | rm -rf build 12 | 13 | -------------------------------------------------------------------------------- /PI/__init__.py: -------------------------------------------------------------------------------- 1 | __all__ = [ "input", "distance", "phylogeny", "multialign", "output" ] 2 | -------------------------------------------------------------------------------- /PI/__init__.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/unmarshal/protocol-informatics/fcae1cd7cb71cd1914f9bc9fb9dcecdd0a6cfa86/PI/__init__.pyc -------------------------------------------------------------------------------- /PI/align.pyx: -------------------------------------------------------------------------------- 1 | cdef extern from "Numeric/arrayobject.h": 2 | 3 | struct PyArray_Descr: 4 | int type_num, elsize 5 | char type 6 | 7 | ctypedef class Numeric.ArrayType [object PyArrayObject]: 8 | cdef char *data 9 | cdef int nd 10 | cdef int *dimensions, *strides 11 | cdef object base 12 | cdef PyArray_Descr *descr 13 | cdef int flags 14 | 15 | import Numeric 16 | 17 | def NeedlemanWunsch(seq1, seq2, S, int g, int e): 18 | 19 | cdef int M, N, i, j, t_max, i_max, j_max, dir 20 | cdef int v1, v2, v3, m, new_len, data, ni, nj 21 | 22 | cdef ArrayType a 23 | cdef int nrows, ncols 24 | cdef int *matrix, x 25 | 26 | edits1 = [] 27 | edits2 = [] 28 | 29 | M = len(seq1) + 1 30 | N = len(seq2) + 1 31 | 32 | table = Numeric.zeros([M, N], 'i') 33 | 34 | a = table 35 | 36 | nrows = a.dimensions[0] 37 | ncols = a.dimensions[1] 38 | 39 | matrix = a.data 40 | 41 | # Iterate through matrix and score similarities 42 | for i from 1 <= i < M: 43 | for j from 1 <= j < N: 44 | if seq1[i - 1] == seq2[j - 1]: 45 | matrix[i * ncols + j] = 1 46 | else: 47 | matrix[i * ncols + j] = 0 48 | 49 | # Sum the matrix 50 | i_max = 0 51 | j_max = 0 52 | t_max = 0 53 | 54 | for i from 1 <= i < M: 55 | for j from 
1 <= j < N: 56 | 57 | dir = 0 58 | 59 | v1 = matrix[(i - 1) * ncols + (j - 1)] 60 | v2 = matrix[i * ncols + (j - 1)] 61 | v3 = matrix[(i - 1) * ncols + j] 62 | 63 | if v1 > 255: 64 | v1 = v1 >> 8 65 | if v2 > 255: 66 | v2 = v2 >> 8 67 | if v3 > 255: 68 | v3 = v3 >> 8 69 | 70 | v1 = v1 + matrix[i * ncols + j] 71 | v2 = v2 - e 72 | v3 = v3 - e 73 | 74 | if v1 > 0: 75 | m = v1 76 | else: 77 | m = 0 78 | 79 | if v2 > m: 80 | m = v2 81 | 82 | if v3 > m: 83 | m = v3 84 | 85 | if m == v1: 86 | dir = dir | (1 << 0) 87 | 88 | if m == v2: 89 | dir = dir | (1 << 1) 90 | 91 | if m == v3: 92 | dir = dir | (1 << 2) 93 | 94 | if m >= t_max: 95 | t_max = m 96 | i_max = i 97 | j_max = j 98 | 99 | m = m << 8 100 | m = m | dir 101 | 102 | matrix[i * ncols + j] = m 103 | 104 | # Do backtrace through matrix 105 | i = i_max 106 | j = j_max 107 | 108 | new_len = 0 109 | 110 | while i and j: 111 | data = matrix[i * ncols + j] 112 | 113 | if data & (1 << 2): 114 | ni = i - 1 115 | nj = j 116 | 117 | if data & (1 << 1): 118 | ni = i 119 | nj = j - 1 120 | 121 | if data & (1 << 0): 122 | ni = i - 1 123 | nj = j - 1 124 | 125 | new_len = new_len + 1 126 | 127 | i = ni 128 | j = nj 129 | 130 | new_seq1 = Numeric.zeros((new_len), 'i') 131 | new_seq2 = Numeric.zeros((new_len), 'i') 132 | 133 | cdef int s1, s2, gaps 134 | s1 = s2 = new_len 135 | gaps = 0 136 | 137 | i = i_max 138 | j = j_max 139 | 140 | while i and j: 141 | data = matrix[i * ncols + j] 142 | 143 | if data & (1 << 2): 144 | ni = i - 1 145 | nj = j 146 | dir = (1 << 2) 147 | 148 | if data & (1 << 1): 149 | ni = i 150 | nj = j - 1 151 | dir = (1 << 1) 152 | 153 | if data & (1 << 0): 154 | ni = i - 1 155 | nj = j - 1 156 | dir = (1 << 0) 157 | 158 | if dir == (1 << 0): 159 | s1 = s1 - 1 160 | s2 = s2 - 1 161 | new_seq1[s1] = seq1[i - 1] 162 | new_seq2[s2] = seq2[j - 1] 163 | 164 | if dir == (1 << 1): 165 | s1 = s1 - 1 166 | s2 = s2 - 1 167 | 168 | edits1.append(s1) 169 | new_seq1[s1] = 256 # '_' 170 | new_seq2[s2] = seq2[j - 1] 171 | 
gaps = gaps + 1 172 | 173 | if dir == (1 << 2): 174 | s1 = s1 - 1 175 | s2 = s2 - 1 176 | 177 | new_seq1[s1] = seq1[i - 1] 178 | new_seq2[s2] = 256 # '_' 179 | edits2.append(s2) 180 | gaps = gaps + 1 181 | 182 | i = ni 183 | j = nj 184 | 185 | return (new_seq1, new_seq2, edits1, edits2, t_max, gaps) 186 | 187 | def SmithWaterman(seq1, seq2, S, int g, int e): 188 | 189 | cdef int M, N, i, j, t_max, i_max, j_max, dir 190 | cdef int v1, v2, v3, m, new_len, data, ni, nj 191 | 192 | cdef ArrayType a 193 | cdef int nrows, ncols 194 | cdef int *matrix, x 195 | 196 | edits1 = [] 197 | edits2 = [] 198 | 199 | M = len(seq1) + 1 200 | N = len(seq2) + 1 201 | 202 | table = Numeric.zeros([M, N], 'i') 203 | 204 | for i in range(M): 205 | table[i][0] = 0 - i 206 | 207 | for i in range(N): 208 | table[0][i] = 0 - i 209 | 210 | a = table 211 | 212 | nrows = a.dimensions[0] 213 | ncols = a.dimensions[1] 214 | 215 | matrix = a.data 216 | 217 | # Iterate through matrix and score similarities 218 | for i from 1 <= i < M: 219 | for j from 1 <= j < N: 220 | if seq1[i - 1] == seq2[j - 1]: 221 | matrix[i * ncols + j] = 2 222 | else: 223 | matrix[i * ncols + j] = -1 224 | 225 | # Sum the matrix 226 | i_max = 0 227 | j_max = 0 228 | t_max = 0 229 | 230 | for i from 1 <= i < M: 231 | for j from 1 <= j < N: 232 | 233 | dir = 0 234 | 235 | v1 = matrix[(i - 1) * ncols + (j - 1)] 236 | v2 = matrix[i * ncols + (j - 1)] 237 | v3 = matrix[(i - 1) * ncols + j] 238 | 239 | if v1 > 255 or (v1 & 0xffffff00) == False: 240 | v1 = v1 >> 8 241 | if v2 > 255 or (v2 & 0xffffff00) == False: 242 | v2 = v2 >> 8 243 | if v3 > 255 or (v3 & 0xffffff00) == False: 244 | v3 = v3 >> 8 245 | 246 | v1 = v1 + matrix[i * ncols + j] 247 | v2 = v2 - 2 248 | v3 = v3 - 2 249 | 250 | if v1 > 0: 251 | m = v1 252 | else: 253 | m = 0 254 | 255 | if v2 > m: 256 | m = v2 257 | 258 | if v3 > m: 259 | m = v3 260 | 261 | if m == v1: 262 | dir = dir | (1 << 0) 263 | 264 | if m == v2: 265 | dir = dir | (1 << 1) 266 | 267 | if m == v3: 
268 | dir = dir | (1 << 2) 269 | 270 | if m >= t_max: 271 | t_max = m 272 | i_max = i 273 | j_max = j 274 | 275 | m = m << 8 276 | m = m | dir 277 | 278 | matrix[i * ncols + j] = m 279 | 280 | # Do backtrace through matrix 281 | i = i_max 282 | j = j_max 283 | 284 | new_len = 0 285 | 286 | while matrix[i * ncols + j] > 0: 287 | data = matrix[i * ncols + j] 288 | 289 | if data & (1 << 2): 290 | ni = i - 1 291 | nj = j 292 | 293 | if data & (1 << 1): 294 | ni = i 295 | nj = j - 1 296 | 297 | if data & (1 << 0): 298 | ni = i - 1 299 | nj = j - 1 300 | 301 | new_len = new_len + 1 302 | 303 | i = ni 304 | j = nj 305 | 306 | new_seq1 = Numeric.zeros((new_len), 'i') 307 | new_seq2 = Numeric.zeros((new_len), 'i') 308 | 309 | cdef int s1, s2, gaps 310 | s1 = s2 = new_len 311 | gaps = 0 312 | 313 | i = i_max 314 | j = j_max 315 | 316 | while matrix[i * ncols + j] > 0: 317 | data = matrix[i * ncols + j] 318 | 319 | if data & (1 << 2): 320 | ni = i - 1 321 | nj = j 322 | dir = (1 << 2) 323 | 324 | if data & (1 << 1): 325 | ni = i 326 | nj = j - 1 327 | dir = (1 << 1) 328 | 329 | if data & (1 << 0): 330 | ni = i - 1 331 | nj = j - 1 332 | dir = (1 << 0) 333 | 334 | if dir == (1 << 0): 335 | s1 = s1 - 1 336 | s2 = s2 - 1 337 | new_seq1[s1] = seq1[i - 1] 338 | new_seq2[s2] = seq2[j - 1] 339 | 340 | if dir == (1 << 1): 341 | s1 = s1 - 1 342 | s2 = s2 - 1 343 | 344 | edits1.append(s1) 345 | new_seq1[s1] = 256 # '_' 346 | new_seq2[s2] = seq2[j - 1] 347 | gaps = gaps + 1 348 | 349 | if dir == (1 << 2): 350 | s1 = s1 - 1 351 | s2 = s2 - 1 352 | 353 | new_seq1[s1] = seq1[i - 1] 354 | new_seq2[s2] = 256 # '_' 355 | edits2.append(s2) 356 | gaps = gaps + 1 357 | 358 | i = ni 359 | j = nj 360 | 361 | return (new_seq1, new_seq2, edits1, edits2, t_max, gaps) 362 | -------------------------------------------------------------------------------- /PI/align.so: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/unmarshal/protocol-informatics/fcae1cd7cb71cd1914f9bc9fb9dcecdd0a6cfa86/PI/align.so -------------------------------------------------------------------------------- /PI/distance.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Distance module 4 | 5 | Find distance between sequences 6 | 7 | Written by Marshall Beddoe 8 | Copyright (c) 2004 Baseline Research 9 | 10 | Licensed under the LGPL 11 | """ 12 | 13 | # 14 | # Note: Gaps are denoted by the integer value 256 as to avoid '_' problems 15 | # 16 | 17 | import align, zlib 18 | from Numeric import * 19 | 20 | __all__ = [ "Distance", "Entropic", "PairwiseIdentity", "LocalAlignment" ] 21 | 22 | class Distance: 23 | 24 | """Implementation of classify base class""" 25 | 26 | def __init__(self, sequences): 27 | self.sequences = sequences 28 | self.N = len(sequences) 29 | 30 | # NxN Distance matrix 31 | self.dmx = zeros((self.N, self.N), Float) 32 | 33 | for i in range(len(sequences)): 34 | for j in range(len(sequences)): 35 | self.dmx[i][j] = -1 36 | 37 | self._go() 38 | 39 | def __repr__(self): 40 | return "%s" % self.dmx 41 | 42 | def __getitem__(self, i): 43 | return self.dmx[i] 44 | 45 | def __len__(self): 46 | return len(self.dmx) 47 | 48 | def _go(self): 49 | """Perform distance calculations""" 50 | pass 51 | 52 | class Entropic(Distance): 53 | 54 | """Distance calculation based off compression ratios""" 55 | 56 | def _go(self): 57 | 58 | # Similarity matrix 59 | similar = zeros((self.N, self.N), Float) 60 | 61 | for i in range(self.N): 62 | for j in range(self.N): 63 | similar[i][j] = -1 64 | 65 | # 66 | # Do compression ratio calculations 67 | # 68 | for i in range(self.N): 69 | for j in range(self.N): 70 | 71 | if similar[i][j] >= 0: 72 | continue 73 | 74 | seq1 = self.sequences[i][1] 75 | seq2 = self.sequences[j][1] 76 | 77 | # Convert sequences to strings, gaps denoted by '_' 78 | seq1str = "" 79 | for x in seq1: 80 | if x 
== 256: 81 | seq1str += '_' 82 | else: 83 | seq1str += chr(x) 84 | 85 | seq2str = "" 86 | for x in seq2: 87 | if x == 256: 88 | seq2str += '_' 89 | else: 90 | seq2str += chr(x) 91 | 92 | comp1 = zlib.compress(seq1str) 93 | comp2 = zlib.compress(seq2str) 94 | 95 | if len(comp1) > len(comp2): 96 | score = len(comp2) * 1.0 / len(comp1) * 1.0 97 | else: 98 | score = len(comp1) * 1.0 / len(comp2) * 1.0 99 | 100 | similar[i][j] = similar[j][i] = score 101 | 102 | # 103 | # Distance matrix 104 | # 105 | for i in range(self.N): 106 | for j in range(self.N): 107 | self.dmx[i][j] = similar[i][i] - similar[i][j] 108 | 109 | 110 | class PairwiseIdentity(Distance): 111 | 112 | """Distance through basic pairwise similarity""" 113 | 114 | def _go(self): 115 | 116 | # Similarity matrix 117 | similar = zeros((self.N, self.N), Float) 118 | 119 | for i in range(self.N): 120 | for j in range(self.N): 121 | similar[i][j] = -1 122 | 123 | # 124 | # Find pairs 125 | # 126 | for i in range(self.N): 127 | for j in range(self.N): 128 | 129 | if similar[i][j] >= 0: 130 | continue 131 | 132 | seq1 = self.sequences[i][1] 133 | seq2 = self.sequences[j][1] 134 | 135 | minlen = min(len(seq1), len(seq2)) 136 | 137 | len1 = len2 = idents = 0 138 | 139 | for x in range(minlen): 140 | if seq1[x] != 256: 141 | len1 += 1.0 142 | 143 | if seq1[x] == seq2[x]: 144 | idents += 1.0 145 | 146 | if seq2[x] != 256: 147 | len2 += 1.0 148 | 149 | m = max(len1, len2) 150 | 151 | similar[i][j] = idents / m 152 | 153 | # 154 | # Distance matrix 155 | # 156 | for i in range(self.N): 157 | for j in range(self.N): 158 | self.dmx[i][j] = similar[i][i] - similar[i][j] 159 | 160 | class LocalAlignment(Distance): 161 | 162 | """Distance through local alignment similarity""" 163 | 164 | def __init__(self, sequences, smx=None): 165 | self.smx = smx 166 | 167 | # If similarity matrix is None, make a quick identity matrix 168 | if self.smx == None: 169 | 170 | self.smx = zeros((257, 257), Float) 171 | 172 | for i in 
range(257): 173 | for j in range(257): 174 | if i == j: 175 | self.smx[i][j] = 1.0 176 | else: 177 | self.smx[i][j] = 0.0 178 | 179 | Distance.__init__(self, sequences) 180 | 181 | def _go(self): 182 | 183 | # Similarity matrix 184 | similar = zeros((self.N, self.N), Float) 185 | 186 | for i in range(self.N): 187 | for j in range(self.N): 188 | similar[i][j] = -1 189 | 190 | # 191 | # Compute similarity matrix of SW scores 192 | # 193 | for i in range(self.N): 194 | for j in range(self.N): 195 | 196 | if similar[i][j] >= 0: 197 | continue 198 | 199 | seq1 = self.sequences[i][1] 200 | seq2 = self.sequences[j][1] 201 | 202 | (nseq1, nseq2, edits1, edits2, score, gaps) = \ 203 | align.SmithWaterman(seq1, seq2, self.smx, 0, 0) 204 | 205 | similar[i][j] = similar[j][i] = score 206 | 207 | # 208 | # Compute distance matrix of SW scores 209 | # 210 | for i in range(self.N): 211 | for j in range(self.N): 212 | 213 | if self.dmx[i][j] >= 0: 214 | continue 215 | 216 | self.dmx[i][j] = 1 - (similar[i][j] / similar[i][i]) 217 | self.dmx[j][i] = self.dmx[i][j] 218 | -------------------------------------------------------------------------------- /PI/input.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Input module 4 | 5 | Handle different input file types and digitize sequences 6 | 7 | Written by Marshall Beddoe 8 | Copyright (c) 2004 Baseline Research 9 | 10 | Licensed under the LGPL 11 | """ 12 | 13 | from pcapy import * 14 | from socket import * 15 | from sets import * 16 | 17 | __all__ = ["Input", "Pcap", "ASCII" ] 18 | 19 | class Input: 20 | 21 | """Implementation of base input class""" 22 | 23 | def __init__(self, filename): 24 | """Import specified filename""" 25 | 26 | self.set = Set() 27 | self.sequences = [] 28 | self.index = 0 29 | 30 | def __iter__(self): 31 | self.index = 0 32 | return self 33 | 34 | def next(self): 35 | if self.index == len(self.sequences): 36 | raise StopIteration 37 | 38 | self.index += 1 39 
| 40 | return self.sequences[self.index - 1] 41 | 42 | def __len__(self): 43 | return len(self.sequences) 44 | 45 | def __repr__(self): 46 | return "%s" % self.sequences 47 | 48 | def __getitem__(self, index): 49 | return self.sequences[index] 50 | 51 | class Pcap(Input): 52 | 53 | """Handle the pcap file format""" 54 | 55 | def __init__(self, filename, offset=14): 56 | Input.__init__(self, filename) 57 | self.pktNumber = 0 58 | self.offset = offset 59 | 60 | try: 61 | pd = open_offline(filename) 62 | except: 63 | raise IOError 64 | 65 | pd.dispatch(-1, self.handler) 66 | 67 | def handler(self, hdr, pkt): 68 | if hdr.getlen() <= 0: 69 | return 70 | 71 | # Increment packet counter 72 | self.pktNumber += 1 73 | 74 | # Ethernet is a safe assumption 75 | offset = self.offset 76 | 77 | # Parse IP header 78 | iphdr = pkt[offset:] 79 | 80 | ip_hl = ord(iphdr[0]) & 0x0f # header length 81 | ip_len = (ord(iphdr[2]) << 8) | ord(iphdr[3]) # total length 82 | ip_p = ord(iphdr[9]) # protocol type 83 | ip_srcip = inet_ntoa(iphdr[12:16]) # source ip address 84 | ip_dstip = inet_ntoa(iphdr[16:20]) # dest ip address 85 | 86 | offset += (ip_hl * 4) 87 | 88 | # Parse TCP if applicable 89 | if ip_p == 6: 90 | tcphdr = pkt[offset:] 91 | 92 | th_sport = (ord(tcphdr[0]) << 8) | ord(tcphdr[1]) # source port 93 | th_dport = (ord(tcphdr[2]) << 8) | ord(tcphdr[3]) # dest port 94 | th_off = ord(tcphdr[12]) >> 4 # tcp offset 95 | 96 | offset += (th_off * 4) 97 | 98 | # Parse UDP if applicable 99 | elif ip_p == 17: 100 | offset += 8 101 | 102 | # Parse out application layer 103 | seq_len = (ip_len - offset) + 14 104 | 105 | if seq_len <= 0: 106 | return 107 | 108 | seq = pkt[offset:] 109 | 110 | l = len(self.set) 111 | self.set.add(seq) 112 | 113 | if len(self.set) == l: 114 | return 115 | 116 | # Digitize sequence 117 | digitalSeq = [] 118 | for c in seq: 119 | digitalSeq.append(ord(c)) 120 | 121 | self.sequences.append((self.pktNumber, digitalSeq)) 122 | 123 | class ASCII(Input): 124 | 125 | 
"""Handle newline delimited ASCII input files""" 126 | 127 | def __init__(self, filename): 128 | Input.__init__(self, filename) 129 | 130 | try: 131 | fd = open(filename, "r") 132 | except: 133 | raise IOError 134 | 135 | lineno = 0 136 | 137 | while 1: 138 | lineno += 1 139 | line = fd.readline() 140 | 141 | if not line: 142 | break 143 | 144 | l = len(self.set) 145 | self.set.add(line) 146 | 147 | if len(self.set) == l: 148 | continue 149 | 150 | # Digitize sequence 151 | digitalSeq = [] 152 | for c in line: 153 | digitalSeq.append(ord(c)) 154 | 155 | self.sequences.append((lineno, digitalSeq)) 156 | -------------------------------------------------------------------------------- /PI/input.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/unmarshal/protocol-informatics/fcae1cd7cb71cd1914f9bc9fb9dcecdd0a6cfa86/PI/input.pyc -------------------------------------------------------------------------------- /PI/multialign.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Multialign module 4 | 5 | Perform multiple sequence alignment using tree as guide 6 | 7 | Written by Marshall Beddoe 8 | Copyright (c) 2004 Baseline Research 9 | 10 | Licensed under the LGPL 11 | """ 12 | 13 | import align 14 | from Numeric import * 15 | 16 | class Multialign: 17 | 18 | """Implementation of multialign base class""" 19 | 20 | def __init__(self, tree): 21 | self.tree = tree 22 | self.aligned = [] 23 | self.index = 0 24 | 25 | self._go() 26 | 27 | def _go(self): 28 | pass 29 | 30 | def __len__(self): 31 | return len(self.aligned) 32 | 33 | def __getitem__(self, index): 34 | return self.aligned[index] 35 | 36 | def __iter__(self): 37 | self.index = 0 38 | return self 39 | 40 | def next(self): 41 | if self.index == len(self.aligned): 42 | raise StopIteration 43 | 44 | self.index += 1 45 | 46 | return self.aligned[self.index - 1] 47 | 48 | class 
NeedlemanWunsch(Multialign): 49 | 50 | """Perform global multiple sequence alignment""" 51 | def __init__(self, tree, smx=None): 52 | self.smx = smx 53 | 54 | # If similarity matrix is None, make a quick identity matrix 55 | if self.smx == None: 56 | self.smx = zeros((257, 257), Float) 57 | 58 | for i in range(257): 59 | for j in range(257): 60 | if i == j: 61 | self.smx[i][j] = 1.0 62 | else: 63 | self.smx[i][j] = 0.0 64 | 65 | Multialign.__init__(self, tree) 66 | 67 | def _go(self): 68 | 69 | self._assign(self.tree) 70 | self._alignSum(self.tree, []) 71 | 72 | def _assign(self, root): 73 | """Traverse tree and align sequences""" 74 | 75 | if root.getValue()[2] == None: 76 | if root.getLeft().getValue()[2] == None: 77 | self._assign(root.getLeft()) 78 | 79 | if root.getRight().getValue()[2] == None: 80 | self._assign(root.getRight()) 81 | 82 | # Get sequences 83 | seq1 = root.getLeft().getValue()[2][1] 84 | seq2 = root.getRight().getValue()[2][1] 85 | 86 | (a1, a2, e1, e2, score, gaps) = \ 87 | align.NeedlemanWunsch(seq1, seq2, self.smx, 0, 0) 88 | 89 | v1 = root.getLeft().getValue() 90 | v2 = root.getRight().getValue() 91 | 92 | nv1 = (v1[0], v1[1], v1[2], e1) 93 | nv2 = (v2[0], v2[1], v2[2], e2) 94 | 95 | root.getLeft().setValue(nv1) 96 | root.getRight().setValue(nv2) 97 | 98 | # Choose the sequence with least gaps (e1/e2 list the gap positions) 99 | if len(e1) < len(e2): 100 | nseq = a1 101 | else: 102 | nseq = a2 103 | 104 | v1 = root.getValue() 105 | nseq = (v1[0], nseq) 106 | nv1 = (v1[0], v1[1], nseq, v1[3]) 107 | root.setValue(nv1) 108 | 109 | def _alignSum(self, root, edits): 110 | 111 | if root.getLeft() == None and root.getRight() == None: 112 | seq1 = root.getValue()[2][1] 113 | id = root.getValue()[2][0] 114 | new = seq1 115 | 116 | for i in range(len(edits)): 117 | e = edits[i] 118 | self._applyEdits(new, e) 119 | 120 | self.aligned.append((id, new)) 121 | else: 122 | e = root.getLeft().getValue()[3] 123 | edits.insert(0, e) 124 | self._alignSum(root.getLeft(), edits) 125 | k = edits.pop(0) 
126 | 127 | e = root.getRight().getValue()[3] 128 | edits.insert(0, e) 129 | self._alignSum(root.getRight(), edits) 130 | k = edits.pop(0) 131 | 132 | def _applyEdits(self, seq, edits): 133 | i = 0 134 | gap = 256 135 | 136 | edits.sort() 137 | 138 | for e in edits: 139 | seq.insert(e, gap) 140 | 141 | return seq 142 | -------------------------------------------------------------------------------- /PI/output.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Consensus module 4 | Generate consensus based on multiple sequence alignment 5 | 6 | Written by Marshall Beddoe 7 | Copyright (c) 2004 Baseline Research 8 | 9 | Licensed under the LGPL 10 | """ 11 | 12 | from curses.ascii import * 13 | 14 | class Output: 15 | 16 | def __init__(self, sequences): 17 | 18 | self.sequences = sequences 19 | self.consensus = [] 20 | self._go() 21 | 22 | def _go(self): 23 | pass 24 | 25 | class Ansi(Output): 26 | 27 | def __init__(self, sequences): 28 | 29 | # Color defaults for composition 30 | self.gap = "\033[41;30m%s\033[0m" 31 | self.printable = "\033[42;30m%s\033[0m" 32 | self.space = "\033[43;30m%s\033[0m" 33 | self.binary = "\033[44;30m%s\033[0m" 34 | self.zero = "\033[45;30m%s\033[0m" 35 | self.bit = "\033[46;30m%s\033[0m" 36 | self.default = "\033[47;30m%s\033[0m" 37 | 38 | Output.__init__(self, sequences) 39 | 40 | def _go(self): 41 | 42 | seqLength = len(self.sequences[0][1]) 43 | rounds = seqLength / 18 44 | remainder = seqLength % 18 45 | l = len(self.sequences[0][1]) 46 | 47 | start = 0 48 | end = 18 49 | 50 | dtConsensus = [] 51 | mtConsensus = [] 52 | 53 | for i in range(rounds): 54 | for id, seq in self.sequences: 55 | print "%04d" % id, 56 | for byte in seq[start:end]: 57 | if byte == 256: 58 | print self.gap % "___", 59 | elif isspace(byte): 60 | print self.space % " ", 61 | elif isprint(byte): 62 | print self.printable % "x%02x" % byte, 63 | elif byte == 0: 64 | print self.zero % "x00", 65 | else: 66 | print 
self.default % "x%02x" % byte, 67 | print "" 68 | 69 | # Calculate datatype consensus 70 | 71 | print "DT ", 72 | for j in range(start, end): 73 | column = [] 74 | for id, seq in self.sequences: 75 | column.append(seq[j]) 76 | dt = self._dtConsensus(column) 77 | print dt, 78 | dtConsensus.append(dt) 79 | print "" 80 | 81 | print "MT ", 82 | for j in range(start, end): 83 | column = [] 84 | for id, seq in self.sequences: 85 | column.append(seq[j]) 86 | rate = self._mutationRate(column) 87 | print "%03d" % (rate * 100), 88 | mtConsensus.append(rate) 89 | print "\n" 90 | 91 | start += 18 92 | end += 18 93 | 94 | if remainder: 95 | for id, seq in self.sequences: 96 | print "%04d" % id, 97 | for byte in seq[start:start + remainder]: 98 | if byte == 256: 99 | print self.gap % "___", 100 | elif isspace(byte): 101 | print self.space % " ", 102 | elif isprint(byte): 103 | print self.printable % "x%02x" % byte, 104 | elif byte == 0: 105 | print self.zero % "x00", 106 | else: 107 | print self.default % "x%02x" % byte, 108 | print "" 109 | 110 | print "DT ", 111 | for j in range(start, start + remainder): 112 | column = [] 113 | for id, seq in self.sequences: 114 | column.append(seq[j]) 115 | dt = self._dtConsensus(column) 116 | print dt, 117 | dtConsensus.append(dt) 118 | print "" 119 | 120 | print "MT ", 121 | for j in range(start, start + remainder): 122 | column = [] 123 | for id, seq in self.sequences: 124 | column.append(seq[j]) 125 | rate = self._mutationRate(column) 126 | mtConsensus.append(rate) 127 | print "%03d" % (rate * 100), 128 | print "" 129 | 130 | # Calculate consensus sequence 131 | l = len(self.sequences[0][1]) 132 | 133 | for i in range(l): 134 | histogram = {} 135 | for id, seq in self.sequences: 136 | try: 137 | histogram[seq[i]] += 1 138 | except: 139 | histogram[seq[i]] = 1 140 | 141 | items = histogram.items() 142 | items.sort() 143 | 144 | m = 1 145 | v = 257 146 | for j in items: 147 | if j[1] > m: 148 | m = j[1] 149 | v = j[0] 150 | 151 | 
self.consensus.append(v) 152 | 153 | real = [] 154 | 155 | for i in range(len(self.consensus)): 156 | if self.consensus[i] == 256: 157 | continue 158 | real.append((self.consensus[i], dtConsensus[i], mtConsensus[i])) 159 | 160 | # 161 | # Display consensus data 162 | # 163 | totalLen = len(real) 164 | rounds = totalLen / 18 165 | remainder = totalLen % 18 166 | 167 | start = 0 168 | end = 18 169 | 170 | print "\nUngapped Consensus:" 171 | 172 | for i in range(rounds): 173 | print "CONS", 174 | for byte,type,rate in real[start:end]: 175 | if byte == 256: 176 | print self.gap % "___", 177 | elif byte == 257: 178 | print self.default % "???", 179 | elif isspace(byte): 180 | print self.space % " ", 181 | elif isprint(byte): 182 | print self.printable % "x%02x" % byte, 183 | elif byte == 0: 184 | print self.zero % "x00", 185 | else: 186 | print self.default % "x%02x" % byte, 187 | print "" 188 | 189 | print "DT ", 190 | for byte,type,rate in real[start:end]: 191 | print type, 192 | print "" 193 | 194 | print "MT ", 195 | for byte,type,rate in real[start:end]: 196 | print "%03d" % (rate * 100), 197 | print "\n" 198 | 199 | start += 18 200 | end += 18 201 | 202 | if remainder: 203 | print "CONS", 204 | for byte,type,rate in real[start:start + remainder]: 205 | if byte == 256: 206 | print self.gap % "___", 207 | elif byte == 257: 208 | print self.default % "???", 209 | elif isspace(byte): 210 | print self.space % " ", 211 | elif isprint(byte): 212 | print self.printable % "x%02x" % byte, 213 | elif byte == 0: 214 | print self.zero % "x00", 215 | else: 216 | print self.default % "x%02x" % byte, 217 | print "" 218 | 219 | print "DT ", 220 | for byte,type,rate in real[start:end]: 221 | print type, 222 | print "" 223 | 224 | print "MT ", 225 | for byte,type,rate in real[start:end]: 226 | print "%03d" % (rate * 100), 227 | print "" 228 | 229 | def _dtConsensus(self, data): 230 | histogram = {} 231 | 232 | for byte in data: 233 | if byte == 256: 234 | try: 235 | histogram["G"] 
+= 1 236 | except: 237 | histogram["G"] = 1 238 | elif isspace(byte): 239 | try: 240 | histogram["S"] += 1 241 | except: 242 | histogram["S"] = 1 243 | elif isprint(byte): 244 | try: 245 | histogram["A"] += 1 246 | except: 247 | histogram["A"] = 1 248 | elif byte == 0: 249 | try: 250 | histogram["Z"] += 1 251 | except: 252 | histogram["Z"] = 1 253 | else: 254 | try: 255 | histogram["B"] += 1 256 | except: 257 | histogram["B"] = 1 258 | 259 | items = histogram.items() 260 | items.sort() 261 | 262 | m = 1 263 | v = '?' 264 | for j in items: 265 | if j[1] > m: 266 | m = j[1] 267 | v = j[0] 268 | 269 | return v * 3 270 | 271 | def _mutationRate(self, data): 272 | 273 | histogram = {} 274 | 275 | for x in data: 276 | try: 277 | histogram[x] += 1 278 | except: 279 | histogram[x] = 1 280 | 281 | items = histogram.items() 282 | items.sort() 283 | 284 | if len(items) == 1: 285 | rate = 0.0 286 | else: 287 | rate = len(items) * 1.0 / len(data) * 1.0 288 | 289 | return rate 290 | -------------------------------------------------------------------------------- /PI/phylogeny.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Phylogenetic tree module 4 | 5 | Implementation of multiple tree building algorithms: 6 | - UPGMA 7 | - Maximum Parsimony 8 | - Maximum Likelihood 9 | 10 | Also, contains class to perform clustering based on results of tree generation 11 | 12 | Written by Marshall Beddoe 13 | Copyright (c) 2004 Baseline Research 14 | 15 | Licensed under the LGPL 16 | """ 17 | 18 | from sets import * 19 | from tree import * 20 | from pydot import * 21 | 22 | class Phylogeny: 23 | 24 | """Implementation of base phylogenetic class""" 25 | 26 | def __init__(self, sequences, dmx, minval=1.0): 27 | self.dmx = dmx 28 | self.index = 0 29 | self.tree = None 30 | self.clusters = [] 31 | self.minval = minval 32 | self.sequences = sequences 33 | 34 | self._go() 35 | 36 | def __len__(self): 37 | return len(self.clusters) 38 | 39 | def 
__iter__(self): 40 | self.index = 0 41 | return self 42 | 43 | def next(self): 44 | if self.index == len(self.clusters): 45 | raise StopIteration 46 | 47 | self.index += 1 48 | 49 | return self.clusters[self.index - 1] 50 | 51 | def __getitem__(self, index): 52 | return self.clusters[index] 53 | 54 | def _go(self): 55 | """Perform tree construction""" 56 | pass 57 | 58 | class UPGMA(Phylogeny): 59 | 60 | """UPGMA tree construction method""" 61 | 62 | def _go(self): 63 | 64 | # Universal set 65 | Cu = Set() 66 | 67 | # Place each sequence into individual tree node 68 | for i in range(len(self.sequences)): 69 | ntree = Tree() 70 | ntree.setValue((i, 0, self.sequences[i], None)) 71 | Cu.add(ntree) 72 | 73 | n = len(Cu) - 1 74 | totalNodes = len(Cu) 75 | 76 | for i in range(n): 77 | min = 10000 78 | 79 | for A in Cu: 80 | for B in Cu: 81 | if A == B: 82 | continue 83 | 84 | Dab = self._distance(A, B) 85 | 86 | # Choose closest clusters 87 | if Dab <= min: 88 | min = Dab 89 | 90 | savex = A.getValue()[0] 91 | savey = B.getValue()[0] 92 | 93 | # Create new root with clusters as children 94 | C = Tree() 95 | C.setLeft(A) 96 | C.setRight(B) 97 | 98 | A.setParent(C) 99 | B.setParent(C) 100 | 101 | C.setValue((10000 + i, min, None, None)) 102 | 103 | totalNodes += 1 104 | 105 | # Remove closest clusters from Cu and add new cluster 106 | #print "%d,%d = %f" % (savex, savey, min) 107 | Cu.remove(C.getLeft()) 108 | Cu.remove(C.getRight()) 109 | Cu.add(C) 110 | 111 | self.tree = Cu.pop() 112 | 113 | self._cluster(self.tree) 114 | 115 | def _distance(self, A, B): 116 | 117 | # If both nodes are leaves, return distance 118 | if A.getIsLeaf() and B.getIsLeaf(): 119 | return self.dmx[A.getValue()[0]][B.getValue()[0]] 120 | 121 | elif A.getIsLeaf(): 122 | d = self._distance(A, B.getLeft()) + self._distance(A, B.getRight()) 123 | return d / 2.0 124 | 125 | else: 126 | d = self._distance(A.getRight(), B) + self._distance(A.getLeft(), B) 127 | return d / 2.0 128 | 129 | def 
_cluster(self, root): 130 | 131 | if root.getIsLeaf(): 132 | return 133 | 134 | if root.getValue()[1] <= self.minval: 135 | self.clusters.append(root) 136 | return 137 | 138 | if root.getLeft(): 139 | self._cluster(root.getLeft()) 140 | 141 | if root.getRight(): 142 | self._cluster(root.getRight()) 143 | -------------------------------------------------------------------------------- /PI/tree.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | Binary Tree module 4 | 5 | Implementation of binary tree class. 6 | 7 | Written by Marshall Beddoe 8 | Copyright (c) 2004 Baseline Research 9 | 10 | Licensed under the LGPL 11 | """ 12 | 13 | from pydot import * 14 | 15 | class Tree: 16 | 17 | def __init__(self): 18 | 19 | self._parentNode = None 20 | self._leftNode = None 21 | self._rightNode = None 22 | self._value = None 23 | self._height = 0 24 | 25 | def getIsLeaf(self): 26 | """Is this tree a leaf node""" 27 | 28 | if self._leftNode == None and self._rightNode == None: 29 | return True 30 | else: 31 | return False 32 | 33 | def getHeight(self): 34 | """Return height at this root""" 35 | return self._height 36 | 37 | def getParent(self): 38 | """Return parent node""" 39 | return self._parentNode 40 | 41 | def setParent(self, parentNode): 42 | """Set new parent node""" 43 | self._parentNode = parentNode 44 | 45 | def getLeft(self): 46 | """Return left child node""" 47 | return self._leftNode 48 | 49 | def setLeft(self, leftNode): 50 | """Set left child node""" 51 | self._leftNode = leftNode 52 | 53 | def getRight(self): 54 | """Return right child node""" 55 | return self._rightNode 56 | 57 | def setRight(self, rightNode): 58 | """Set right child node""" 59 | self._rightNode = rightNode 60 | 61 | def getValue(self): 62 | """Return node value""" 63 | return self._value 64 | 65 | def setValue(self, value): 66 | """Set node value""" 67 | self._value = value 68 | 69 | def graph(self, output, format="raw"): 70 | """Graph tree 
from root using graphviz""" 71 | 72 | self.i = 0 73 | self.graph = Dot(center="TRUE",rankdir="TB") 74 | #self.graph = Dot(size="3.5,4",page="4.5,6",center="TRUE",rankdir="TB") 75 | self.subgraph = Subgraph("subG", rank="same") 76 | 77 | self._traverse(self) 78 | 79 | self.graph.add_subgraph(self.subgraph) 80 | 81 | #if format == "raw": 82 | self.graph.write_raw(output + ".dot", prog="dot") 83 | self.graph.write_png(output + ".png", prog="dot") 84 | #elif format == "png": 85 | # self.graph.write_png(output + ".png", prog="dot") 86 | #else: 87 | # raise "UnknownFormat" 88 | 89 | def _traverse(self, root): 90 | 91 | if root.getParent(): 92 | 93 | if root.getIsLeaf(): 94 | v1 = root.getValue()[2][0] 95 | else: 96 | v1 = root.getValue()[0] 97 | 98 | weight = root.getValue()[1] 99 | 100 | v2 = root.getParent().getValue()[0] 101 | 102 | l = "%.02f%%" % (weight * 100.0) 103 | 104 | if v1 >= 10000: 105 | node1 = Node(v1, shape="plaintext", ratio="auto", label=l) 106 | else: 107 | node1 = Node(v1, shape="house", ratio="auto") 108 | node1.set("style", "filled") 109 | node1.set("fillcolor", "cyan") 110 | 111 | if v2 >= 10000: 112 | weight = root.getParent().getValue()[1] 113 | l = "%.02f%%" % (weight * 100.0) 114 | node2 = Node(v2, shape="plaintext", ratio="auto", label=l) 115 | 116 | if root.getIsLeaf(): 117 | self.subgraph.add_node(node1) 118 | 119 | self.graph.add_node(node1) 120 | self.graph.add_node(node2) 121 | 122 | edge = Edge(v2, v1) 123 | 124 | self.graph.add_edge(edge) 125 | 126 | if root.getLeft(): 127 | self._traverse(root.getLeft()) 128 | 129 | if root.getRight(): 130 | self._traverse(root.getRight()) 131 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | The Protocol Informatics Framework 2 | Written by Marshall Beddoe 3 | Copyright (c) 2004 Baseline Research 4 | ---- 5 | 6 | Wired Article: 7 | 
https://www.wired.com/2004/10/genome-model-applied-to-software/ 8 | 9 | Overview: 10 | 11 | The Protocol Informatics project is a software framework that allows for 12 | advanced sequence and protocol stream analysis by utilizing bioinformatics 13 | algorithms. The sole purpose of this software is to identify protocol fields in 14 | unknown or poorly documented network protocol formats. The algorithms that are 15 | utilized perform comparative analysis on a series of samples to better 16 | understand the underlying structure of the otherwise random-looking data. The 17 | PI framework was designed for experimentation through the use of a widget-based 18 | component set. 19 | 20 | Requirements: 21 | 22 | Python >= 2.3.4 http://www.python.org 23 | Numerical Python http://www.stsci.edu/resources/software_hardware/numarray 24 | Pyrex http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/ 25 | Pcapy http://oss.coresecurity.com/projects/pcapy.html 26 | Pydot http://dkbza.org/pydot.html 27 | 28 | This software has been tested and works correctly under: 29 | - OpenBSD 30 | - FreeBSD 31 | - Linux 32 | - MacOSX 33 | 34 | Example usage: Analyzing the ICMP protocol 35 | 36 | ICMP is a simple fixed length protocol. 37 | Let's use the PI framework to discover the format. 38 | 39 | Step 1: Gather 100 ICMP packets using tcpdump 40 | 41 | # tcpdump -s 42 -c 100 -nl -w icmp.dump icmp 42 | 43 | Step 2: Run dump through PI prototype 44 | 45 | # ./main.py -g -p ./icmp.dump 46 | 47 | Protocol Informatics Prototype (v0.01 beta) 48 | Written by Marshall Beddoe 49 | Copyright (c) 2004 Baseline Research 50 | 51 | Found 100 unique sequences in '../dumps/icmp.out' 52 | Creating distance matrix .. complete 53 | Creating phylogenetic tree .. complete 54 | 55 | Discovered 1 clusters using a weight of 1.00 56 | Performing multiple alignment on cluster 1 .. 
complete

Output of cluster 1
0097 x08 x00 xad x4b x05 xbe x00 x60
0039 x08 x00 x30 x54 x05 xbe x00 x26
0026 x08 x00 xf7 xb2 x05 xbe x00 x19
0015 x08 x00 x01 xdb x05 xbe x00 x0e
0048 x08 x00 x4f xdf x05 xbe x00 x2f
0040 x08 x00 xf8 xa4 x05 xbe x00 x27
0077 x08 x00 xe8 x28 x05 xbe x00 x4c
0017 x08 x00 xe8 x6c x05 xbe x00 x10
0027 x08 x00 xc3 xa9 x05 xbe x00 x1a
0087 x08 x00 xdd xc1 x05 xbe x00 x56
0081 x08 x00 x88 x42 x05 xbe x00 x50
0058 x08 x00 xb0 x42 x05 xbe x00 x39
0013 x08 x00 x3e x38 x05 xbe x00
0067 x08 x00 x99 x36 x05 xbe x00 x42
0055 x08 x00 x0f x56 x05 xbe x00 x36
0004 x08 x00 xe6 xda x05 xbe x00 x03
0028 x08 x00 x83 xd9 x05 xbe x00 x1b
0095 x08 x00 xc1 xd9 x05 xbe x00 x5e
0075 x08 x00 x3a x63 x05 xbe x00 x4a
0053 x08 x00 x6d x2a x05 xbe x00 x34
0021 x08 x00 x6d x8d x05 xbe x00 x14
0088 x08 x00 xa8 x07 x05 xbe x00 x57
0005 x08 x00 xa8 x8a x05 xbe x00 x04
0080 x08 x00 xa8 x62 x05 xbe x00 x4f
0023 x08 x00 x3f x18 x05 xbe x00 x16
0002 x08 x00 x3f x65 x05 xbe x00 x01
0074 x08 x00 x3f xc2 x05 xbe x00 x49
0030 x08 x00 x3f x15 x05 xbe x00 x1d
0044 x08 x00 xcc xc2 x05 xbe x00 x2b
0078 x08 x00 xcc x8a x05 xbe x00 x4d
0071 x08 x00 xd8 x18 x05 xbe x00 x46
0035 x08 x00 x9a xfd x05 xbe x00 x22
0001 x08 x00 x69 xf9 x05 xbe x00 x00
0034 x08 x00 xc5 x9e x05 xbe x00 x21
0031 x08 x00 x38 x00 x05 xbe x00 x1e
0092 x08 x00 x38 x4c x05 xbe x00 x5b
0100 x08 x00 x2b x1a x05 xbe x00 x63
0049 x08 x00 x15 x1d x05 xbe x00 x30
0008 x08 x00 x2f x64 x05 xbe x00 x07
0089 x08 x00 x80 xe5 x05 xbe x00 x58
0096 x08 x00 xb2 xb0 x05 xbe x00 x5f
0079 x08 x00 xc2 xae x05 xbe x00 x4e
0057 x08 x00 xc2 x79 x05 xbe x00 x38
0046 x08 x00 x77 x7a x05 xbe x00 x2d
0018 x08 x00 xbb xce x05 xbe x00 x11
0025 x08 x00 xfe xaa x05 xbe x00 x18
0068 x08 x00 x50 xe3 x05 xbe x00 x43
0065 x08 x00 xe0 xb7 x05 xbe x00 x40
0011 x08 x00 x8d xd6 x05 xbe x00
0029 x08 x00 x7c xf3 x05 xbe x00 x1c
0033 x08 x00 xef xf3 x05 xbe x00
0069 x08 x00 x25 x6b x05 xbe x00 x44
0083 x08 x00 x25 xff x05 xbe x00 x52
0099 x08 x00 x56 x99 x05 xbe x00 x62
0061 x08 x00 x33 x81 x05 xbe x00 x3c
0050 x08 x00 xe9 xba x05 xbe x00 x31
0042 x08 x00 xb3 x49 x05 xbe x00 x29
0059 x08 x00 x81 x4e x05 xbe x00 x3a
0098 x08 x00 x81 xad x05 xbe x00 x61
0091 x08 x00 x42 xa0 x05 xbe x00 x5a
0054 x08 x00 x42 xd8 x05 xbe x00 x35
0037 x08 x00 x4c xe8 x05 xbe x00 x24
0041 x08 x00 xeb x4d x05 xbe x00 x28
0086 x08 x00 xe4 x53 x05 xbe x00 x55
0006 x08 x00 x71 x7b x05 xbe x00 x05
0012 x08 x00 x63 x7b x05 xbe x00
0070 x08 x00 xee x7d x05 xbe x00 x45
0051 x08 x00 xc8 x57 x05 xbe x00 x32
0066 x08 x00 xb4 x3c x05 xbe x00 x41
0014 x08 x00 x2c x26 x05 xbe x00
0062 x08 x00 x2c x7c x05 xbe x00 x3d
0016 x08 x00 xed x8e x05 xbe x00 x0f
0007 x08 x00 x47 x3d x05 xbe x00 x06
0073 x08 x00 x5e x72 x05 xbe x00 x48
0052 x08 x00 x9e x06 x05 xbe x00 x33
0072 x08 x00 x9e x9d x05 xbe x00 x47
0036 x08 x00 x6f x6e x05 xbe x00 x23
0060 x08 x00 x6c xc6 x05 xbe x00 x3b
0045 x08 x00 xa2 xf5 x05 xbe x00 x2c
0085 x08 x00 x00 x47 x05 xbe x00 x54
0076 x08 x00 x14 x85 x05 xbe x00 x4b
0020 x08 x00 xa0 x85 x05 xbe x00 x13
0019 x08 x00 xa6 x2c x05 xbe x00 x12
0003 x08 x00 x14 x2c x05 xbe x00 x02
0022 x08 x00 x44 x8c x05 xbe x00 x15
0082 x08 x00 x5d xe0 x05 xbe x00 x51
0009 x08 x00 xfc x41 x05 xbe x00 x08
0084 x08 x00 x35 x05 xbe x00 x53
0032 x08 x00 x0e x17 x05 xbe x00 x1f
0056 x08 x00 xe5 x05 xbe x00 x37
0043 x08 x00 xa1 xde x05 xbe x00 x2a
0094 x08 x00 x03 x92 x05 xbe x00 x5d
0047 x08 x00 x55 x83 x05 xbe x00 x2e
0090 x08 x00 x55 x94 x05 xbe x00 x59
0064 x08 x00 x8f x05 xbe x00 x3f
0093 x08 x00 xb6 x05 xbe x00 x5c
0010 x08 x00 xd1 xb6 x05 xbe x00
0024 x08 x00 x11 x8f x05 xbe x00 x17
0063 x08 x00 x11 x04 x05 xbe x00 x3e
0038 x08 x00 x37 x3b x05 xbe x00 x25
DT BBB ZZZ BBB BBB BBB BBB ZZZ AAA
MT 000 000 081 089 000 000 000 100

Ungapped Consensus:
CONS x08 x00 x3f x18 x05 xbe x00 ???
DT BBB ZZZ BBB BBB BBB BBB ZZZ AAA
MT 000 000 081 089 000 000 000 100

Step 3: Analyze Consensus Sequence

Pay attention to the datatype composition and mutation rate at each offset.

Offset 0: Binary data, 0% mutation rate
Offset 1: Zeroed data, 0% mutation rate
Offset 2: Binary data, 81% mutation rate
Offset 3: Binary data, 89% mutation rate
Offset 4: Binary data, 0% mutation rate
Offset 5: Binary data, 0% mutation rate
Offset 6: Zeroed data, 0% mutation rate
Offset 7: ASCII data, 100% mutation rate

Using this information we can construct the structure of the format:

[ 1 byte ] [ 1 byte ] [ 2 bytes ] [ 2 bytes ] [ 1 byte ] [ 1 byte ]

The real format of an ICMP message:

[ 1 byte ] [ 1 byte ] [ 2 bytes ] [ 2 bytes ] [ 2 bytes ]

PI misidentified the last field because the last field of this ICMP message
is a 16-bit sequence number. We gathered only 100 packets, so the most
significant byte never changed as the field incremented.

It is therefore important to gather enough varied data: PI is only as good
as the data that is fed to it.
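
The analysis in Step 3 can be sketched in a few lines of Python. This is an
illustrative sketch only and is not part of PI: the function names
infer_field_widths and sequence_high_bytes are hypothetical, the grouping rule
(adjacent offsets belong to one field when they share a datatype and agree on
whether they ever change) is an assumption that happens to reproduce the
structure shown above, and the code is written for Python 3 rather than the
Python 2 used by the prototype.

```python
import struct
from itertools import groupby


def infer_field_widths(datatypes, mutation_rates):
    """Group adjacent offsets into fields (assumed heuristic, not PI's code).

    Two neighbouring offsets are treated as one field when they have the
    same datatype letter and are both constant or both changing.
    """
    keys = [(dt, rate > 0) for dt, rate in zip(datatypes, mutation_rates)]
    return [len(list(run)) for _, run in groupby(keys)]


def sequence_high_bytes(packet_count):
    """Distinct values of the high byte of a big-endian 16-bit counter."""
    return {struct.pack(">H", n)[0] for n in range(packet_count)}


# DT and MT rows copied from the ungapped consensus above.
datatypes = ["B", "Z", "B", "B", "B", "B", "Z", "A"]
mutation_rates = [0, 0, 81, 89, 0, 0, 0, 100]

print(infer_field_widths(datatypes, mutation_rates))  # [1, 1, 2, 2, 1, 1]

# With 100 packets the high byte of the sequence number is always zero,
# so it looks like a separate constant field. More than 256 packets are
# needed before it changes at all.
print(sequence_high_bytes(100))  # {0}
print(sequence_high_bytes(300))  # {0, 1}
```

The last two lines show why the 2-byte sequence number was split into two
1-byte fields: a counter that only runs from 0 to 99 never touches its upper
byte, so a larger capture is needed for that byte to show any mutation.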

--------------------------------------------------------------------------------
/docs/PI_Toorcon.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/unmarshal/protocol-informatics/fcae1cd7cb71cd1914f9bc9fb9dcecdd0a6cfa86/docs/PI_Toorcon.pdf

--------------------------------------------------------------------------------
/docs/pi.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/unmarshal/protocol-informatics/fcae1cd7cb71cd1914f9bc9fb9dcecdd0a6cfa86/docs/pi.pdf

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
#!/usr/bin/python -u

#
# Protocol Informatics Prototype
# Written by Marshall Beddoe
# Copyright (c) 2004 Baseline Research
#
# Licensed under the LGPL
#

from PI import *
import sys, getopt

def main():

    print "Protocol Informatics Prototype (v0.01 beta)"
    print "Written by Marshall Beddoe "
    print "Copyright (c) 2004 Baseline Research\n"

    # Defaults
    format = None
    weight = 1.0
    graph = False

    #
    # Parse command line options and do sanity checking on arguments
    #
    try:
        (opts, args) = getopt.getopt(sys.argv[1:], "pagw:")
    except getopt.GetoptError:
        # Unknown option or missing argument to -w
        usage()

    for o,a in opts:
        if o in ["-p"]:
            format = "pcap"
        elif o in ["-a"]:
            format = "ascii"
        elif o in ["-w"]:
            weight = float(a)
        elif o in ["-g"]:
            graph = True
        else:
            usage()

    if len(args) == 0:
        usage()

    if weight < 0.0 or weight > 1.0:
        print "FATAL: Weight must be between 0 and 1"
        sys.exit(-1)

    # The sequence file is the last command line argument
    file = sys.argv[-1]

    #
    # Open file and get sequences
    #
    if format == "pcap":
        try:
            sequences = input.Pcap(file)
        except:
            print "FATAL: Error opening '%s'" % file
            sys.exit(-1)
    elif format == "ascii":
        try:
            sequences = input.ASCII(file)
        except:
            print "FATAL: Error opening '%s'" % file
            sys.exit(-1)
    else:
        print "FATAL: Specify file format"
        sys.exit(-1)

    if len(sequences) == 0:
        print "FATAL: No sequences found in '%s'" % file
        sys.exit(-1)
    else:
        print "Found %d unique sequences in '%s'" % (len(sequences), file)

    #
    # Create distance matrix (LocalAlignment, PairwiseIdentity, Entropic)
    #
    print "Creating distance matrix ..",
    dmx = distance.LocalAlignment(sequences)
    print "complete"

    #
    # Pass distance matrix to phylogenetic creation function
    #
    print "Creating phylogenetic tree ..",
    phylo = phylogeny.UPGMA(sequences, dmx, minval=weight)
    print "complete"

    #
    # Output some pretty graphs of each cluster
    #
    if graph:
        cnum = 1
        for cluster in phylo:
            out = "graph-%d" % cnum
            print "Creating %s .." % out,
            cluster.graph(out)
            print "complete"
            cnum += 1

    print "\nDiscovered %d clusters using a weight of %.02f" % \
        (len(phylo), weight)

    #
    # Perform progressive multiple alignment against clusters
    #
    i = 1
    alist = []
    for cluster in phylo:
        print "Performing multiple alignment on cluster %d .." % i,
        aligned = multialign.NeedlemanWunsch(cluster)
        print "complete"
        alist.append(aligned)
        i += 1
    print ""

    #
    # Display each cluster of aligned sequences
    #
    i = 1
    for seqs in alist:
        print "Output of cluster %d" % i
        output.Ansi(seqs)
        i += 1
        print ""

def usage():
    print "usage: %s [-gpa] [-w ] " % \
        sys.argv[0]
    print " -g\toutput graphviz of phylogenetic trees"
    print " -p\tpcap format"
    print " -a\tascii format"
    print " -w\tdifference weight for clustering"
    sys.exit(-1)

if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from distutils.core import setup
from distutils.extension import Extension

import sys, os.path

# Use Pyrex
#import distutils.sysconfig
#if distutils.sysconfig.get_config_var('CC').startswith("gcc"):
#    pyrex_compile_options = []
#else:
pyrex_compile_options = []

if sys.platform == "win32" and len(sys.argv) < 2:
    sys.argv[1:] = ["bdist_wininst"]

# Compiling Pyrex modules to .c and .so
try:
    import Pyrex.Distutils
except ImportError:
    # Pyrex is not installed: build from the pre-generated .c file
    distutils_extras = {}
    pyrex_suffix = ".c"
else:
    class pyrex_build_ext(Pyrex.Distutils.build_ext):
        def pyrex_compile(self, source):
            from Pyrex.Compiler.Main import CompilationOptions, default_options
            options = CompilationOptions(default_options)
            result = Pyrex.Compiler.Main.compile(source, options)
            if result.num_errors != 0:
                sys.exit(1)
    distutils_extras = {
        "cmdclass": {
            'build_ext': pyrex_build_ext}}
    pyrex_suffix = ".pyx"

def PIExtension(module_name):
    path = module_name.replace('.', '/')
    return Extension(module_name, [path + pyrex_suffix],
                     extra_compile_args = pyrex_compile_options)

setup(
    name = "PI",
    version = "0.01",
    url = "http://www.baselineresearch.net/PI",
    author_email = "mbeddoe@baselineresearch.net",
    description = "Protocol analysis toolkit using bioinformatics algorithms",
    packages = ["PI"],
    ext_modules=[
        PIExtension("PI.align")],**distutils_extras
)

--------------------------------------------------------------------------------