├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── .travis.yml ├── AUTHORS ├── CHANGELOG ├── CITATION.cff ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Dockerfile ├── LICENSE ├── MANIFEST.in ├── README.md ├── VERSION ├── example_data ├── ERS654932_plasmids.fastq.gz └── expected_output ├── homopolymer_compression.c ├── homopolymer_compression.pyx ├── paper.bib ├── paper.md ├── scripts ├── tiptoft └── tiptoft_database_downloader ├── setup.py └── tiptoft ├── Blocks.py ├── Fasta.py ├── Fastq.py ├── Gene.py ├── InputTypes.py ├── Kmers.py ├── Read.py ├── RefGenesGetter.py ├── TipToft.py ├── TipToftDatabaseDownloader.py ├── __init__.py ├── data ├── 20180928.txt ├── plasmid_data.fa └── plasmid_data.tsv └── tests ├── Blocks_test.py ├── Fasta_test.py ├── Fastq_test.py ├── Gene_test.py ├── Kmers_test.py ├── Read_test.py └── data ├── fasta └── sample1.fa ├── fastq ├── expected_outputfile ├── plasmid_data.fa ├── query.fastq ├── query_gz.fastq.gz ├── reverse.fastq └── sample1.fa └── read └── sample.fastq /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. 
iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | 83 | # dotenv 84 | .env 85 | 86 | # virtualenv 87 | 88 | .venv 89 | venv/ 90 | ENV/ 91 | 92 | # Spyder project settings 93 | .spyderproject 94 | 95 | .spyproject 96 | 97 | 98 | # Rope project settings 99 | .ropeproject 100 | 101 | 102 | .DS_Store 103 | 104 | # mkdocs documentation 105 | /site 106 | 107 | # mypy 108 | .mypy_cache/ 109 | 110 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.4" 4 | - "3.5" 5 | - "3.6" 6 | sudo: false 7 | 8 | addons: 9 | apt: 10 | packages: 11 | - cython 12 | cache: 13 | directories: 14 | - "build" 15 | - "$HOME/.cache/pip" 16 | install: 17 | - pip3 install cython 18 | - python3 setup.py install 19 | before_script: 20 | - pip3 install codecov 21 | script: 22 | - coverage run setup.py test 23 | after_success: 24 | - codecov 25 | -------------------------------------------------------------------------------- /AUTHORS: -------------------------------------------------------------------------------- 1 | Andrew J. 
Page 2 | Torsten Seemann 3 | -------------------------------------------------------------------------------- /CHANGELOG: -------------------------------------------------------------------------------- 1 | v0.1.0 - 26/08/18 2 | ------ 3 | Adjust weight of repeated kmers. 4 | 5 | v0.0.6 - 12/04/18 6 | ------ 7 | Move cython code 8 | 9 | v0.0.5 - 11/04/18 10 | ------ 11 | Filter out partial hits with same prefix 12 | 13 | v0.0.4 - 10/04/18 14 | ------ 15 | Fix bug in one X kmers 16 | 17 | v0.0.3 - 6/04/18 18 | ------ 19 | Homopolymer compression of k-mers (TS) 20 | Filter out k-mers occurring more than 10 times instead of 2 21 | Add Cython for Homopolymer compression speedup 22 | 23 | v0.0.2 - 5/04/18 24 | ------ 25 | Output full reads with plasmid gene. 26 | Output to a file actually works. 27 | Use bundled database if none provided. 28 | DB downloader needs a prefix to work. 29 | Grouping of input options. 30 | Documentation 31 | 32 | v0.0.1 - 3/04/18 33 | ------ 34 | Initial version. 35 | -------------------------------------------------------------------------------- /CITATION.cff: -------------------------------------------------------------------------------- 1 | # YAML 1.2 2 | --- 3 | abstract: "In the work presented here, we designed and developed two easy-to-use Web tools for in silico detection and characterization of whole-genome sequence (WGS) and whole-plasmid sequence data from members of the family Enterobacteriaceae. These tools will facilitate bacterial typing based on draft genomes of multidrug-resistant Enterobacteriaceae species by the rapid detection of known plasmid types. Replicon sequences from 559 fully sequenced plasmids associated with the family Enterobacteriaceae in the NCBI nucleotide database were collected to build a consensus database for integration into a Web tool called PlasmidFinder that can be used for replicon sequence analysis of raw, contig group, or completely assembled and closed plasmid sequencing data. 
The PlasmidFinder database currently consists of 116 replicon sequences that match with at least at 80% nucleotide identity all replicon sequences identified in the 559 fully sequenced plasmids. For plasmid multilocus sequence typing (pMLST) analysis, a database that is updated weekly was generated from www.pubmlst.org and integrated into a Web tool called pMLST. Both databases were evaluated using draft genomes from a collection of Salmonella enterica serovar Typhimurium isolates. PlasmidFinder identified a total of 103 replicons and between zero and five different plasmid replicons within each of 49 S. Typhimurium draft genomes tested. The pMLST Web tool was able to subtype genomic sequencing data of plasmids, revealing both known plasmid sequence types (STs) and new alleles and ST variants. In conclusion, testing of the two Web tools using both fully assembled plasmid sequences and WGS-generated draft genomes showed them to be able to detect a broad variety of plasmids that are often associated with antimicrobial resistance in clinically relevant bacterial pathogens." 
4 | authors: 5 | - 6 | affiliation: "Department of Infectious, Parasitic and Immuno-Mediated Diseases, Istituto Superiore di Sanità, Rome, Italy" 7 | family-names: Carattoli 8 | given-names: Alessandra 9 | - 10 | affiliation: "Danish Technical University, National Food Institute, Division for Epidemiology and Microbial Genomics, Lyngby, Denmark" 11 | family-names: Zankari 12 | given-names: Ea 13 | - 14 | affiliation: "Department of Infectious, Parasitic and Immuno-Mediated Diseases, Istituto Superiore di Sanità, Rome, Italy" 15 | family-names: "García-Fernández" 16 | given-names: Aurora 17 | - 18 | affiliation: "Danish Technical University, Center for Biological Sequence Analysis, Department of Systems Biology, Lyngby, Denmark" 19 | family-names: "Voldby Larsen" 20 | given-names: Mette 21 | - 22 | affiliation: "Danish Technical University, Center for Biological Sequence Analysis, Department of Systems Biology, Lyngby, Denmark" 23 | family-names: Lund 24 | given-names: Ole 25 | - 26 | affiliation: "Department of Infectious, Parasitic and Immuno-Mediated Diseases, Istituto Superiore di Sanità, Rome, Italy" 27 | family-names: Villa 28 | given-names: Laura 29 | - 30 | affiliation: "Danish Technical University, National Food Institute, Division for Epidemiology and Microbial Genomics, Lyngby, Denmark" 31 | family-names: Aarestrup 32 | given-names: "Frank Møller" 33 | - 34 | affiliation: "Danish Technical University, National Food Institute, Division for Epidemiology and Microbial Genomics, Lyngby, Denmark" 35 | family-names: Hasman 36 | given-names: Henrik 37 | cff-version: "1.0.3" 38 | doi: "10.1128/AAC.02412-14" 39 | message: "Please remember to cite the plasmidFinder paper as their database makes this software work" 40 | repository-code: "https://bitbucket.org/genomicepidemiology/plasmidfinder" 41 | title: "In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing" 42 | ... 
-------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, sex characteristics, gender identity and expression, 9 | level of experience, education, socio-economic status, nationality, personal 10 | appearance, race, religion, or sexual identity and orientation. 11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 
39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the project team at andrew.page@quadram.ac.uk. All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 
67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | If you wish to fix a bug or add new features to the software we welcome Pull Requests. We use 2 | [GitHub Flow style development](https://guides.github.com/introduction/flow/). Please fork the repo, make the change, then submit a Pull Request against our master branch, with details about what the change is and what it fixes/adds. 3 | We will then review your changes and merge them, or provide feedback on enhancements. -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM debian:testing 2 | MAINTAINER andrew.page@quadram.ac.uk 3 | 4 | RUN apt-get update -qq && apt-get install -y git python3 python3-setuptools python3-biopython python3-pip 5 | RUN pip3 install cython 6 | 7 | RUN pip3 install git+https://github.com/andrewjpage/tiptoft.git 8 | 9 | WORKDIR /data -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 
7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 
43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". 
"Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 
117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 
146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 
180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 
216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 
246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. 
If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 
309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 
336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 
360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. 
If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 
428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. 
If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 
486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 
512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. 
If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 
578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 
613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | 635 | Copyright (C) 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see . 649 | 650 | Also add information on how to contact you by electronic and paper mail. 
651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | Copyright (C) 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | . 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | . 675 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | recursive-include cython * 2 | include VERSION 3 | include LICENSE -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TipToft 2 | Given some raw uncorrected long reads, such as those from PacBio or Oxford Nanopore, predict which plasmid should be present. 
Assemblies of long read data can often miss plasmids, particularly if they are very small or have a copy number which is too high/low when compared to the chromosome. This software gives you an indication of which plasmids to expect, flagging potential issues with an assembly. 3 | 4 | [![Build Status](https://travis-ci.org/andrewjpage/tiptoft.svg?branch=master)](https://travis-ci.org/andrewjpage/tiptoft) 5 | [![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-brightgreen.svg)](https://github.com/andrewjpage/tiptoft/blob/master/LICENSE) 6 | [![codecov](https://codecov.io/gh/andrewjpage/tiptoft/branch/master/graph/badge.svg)](https://codecov.io/gh/andrewjpage/tiptoft) 7 | [![Docker Build Status](https://img.shields.io/docker/build/andrewjpage/tiptoft.svg)](https://hub.docker.com/r/andrewjpage/tiptoft) 8 | [![Docker Pulls](https://img.shields.io/docker/pulls/andrewjpage/tiptoft.svg)](https://hub.docker.com/r/andrewjpage/tiptoft) 9 | 10 | # Paper 11 | [![DOI](http://joss.theoj.org/papers/10.21105/joss.01021/status.svg)](https://doi.org/10.21105/joss.01021) 12 | 13 | AJ Page, T Seemann (2019). TipToft: detecting plasmids contained in uncorrected long read sequencing data. Journal of Open Source Software, 4(35), 1021, https://doi.org/10.21105/joss.01021 14 | 15 | Please remember to cite the PlasmidFinder paper as their database makes this software work: 16 | 17 | Carattoli *et al*, *In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing*, **Antimicrob Agents Chemother.** 2014;58(7):3895–3903. [view](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4068535/) 18 | 19 | 20 | # Installation 21 | The only dependencies are Python 3 and a compiler (gcc, clang, ...), and this should work on Linux or OSX. Cython needs to be installed in advance.
Assuming you have Python 3.4+ and pip installed, just run: 22 | ``` 23 | pip3 install cython 24 | pip3 install tiptoft 25 | ``` 26 | 27 | or if you wish to install the latest development version: 28 | ``` 29 | pip3 install git+https://github.com/andrewjpage/tiptoft.git 30 | ``` 31 | 32 | ## Debian/Ubuntu (Trusty/Xenial) 33 | To install Python 3 and TipToft on Ubuntu run: 34 | ``` 35 | sudo apt-get update -qq 36 | sudo apt-get install -y git python3 python3-setuptools python3-biopython python3-pip 37 | pip3 install cython 38 | pip3 install tiptoft 39 | ``` 40 | 41 | ## Docker 42 | Install [Docker](https://www.docker.com/). There is a Docker container which is built automatically from the latest version of TipToft. To install it: 43 | 44 | ``` 45 | docker pull andrewjpage/tiptoft 46 | ``` 47 | 48 | To use it, run a command such as this (substituting your own filenames/directories), here using the example file in this repository: 49 | ``` 50 | docker run --rm -it -v /path/to/example_data:/example_data andrewjpage/tiptoft tiptoft /example_data/ERS654932_plasmids.fastq.gz 51 | ``` 52 | 53 | ## Homebrew 54 | Install [Brew](https://brew.sh/) for OSX or [LinuxBrew](http://linuxbrew.sh/) for Linux, then run: 55 | 56 | ``` 57 | brew install python # this is python v3 58 | pip3 install cython 59 | pip3 install tiptoft 60 | ``` 61 | ## Bioconda 62 | Install [Bioconda](http://bioconda.github.io/), then run: 63 | 64 | ``` 65 | conda install tiptoft 66 | ``` 67 | 68 | ## Windows 69 | It has been reported that the software works when using Ubuntu on Windows 10. This is not a supported platform as the authors don't use Windows, so use at your own risk. 70 | 71 | # Usage 72 | ## tiptoft_database_downloader script 73 | First of all you need a plasmid database from PlasmidFinder. There is a snapshot bundled with this repository for your convenience, or alternatively you can use the downloader script to get the latest data. You will need internet access for this step.
Please remember to cite the PlasmidFinder paper. 74 | 75 | ``` 76 | usage: tiptoft_database_downloader [options] output_prefix 77 | 78 | Download PlasmidFinder database 79 | 80 | positional arguments: 81 | output_prefix Output prefix 82 | 83 | optional arguments: 84 | -h, --help show this help message and exit 85 | --verbose, -v Turn on debugging (default: False) 86 | --version show program's version number and exit 87 | ``` 88 | 89 | Just run: 90 | ``` 91 | tiptoft_database_downloader 92 | ``` 93 | You will now have a file called 'plasmid_files.fa' which can be used with the main script. 94 | 95 | ## tiptoft script 96 | This is the main script of the application. The only mandatory input is a FASTQ file of long reads, which can optionally be gzipped. 97 | ``` 98 | usage: tiptoft [options] input.fastq 99 | 100 | plasmid incompatibility group prediction from uncorrected long reads 101 | 102 | positional arguments: 103 | input_fastq Input FASTQ file (optionally gzipped) 104 | 105 | optional arguments: 106 | -h, --help show this help message and exit 107 | 108 | Optional input arguments: 109 | --plasmid_data PLASMID_DATA, -d PLASMID_DATA 110 | FASTA file containing plasmid data from downloader 111 | script, defaults to bundled database (default: None) 112 | --kmer KMER, -k KMER k-mer size (default: 13) 113 | 114 | Optional output arguments: 115 | --filtered_reads_file FILTERED_READS_FILE, -f FILTERED_READS_FILE 116 | Filename to save matching reads to (default: None) 117 | --output_file OUTPUT_FILE, -o OUTPUT_FILE 118 | Output file [STDOUT] (default: None) 119 | --print_interval PRINT_INTERVAL, -p PRINT_INTERVAL 120 | Print results every this number of reads (default: 121 | None) 122 | --verbose, -v Turn on debugging [False] 123 | --version show program's version number and exit 124 | 125 | Optional advanced input arguments: 126 | --max_gap MAX_GAP Maximum gap for blocks to be contigous, measured in 127 | multiples of the k-mer size (default: 3) 128 | --margin MARGIN
Flanking region around a block to use for mapping 129 | (default: 10) 130 | --min_block_size MIN_BLOCK_SIZE 131 | Minimum block size in bases (default: 130) 132 | --min_fasta_hits MIN_FASTA_HITS, -m MIN_FASTA_HITS 133 | Minimum No. of kmers matching a read (default: 10) 134 | --min_perc_coverage MIN_PERC_COVERAGE, -c MIN_PERC_COVERAGE 135 | Minimum percentage coverage of typing sequence to 136 | report (default: 85) 137 | --min_kmers_for_onex_pass MIN_KMERS_FOR_ONEX_PASS 138 | Minimum No. of kmers matching a read in 1st pass 139 | (default: 10) 140 | ``` 141 | 142 | ### Required argument 143 | 144 | __input_fastq__: This is a single FASTQ file. It can be optionally gzipped. Alternatively, input can be read from stdin by using the dash character (-) as the input file name. The file must contain long reads, such as those from PacBio or Oxford Nanopore. The quality scores are ignored. 145 | 146 | ### Optional input arguments 147 | 148 | __plasmid_data__: This is a FASTA file containing all of the plasmid typing sequences. It is generated by the tiptoft_database_downloader script. The data comes from the PlasmidFinder website, so please be sure to cite their paper (the citation gets printed every time you run the script). 149 | 150 | __kmer__: The most important parameter. 13 works well for Nanopore, 15 works well for PacBio, but you may need to experiment with it for your data. Long reads have a high error rate, so if you set this too high, nothing will match (because almost every k-mer will contain an error). If you set it too low, everything will match, which isn't much use to you. Thinking about your data, on average how long a stretch of bases can you get in your read without errors? This is what you should set your kmer to. For example, if you have an average of 1 error every 10 bases, then the ideal kmer would be 9. 151 | 152 | ### Optional output arguments 153 | 154 | __filtered_reads_file__: Save the reads which contain the rep/inc sequences to a new FASTQ file.
This is useful if you want to undertake a further assembly just on the plasmids. This file should not already exist. 155 | 156 | __output_file__: By default the results are printed to STDOUT. If you provide an output filename (which must not exist already), it will print the results to the file. 157 | 158 | __print_interval__: By default the whole file is processed and the final results are printed out. However, you can get intermediate results printed after every X reads, which is useful if you are streaming data into the application in real time and can halt when you have enough information. The intermediate results are separated by "****". 159 | 160 | __verbose__: Enable debugging mode where lots of extra output is printed to STDOUT. 161 | 162 | __version__: Print the version number and exit. 163 | 164 | 165 | ### Optional advanced input arguments 166 | 167 | __max_gap__: Maximum gap for blocks to be contiguous, measured in multiples of the k-mer size. This allows for short regions of elevated errors in the reads to be spanned. 168 | 169 | __margin__: Expand the analysis to look at a few bases on either side of where the sequence is predicted to be on the read. This allows for k-mers to overlap the ends. 170 | 171 | __min_block_size__: This is the minimum sub read size of a read to consider for in-depth analysis after matching k-mers have been identified in the read. This speeds up the analysis quite a bit, but there is the risk that some reads may be missed, particularly if they have partial rep/inc sequences. 172 | 173 | __min_fasta_hits__: This is the minimum number of matching k-mers in a read, for the read to be considered for analysis. It is a hard minimum threshold to speed up analysis. 174 | 175 | __min_perc_coverage__: Only report rep/inc sequences above this percentage coverage. Coverage in this instance is k-mer coverage of the underlying sequence (rather than depth of coverage).
 176 | 177 | __min_kmers_for_onex_pass__: The number of k-mers that must be present in the read for the initial onex pass of the database to be considered for further analysis. This speeds up the analysis quite a bit, but there is the risk that some reads may be missed, particularly if they have partial rep/inc sequences. 178 | 179 | # Output 180 | The output is tab delimited and printed to STDOUT by default. You can optionally print it to a file using the '-o' parameter. If you would like to see intermediate results, you can tell it to print every X reads with the '-p' parameter, separated by '****'. An example of the output is: 181 | 182 | ``` 183 | GENE COMPLETENESS %COVERAGE ACCESSION DATABASE PRODUCT 184 | rep7.1 Full 100 AB037671 plasmidfinder rep7.1_repC(Cassette)_AB037671 185 | rep7.5 Partial 99 AF378372 plasmidfinder rep7.5_CDS1(pKC5b)_AF378372 186 | rep7.6 Partial 94 SAU38656 plasmidfinder rep7.6_ORF(pKH1)_SAU38656 187 | rep7.9 Full 100 NC007791 plasmidfinder rep7.9_CDS3(pUSA02)_NC007791 188 | rep7.10 Partial 91 NC_010284.1 plasmidfinder rep7.10_repC(pKH17)_NC_010284.1 189 | rep7.12 Partial 93 GQ900417.1 plasmidfinder rep7.12_rep(SAP060B)_GQ900417.1 190 | rep7.17 Full 100 AM990993.1 plasmidfinder rep7.17_repC(pS0385-1)_AM990993.1 191 | rep20.11 Full 100 AP003367 plasmidfinder rep20.11_repA(VRSAp)_AP003367 192 | repUS14. Full 100 AP003367 plasmidfinder repUS14._repA(VRSAp)_AP003367 193 | ``` 194 | 195 | __GENE__: The first column is the first part of the product name. 196 | 197 | __COMPLETENESS__: If all of the k-mers in the gene are found in the reads, the completeness is noted as 'Full', otherwise if there are some k-mers missing, it is noted as 'Partial'. 198 | 199 | __%COVERAGE__: The percentage coverage is the percentage of underlying k-mers in the gene for which at least 1 matching k-mer has been found in the reads. 100 indicates that every k-mer in the gene is covered. Low coverage results are not shown (controlled by the --min_perc_coverage parameter).
200 | 201 | __ACCESSION__: This is the accession number of the sequence from which the typing sequence originates. You can look this up at NCBI or EBI. 202 | 203 | __DATABASE__: This is where the data has come from, which is currently always plasmidfinder. 204 | 205 | __PRODUCT__: This is the full product of the gene as found in the database. 206 | 207 | # Example usage 208 | A real [test file](https://github.com/andrewjpage/tiptoft/raw/master/example_data/ERS654932_plasmids.fastq.gz) is bundled in the repository. Download it, then run: 209 | 210 | ``` 211 | tiptoft ERS654932_plasmids.fastq.gz 212 | ``` 213 | 214 | The [expected output](https://raw.githubusercontent.com/andrewjpage/tiptoft/master/example_data/expected_output) is in the repository. This uses a bundled database; if you wish to use the most up-to-date database, you should run the tiptoft_database_downloader script. 215 | 216 | # Resource usage 217 | Processing an 800 MB FASTQ file (unzipped) of long reads from an Oxford Nanopore MinION, containing Salmonella, required 80 MB of RAM and took under 1 minute. 218 | 219 | ## License 220 | TipToft is free software, licensed under [GPLv3](https://github.com/andrewjpage/tiptoft/blob/master/GPL-LICENSE). 221 | 222 | ## Feedback/Issues 223 | Please report any issues to the [issues page](https://github.com/andrewjpage/tiptoft/issues). 224 | 225 | ## Contribute to the software 226 | If you wish to fix a bug or add new features to the software, we welcome Pull Requests. We use 227 | [GitHub Flow style development](https://guides.github.com/introduction/flow/). Please fork the repo, make the change, then submit a Pull Request against our master branch, with details about what the change is and what it fixes/adds. 228 | We will then review your changes and merge them, or provide feedback on enhancements.
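As a rough illustration of the tab-delimited format described in the Output section, the results can be loaded into Python for downstream filtering. This is only a minimal sketch: the function name `parse_tiptoft_output` is hypothetical and is not part of TipToft, and it assumes result rows always have exactly the six documented columns. Anything else (the header, the `****` separators, or any other line that does not have six fields) is skipped.

```python
# Hypothetical helper, not part of TipToft: parse its tab-delimited results.
COLUMNS = ["GENE", "COMPLETENESS", "%COVERAGE",
           "ACCESSION", "DATABASE", "PRODUCT"]


def parse_tiptoft_output(text):
    """Return a list of dicts, one per result row found in `text`."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip the header row and any line that is not a six-column result,
        # such as the '****' separators printed between intermediate results.
        if len(fields) != len(COLUMNS) or fields[0] == "GENE":
            continue
        row = dict(zip(COLUMNS, fields))
        row["%COVERAGE"] = int(row["%COVERAGE"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = (
        "GENE\tCOMPLETENESS\t%COVERAGE\tACCESSION\tDATABASE\tPRODUCT\n"
        "rep7.1\tFull\t100\tAB037671\tplasmidfinder\t"
        "rep7.1_repC(Cassette)_AB037671\n"
        "****\n"
    )
    for result in parse_tiptoft_output(example):
        print(result["GENE"], result["COMPLETENESS"], result["%COVERAGE"])
```

When intermediate results are requested with '-p', the same gene can appear in several blocks, so a caller would probably want to keep only the rows after the last `****` separator or de-duplicate by GENE.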
229 | 230 | -------------------------------------------------------------------------------- /VERSION: -------------------------------------------------------------------------------- 1 | 1.0.2 2 | -------------------------------------------------------------------------------- /example_data/ERS654932_plasmids.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewjpage/tiptoft/e71b002bef09f97d6881fbc69bedc7ad65fe6b48/example_data/ERS654932_plasmids.fastq.gz -------------------------------------------------------------------------------- /example_data/expected_output: -------------------------------------------------------------------------------- 1 | GENE COMPLETENESS %COVERAGE ACCESSION DATABASE PRODUCT 2 | rep7.1 Full 100 AB037671 plasmidfinder rep7.1_repC(Cassette)_AB037671 3 | rep7.5 Full 100 AF378372 plasmidfinder rep7.5_CDS1(pKC5b)_AF378372 4 | rep7.9 Full 100 NC007791 plasmidfinder rep7.9_CDS3(pUSA02)_NC007791 5 | rep7.17 Full 100 AM990993.1 plasmidfinder rep7.17_repC(pS0385-1)_AM990993.1 6 | repUS14. 
Full 100 AP003367 plasmidfinder repUS14._repA(VRSAp)_AP003367 -------------------------------------------------------------------------------- /homopolymer_compression.pyx: -------------------------------------------------------------------------------- 1 | # Run it with Cython 2 | def homopolymer_compression_of_sequence(sequence): 3 | previous_base = '' 4 | compressed_sequence = [] 5 | for base in sequence: 6 | if base != previous_base: 7 | previous_base = base 8 | compressed_sequence.append(base) 9 | return ''.join(compressed_sequence) -------------------------------------------------------------------------------- /paper.bib: -------------------------------------------------------------------------------- 1 | @article{Faria2017, 2 | doi = {10.1038/nature22401}, 3 | url = {https://doi.org/10.1038/nature22401}, 4 | year = {2017}, 5 | month = {may}, 6 | publisher = {Springer Nature}, 7 | volume = {546}, 8 | number = {7658}, 9 | pages = {406--410}, 10 | author = {N. R. Faria and J. Quick and I.M. Claro and J. Th{\'{e}}z{\'{e}} and J. G. de Jesus and M. Giovanetti and M. U. G. Kraemer and S. C. Hill and A. Black and A. C. da Costa and L. C. Franco and S. P. Silva and C.-H. Wu and J. Raghwani and S. Cauchemez and L. du Plessis and M. P. Verotti and W. K. de Oliveira and E. H. Carmo and G. E. Coelho and A. C. F. S. Santelli and L. C. Vinhal and C. M. Henriques and J. T. Simpson and M. Loose and K. G. Andersen and N. D. Grubaugh and S. Somasekar and C. Y. Chiu and J. E. Mu{\~{n}}oz-Medina and C. R. Gonzalez-Bonilla and C. F. Arias and L. L. Lewis-Ximenez and S. A. Baylis and A. O. Chieppe and S. F. Aguiar and C. A. Fernandes and P. S. Lemos and B. L. S. Nascimento and H. A. O. Monteiro and I. C. Siqueira and M. G. de Queiroz and T. R. de Souza and J. F. Bezerra and M. R. Lemos and G. F. Pereira and D. Loudal and L. C. Moura and R. Dhalia and R. F. Fran{\c{c}}a and T. Magalh{\~{a}}es and E. T. Marques and T. Jaenisch and G. L. Wallau and M. C. de Lima and V. 
Nascimento and E. M. de Cerqueira and M. M. de Lima and D. L. Mascarenhas and J. P. Moura Neto and A. S. Levin and T. R. Tozetto-Mendoza and S. N. Fonseca and M. C. Mendes-Correa and F. P. Milagres and A. Segurado and E. C. Holmes and A. Rambaut and T. Bedford and M. R. T. Nunes and E. C. Sabino and L. C. J. Alcantara and N. J. Loman and O. G. Pybus}, 11 | title = {Establishment and cryptic transmission of Zika virus in Brazil and the Americas}, 12 | journal = {Nature} 13 | } 14 | @article{Quick2015, 15 | doi = {10.1186/s13059-015-0677-2}, 16 | url = {https://doi.org/10.1186/s13059-015-0677-2}, 17 | year = {2015}, 18 | month = {may}, 19 | publisher = {Springer Nature}, 20 | volume = {16}, 21 | number = {1}, 22 | author = {Joshua Quick and Philip Ashton and Szymon Calus and Carole Chatt and Savita Gossain and Jeremy Hawker and Satheesh Nair and Keith Neal and Kathy Nye and Tansy Peters and Elizabeth De Pinna and Esther Robinson and Keith Struthers and Mark Webber and Andrew Catto and Timothy J. Dallman and Peter Hawkey and Nicholas J. Loman}, 23 | title = {Rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of Salmonella}, 24 | journal = {Genome Biology} 25 | } 26 | @article{Votintseva2017, 27 | doi = {10.1128/jcm.02483-16}, 28 | url = {https://doi.org/10.1128/jcm.02483-16}, 29 | year = {2017}, 30 | month = {mar}, 31 | publisher = {American Society for Microbiology}, 32 | volume = {55}, 33 | number = {5}, 34 | pages = {1285--1298}, 35 | author = {Antonina A. Votintseva and Phelim Bradley and Louise Pankhurst and Carlos del Ojo Elias and Matthew Loose and Kayzad Nilgiriwala and Anirvan Chatterjee and E. Grace Smith and Nicolas Sanderson and Timothy M. Walker and Marcus R. Morgan and David H. Wyllie and A. Sarah Walker and Tim E. A. Peto and Derrick W. 
Crook and Zamin Iqbal}, 36 | editor = {Yi-Wei Tang}, 37 | title = {Same-Day Diagnostic and Surveillance Data for Tuberculosis via Whole-Genome Sequencing of Direct Respiratory Samples}, 38 | journal = {Journal of Clinical Microbiology} 39 | } 40 | @article{Gardy2017, 41 | doi = {10.1038/nrg.2017.88}, 42 | url = {https://doi.org/10.1038/nrg.2017.88}, 43 | year = {2017}, 44 | month = {nov}, 45 | publisher = {Springer Nature}, 46 | volume = {19}, 47 | number = {1}, 48 | pages = {9--20}, 49 | author = {Jennifer L. Gardy and Nicholas J. Loman}, 50 | title = {Towards a genomics-informed, real-time, global pathogen surveillance system}, 51 | journal = {Nature Reviews Genetics} 52 | } 53 | @article{Koren2017, 54 | doi = {10.1101/gr.215087.116}, 55 | url = {https://doi.org/10.1101/gr.215087.116}, 56 | year = {2017}, 57 | month = {mar}, 58 | publisher = {Cold Spring Harbor Laboratory}, 59 | volume = {27}, 60 | number = {5}, 61 | pages = {722--736}, 62 | author = {Sergey Koren and Brian P. Walenz and Konstantin Berlin and Jason R. Miller and Nicholas H. Bergman and Adam M. 
Phillippy}, 63 | title = {Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation}, 64 | journal = {Genome Research} 65 | } 66 | @article{Carattoli2014, 67 | doi = {10.1128/aac.02412-14}, 68 | url = {https://doi.org/10.1128/aac.02412-14}, 69 | year = {2014}, 70 | month = {apr}, 71 | publisher = {American Society for Microbiology}, 72 | volume = {58}, 73 | number = {7}, 74 | pages = {3895--3903}, 75 | author = {Alessandra Carattoli and Ea Zankari and Aurora Garc{\'{\i}}a-Fern{\'{a}}ndez and Mette Voldby Larsen and Ole Lund and Laura Villa and Frank M{\o}ller Aarestrup and Henrik Hasman}, 76 | title = {{In Silico} Detection and Typing of Plasmids using {PlasmidFinder} and Plasmid Multilocus Sequence Typing}, 77 | journal = {Antimicrobial Agents and Chemotherapy} 78 | } -------------------------------------------------------------------------------- /paper.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'TipToft: detecting plasmids contained in uncorrected long read sequencing data' 3 | tags: 4 | - bioinformatics 5 | - plasmid typing 6 | - long read sequencing 7 | - bacteria 8 | authors: 9 | - name: Andrew J. Page 10 | orcid: 0000-0001-6919-6062 11 | affiliation: 1 12 | - name: Torsten Seemann 13 | orcid: 0000-0001-6046-610X 14 | affiliation: 2 15 | affiliations: 16 | - name: Quadram Institute Bioscience, Norwich Research Park, Norwich, UK. 17 | index: 1 18 | - name: Melbourne Bioinformatics, The University of Melbourne, Parkville, Australia. 19 | index: 2 20 | date: 1 October 2018 21 | bibliography: paper.bib 22 | --- 23 | 24 | # Summary 25 | With rapidly falling costs, long-read DNA sequencing technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are beginning to be used for outbreak investigations [@Faria2017; @Quick2015] and rapid infectious disease clinical diagnostics [@Votintseva2017].
ONT instruments can produce data within minutes, and PacBio instruments within hours, compared to short-read sequencing technologies which take hours or days. By reducing the time from swab to an actionable answer, genomics can begin to directly influence clinical decisions, with the potential for a positive impact for patients [@Gardy2017]. Clinically important genes, like those conferring antimicrobial resistance or encoding virulence factors, can be horizontally acquired from plasmids. With the increased speed afforded by long-read sequencing technologies come increased base error rates. The high error rates inherent in long-read sequencing reads require specialised tools to correct the reads [@Koren2017]. However, these methods require substantial computational resources, often take longer to run than the time originally needed to generate the sequencing data, and can result in the loss of small, clinically important plasmids. 26 | 27 | We present ``TipToft``, which uses raw uncorrected reads to predict which plasmids are present in the underlying raw data. This provides an independent method for validating the plasmid content of a *de novo* assembly. It is the only tool which can do this from uncorrected long reads. ``TipToft`` is fast and can accept streaming input data to provide results in real time. Plasmids are identified using replicon sequences used for typing from PlasmidFinder [@Carattoli2014]. We tested the software on 1975 samples (https://www.sanger.ac.uk/resources/downloads/bacteria/nctc/) sequenced using long-read sequencing technologies from PacBio, predicting plasmids from *de novo* assemblies using abricate (https://github.com/tseemann/abricate). It identified 84 samples containing plasmids with a 100% match to a plasmid sequence, but where no corresponding plasmid was present in the *de novo* assembly. Taking all the plasmids identified in the assemblies with a 100% match, ``TipToft`` identified 97% (n=326) of these, representing 95% (n=236) of the samples.
A higher depth of read coverage will increase the power to accurately identify plasmid sequences, the level of which depends on the underlying base error rate. For sequence data with 90% base accuracy, an approximate depth of read coverage of 5 is required to identify a plasmid replicon sequence with 99.5% confidence. The software is written in Python 3 and is available under the open source GNU GPLv3 licence from https://github.com/andrewjpage/tiptoft. 28 | 29 | # Acknowledgements 30 | This work was supported by the Quadram Institute Bioscience BBSRC funded Core Capability Grant (project number BB/CCG1860/1) and by the Wellcome Trust (grant WT 098051). 31 | 32 | # References 33 | -------------------------------------------------------------------------------- /scripts/tiptoft: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import sys 5 | import os 6 | import pkg_resources 7 | sys.path.append('../') 8 | sys.path.append('./') 9 | from tiptoft.TipToft import TipToft 10 | from tiptoft.InputTypes import InputTypes 11 | 12 | version = '' 13 | try: 14 | version = pkg_resources.get_distribution("tiptoft").version 15 | except pkg_resources.DistributionNotFound: 16 | version = 'x.y.z' 17 | 18 | parser = argparse.ArgumentParser( 19 | description = 'Plasmid replicon and incompatibility group prediction from uncorrected long reads', 20 | usage = 'tiptoft [options] input.fastq', formatter_class=argparse.ArgumentDefaultsHelpFormatter) 21 | 22 | parser.add_argument('input_fastq', help='Input FASTQ file (optionally gzipped)', type=InputTypes.is_fastq_file_valid) 23 | 24 | inputs = parser.add_argument_group('Optional input arguments') 25 | inputs.add_argument('--plasmid_data', '-d', help='FASTA file containing plasmid data from downloader script, defaults to bundled database', type=InputTypes.is_plasmid_file_valid ) 26 | inputs.add_argument('--kmer', '-k', help='k-mer size', 
type=InputTypes.is_kmer_valid, default = 13) 27 | 28 | outputs = parser.add_argument_group('Optional output arguments') 29 | outputs.add_argument('--filtered_reads_file', '-f', help='Filename to save matching reads to') 30 | outputs.add_argument('--output_file', '-o', help='Output file [STDOUT]') 31 | outputs.add_argument('--print_interval', '-p', help='Print results every this number of reads', type=int) 32 | outputs.add_argument('--verbose', '-v', action='store_true', help='Turn on debugging [%(default)s]', default = False) 33 | outputs.add_argument('--version', action='version', version=str(version)) 34 | 35 | advanced = parser.add_argument_group('Optional advanced input arguments') 36 | advanced.add_argument('--no_hc_compression', action='store_true', help='Turn off homopolymer compression of k-mers', default = False) 37 | advanced.add_argument('--no_gene_filter', action='store_true', help='Do not filter out lower coverage genes from same group', default = False) 38 | advanced.add_argument('--max_gap', help='Maximum gap for blocks to be contiguous, measured in multiples of the k-mer size', type=int, default=3) 39 | advanced.add_argument('--max_kmer_count', help='Exclude k-mers which occur more than this number of times in a sequence', type=int, default=10) 40 | advanced.add_argument('--margin', help='Flanking region around a block to use for mapping', type=int, default=10) 41 | advanced.add_argument('--min_block_size', help='Minimum block size in bases', type=int, default=50) 42 | advanced.add_argument('--min_fasta_hits', '-m', help='Minimum No. of kmers matching a read', type=int, default = 8) 43 | advanced.add_argument('--min_perc_coverage', '-c', help='Minimum percentage coverage of typing sequence to report', type=int, default = 85) 44 | advanced.add_argument('--min_kmers_for_onex_pass', help='Minimum No.
of kmers matching a read in 1st pass', type=int, default = 5) 45 | 46 | options = parser.parse_args() 47 | 48 | if options.verbose: 49 | 50 | import cProfile, pstats, io 51 | pr = cProfile.Profile() 52 | pr.enable() 53 | 54 | tiptoft = TipToft(options) 55 | tiptoft.run() 56 | 57 | pr.disable() 58 | s = io.StringIO() 59 | sortby = 'cumulative' 60 | ps = pstats.Stats(pr, stream=s).sort_stats(sortby) 61 | ps.print_stats() 62 | print(s.getvalue()) 63 | 64 | else: 65 | tiptoft = TipToft(options) 66 | tiptoft.run() 67 | -------------------------------------------------------------------------------- /scripts/tiptoft_database_downloader: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import sys 5 | import os 6 | import pkg_resources 7 | sys.path.append('../') 8 | sys.path.append('./') 9 | from tiptoft.TipToftDatabaseDownloader import TipToftDatabaseDownloader 10 | from tiptoft.InputTypes import InputTypes 11 | 12 | version = '' 13 | try: 14 | version = pkg_resources.get_distribution("tiptoft").version 15 | except pkg_resources.DistributionNotFound: 16 | version = 'x.y.z' 17 | 18 | parser = argparse.ArgumentParser( 19 | description = 'Download PlasmidFinder database', 20 | usage = 'tiptoft_database_downloader [options] output_prefix', 21 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 22 | 23 | parser.add_argument('output_prefix', help='Output prefix') 24 | parser.add_argument('--verbose', '-v', action='store_true', help='Turn on debugging') 25 | parser.add_argument('--version', action='version', version=str(version)) 26 | 27 | options = parser.parse_args() 28 | 29 | tiptoft = TipToftDatabaseDownloader(options) 30 | tiptoft.run() 31 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | from setuptools import setup, 
find_packages, Extension 4 | 5 | with open("README.md", encoding="utf-8") as fname: 6 | README = fname.read() 7 | 8 | 9 | version = 'x.y.z' 10 | if os.path.exists('VERSION'): 11 | version = open('VERSION').read().strip() 12 | 13 | USE_CYTHON = False 14 | ext = '.pyx' if USE_CYTHON else '.c' 15 | extensions = [Extension("homopolymer_compression", ["homopolymer_compression"+ext])] 16 | 17 | if USE_CYTHON: 18 | from Cython.Build import cythonize 19 | extensions = cythonize(extensions) 20 | 21 | setup( 22 | name='tiptoft', 23 | version=version, 24 | description='tiptoft: predict which plasmid should be present from uncorrected long read data', 25 | long_description=README, 26 | packages = find_packages(), 27 | author='Andrew J. Page', 28 | author_email='andrew.page@quadram.ac.uk', 29 | url='https://github.com/andrewjpage/tiptoft', 30 | scripts=glob.glob('scripts/*'), 31 | test_suite='nose.collector', 32 | tests_require=['nose >= 1.3'], 33 | install_requires=[ 34 | 'biopython >= 1.68', 35 | 'pyfastaq >= 3.12.0', 36 | 'cython' 37 | ], 38 | ext_modules = extensions, 39 | package_data={'tiptoft': ['data/*']}, 40 | license='GPLv3', 41 | classifiers=[ 42 | 'Development Status :: 4 - Beta', 43 | 'Intended Audience :: Science/Research', 44 | 'Topic :: Scientific/Engineering :: Bio-Informatics', 45 | 'Programming Language :: Python :: 3 :: Only', 46 | 'License :: OSI Approved :: GNU General Public License v3 (GPLv3)' 47 | ], 48 | ) 49 | -------------------------------------------------------------------------------- /tiptoft/Blocks.py: -------------------------------------------------------------------------------- 1 | '''Identify blocks which are likely to be genes of interest''' 2 | 3 | 4 | class Blocks: 5 | def __init__(self, k, min_block_size, max_gap, margin): 6 | self.k = k 7 | self.min_block_size = min_block_size 8 | self.max_gap = max_gap # multiples of the kmer 9 | self.margin = margin 10 | 11 | '''An array of pairs, with start and end coordinate of each block is 
passed 12 | in as input. If these blocks are close together, merge them.''' 13 | 14 | def merge_blocks(self, blocks): 15 | for i in range(0, len(blocks)-1): 16 | if blocks[i][1] + self.max_gap > blocks[i+1][0]: 17 | if blocks[i][1] < blocks[i+1][1]: 18 | blocks[i][1] = blocks[i+1][1] 19 | blocks[i+1][0] = blocks[i][0] 20 | blocks[i+1][1] = blocks[i][1] 21 | return blocks 22 | 23 | '''Given a set of blocks (start and end coordinates), return the largest 24 | block coordinates''' 25 | 26 | def find_largest_block(self, sequence_hits): 27 | blocks = self.find_all_blocks(sequence_hits) 28 | merged_blocks = self.merge_blocks(blocks) 29 | 30 | largest_block = 0 31 | lbi = 0 32 | 33 | for i, block in enumerate(merged_blocks): 34 | block_size = block[1] - block[0] 35 | if block_size > largest_block: 36 | lbi = i 37 | largest_block = block_size 38 | 39 | if largest_block < (self.min_block_size/self.k): 40 | return 0, 0 41 | 42 | return merged_blocks[lbi][0], merged_blocks[lbi][1] 43 | 44 | '''Given an array with kmer matches, find the coordinates of each 45 | contiguous segment''' 46 | 47 | def find_all_blocks(self, sequence_hits): 48 | blocks = [] 49 | in_block = False 50 | current_block_start = 0 51 | for i, val_count in enumerate(sequence_hits): 52 | 53 | if not in_block and val_count > 0: 54 | in_block = True 55 | current_block_start = i 56 | elif in_block and val_count == 0: 57 | in_block = False 58 | blocks.append([current_block_start, i]) 59 | 60 | if in_block: 61 | blocks.append([current_block_start, len(sequence_hits)]) 62 | 63 | return blocks 64 | 65 | '''Rescale the blocks, add a flanking margin and make sure the start is in 66 | bounds''' 67 | 68 | def adjust_block_start(self, block_start): 69 | block_start *= self.k 70 | if block_start - self.margin < 0: 71 | block_start = 0 72 | else: 73 | block_start -= self.margin 74 | return block_start 75 | 76 | '''Rescale the blocks, add a flanking margin and make sure the end is in 77 | bounds''' 78 | 79 | def 
adjust_block_end(self, block_end, seq_length): 80 | block_end *= self.k 81 | if block_end + self.margin > seq_length: 82 | block_end = seq_length 83 | else: 84 | block_end += self.margin 85 | return block_end 86 | -------------------------------------------------------------------------------- /tiptoft/Fasta.py: -------------------------------------------------------------------------------- 1 | '''Read in a FASTA file and extract all the k-mers''' 2 | from Bio import SeqIO 3 | from tiptoft.Kmers import Kmers 4 | 5 | 6 | class Fasta: 7 | def __init__( 8 | self, 9 | logger, 10 | filename, 11 | k, 12 | homopolyer_compression, 13 | max_kmer_count=10): 14 | self.logger = logger 15 | self.filename = filename 16 | self.k = k 17 | self.homopolyer_compression = homopolyer_compression 18 | self.max_kmer_count = max_kmer_count 19 | 20 | self.sequences_to_kmers = self.sequence_kmers('get_all_kmers_counter') 21 | self.sequences_to_kmers_count =\ 22 | self.sequence_kmers('get_all_kmers_freq') 23 | self.all_kmers = self.all_kmers_in_file() 24 | self.kmers_to_genes = self.all_kmers_to_seq_in_file() 25 | self.kmer_keys_set = set(self.all_kmers.keys()) 26 | 27 | '''Count the kmers in a sequence''' 28 | 29 | def sequence_kmers(self, kmer_action='get_all_kmers_counter'): 30 | seq_counter = 0 31 | kmer_to_sequences = {} 32 | for record in SeqIO.parse(self.filename, "fasta"): 33 | kmers = Kmers(str(record.seq), self.k, self.homopolyer_compression) 34 | # We assume here that the sequence name is unique in the FASTA file 35 | kmer_to_sequences[record.id] = getattr(kmers, kmer_action)( 36 | max_kmer_count=self.max_kmer_count) 37 | 38 | seq_counter += 1 39 | 40 | return kmer_to_sequences 41 | 42 | '''create a dictionary of kmers to underlying genes''' 43 | 44 | def all_kmers_to_seq_in_file(self): 45 | kmers_to_genes = {} 46 | for seq_name, kmer_counts in self.sequences_to_kmers.items(): 47 | for kmer, count in kmer_counts.items(): 48 | 49 | if kmer not in kmers_to_genes: 50 | 
kmers_to_genes[kmer] = [] 51 | kmers_to_genes[kmer].append(seq_name) 52 | 53 | return kmers_to_genes 54 | 55 | '''given a fasta file, extract all kmers''' 56 | 57 | def all_kmers_in_file(self): 58 | self.logger.info("Finding all k-mers in plasmid FASTA file") 59 | all_kmers = {} 60 | for seq_name, kmer_counts in self.sequences_to_kmers.items(): 61 | for kmer, count in kmer_counts.items(): 62 | if kmer in all_kmers: 63 | all_kmers[kmer] += 1 64 | else: 65 | all_kmers[kmer] = 1 66 | 67 | return all_kmers 68 | -------------------------------------------------------------------------------- /tiptoft/Fastq.py: -------------------------------------------------------------------------------- 1 | '''Read in a FASTQ file and identify matching plasmids''' 2 | from tiptoft.Kmers import Kmers 3 | from tiptoft.Read import Read 4 | from tiptoft.Gene import Gene 5 | from tiptoft.Blocks import Blocks 6 | import subprocess 7 | import os 8 | import numpy 9 | import sys 10 | 11 | 12 | class Error (Exception): 13 | pass 14 | 15 | 16 | class Fastq: 17 | def __init__( 18 | self, 19 | logger, 20 | filename, 21 | k, 22 | fasta_kmers, 23 | min_fasta_hits, 24 | print_interval, 25 | output_file, 26 | filtered_reads_file, 27 | fasta_obj, 28 | homopolyer_compression, 29 | max_gap=4, 30 | min_block_size=150, 31 | margin=100, 32 | start_time=0, 33 | min_kmers_for_onex_pass=10, 34 | min_perc_coverage=95, 35 | max_kmer_count=10, 36 | no_gene_filter=False): 37 | self.logger = logger 38 | self.filename = filename 39 | self.k = k 40 | self.fasta_kmers = fasta_kmers 41 | self.min_fasta_hits = min_fasta_hits 42 | self.print_interval = print_interval 43 | self.output_file = output_file 44 | self.filtered_reads_file = filtered_reads_file 45 | self.max_gap = max_gap # multiples of the kmer 46 | self.min_block_size = min_block_size 47 | self.margin = margin 48 | self.start_time = start_time 49 | self.min_kmers_for_onex_pass = min_kmers_for_onex_pass 50 | self.fasta_obj = fasta_obj 51 | 
self.min_perc_coverage = min_perc_coverage 52 | self.genes_with_100_percent = {} 53 | self.homopolyer_compression = homopolyer_compression 54 | self.max_kmer_count = max_kmer_count 55 | self.no_gene_filter = no_gene_filter 56 | 57 | '''Read in a whole FASTQ file, do a quick pass filter, then map more 58 | sensitively''' 59 | 60 | def read_filter_and_map(self): 61 | counter = 0 62 | match_counter = 0 63 | f = 0 64 | r = 0 65 | 66 | self.logger.info("Reading in FASTQ file") 67 | self.print_out_header() 68 | fh = self.open_file_read() 69 | read = Read() 70 | 71 | while read.get_next_from_file(fh): 72 | counter += 1 73 | 74 | if(self.print_interval is not None and 75 | counter % self.print_interval == 0): 76 | self.full_gene_coverage(counter) 77 | 78 | if self.map_read(read): 79 | match_counter += 1 80 | f += 1 81 | elif self.map_read(read.reverse_read()): 82 | match_counter += 1 83 | r += 1 84 | fh.close() 85 | 86 | alleles = self.full_gene_coverage(counter) 87 | self.logger.info("Number of reads: "+str(counter)) 88 | self.logger.info("Number of matching reads: " + 89 | str(match_counter)+"\t"+str(f)+"\t"+str(r)) 90 | 91 | print("\t".join([self.filename, str(len(alleles)), str(match_counter), str(counter), str(match_counter*100/counter)])) 92 | 93 | return self 94 | 95 | '''Take a single read, do a quick kmer check to see if it matches any gene, 96 | then map more sensitively''' 97 | 98 | def map_read(self, read): 99 | candidate_gene_names = self.does_read_contain_quick_pass_kmers( 100 | read.seq) 101 | if len(candidate_gene_names) > 0: 102 | self.logger.info("Read passes 1X check") 103 | return self.map_kmers_to_read(read.seq, read, candidate_gene_names) 104 | else: 105 | return False 106 | 107 | '''Taking kmers at 1x, see do any match the gene kmers''' 108 | 109 | def does_read_contain_quick_pass_kmers(self, sequence): 110 | self.logger.info("Perform quick pass k-mer check on read") 111 | seq_length = len(sequence) 112 | if seq_length < self.min_block_size: 113 | 
self.logger.info("Read below minimum size") 114 | return {} 115 | 116 | kmers_obj = Kmers(sequence, self.k, self.homopolyer_compression) 117 | read_onex_kmers = kmers_obj.get_one_x_coverage_of_kmers() 118 | 119 | intersect_read_fasta_kmers = self.fasta_obj.kmer_keys_set & set( 120 | read_onex_kmers) 121 | 122 | if len(intersect_read_fasta_kmers) > self.min_kmers_for_onex_pass: 123 | gene_names = self.genes_containing_first_pass_kmers( 124 | self.fasta_obj, intersect_read_fasta_kmers) 125 | 126 | mk1x = self.min_kmers_for_onex_pass 127 | candidate_gene_names = { 128 | k: v for k, v in gene_names.items() if v > mk1x} 129 | 130 | return candidate_gene_names 131 | 132 | return {} 133 | 134 | '''place kmers into read bins''' 135 | 136 | def put_kmers_in_read_bins(self, seq_length, end, fasta_kmers, read_kmers): 137 | self.logger.info("Put k-mers in read bins") 138 | sequence_hits = numpy.zeros(int(seq_length/self.k)+1, dtype=int) 139 | hit_counter = 0 140 | 141 | hit_kmers = {} 142 | for read_kmer, read_kmer_hit in read_kmers.items(): 143 | if read_kmer in fasta_kmers: 144 | for coordinate in read_kmer_hit: 145 | hit_counter += 1 146 | sequence_hits[int(coordinate/self.k)] += 1 147 | hit_kmers[read_kmer] = read_kmer_hit 148 | return sequence_hits, hit_counter, hit_kmers 149 | 150 | '''do a fine grained mapping of kmers to a read''' 151 | 152 | def map_kmers_to_read(self, sequence, read, candidate_gene_names): 153 | self.logger.info("Map k-mers to read") 154 | 155 | seq_length = len(sequence) 156 | end = seq_length - self.k 157 | 158 | kmers_obj = Kmers(sequence, self.k, self.homopolyer_compression) 159 | read_kmers = kmers_obj.get_all_kmers( 160 | max_kmer_count=self.max_kmer_count) 161 | is_read_matching = False 162 | 163 | sequence_hits, hit_counter, read_kmer_hits =\ 164 | self.put_kmers_in_read_bins( 165 | seq_length, 166 | end, 167 | self.fasta_kmers, 168 | read_kmers) 169 | 170 | blocks_obj = Blocks(self.k, self.min_block_size, 171 | self.max_gap, self.margin) 172 
| block_start, block_end = blocks_obj.find_largest_block(sequence_hits) 173 | 174 | block_start = blocks_obj.adjust_block_start(block_start) 175 | block_end = blocks_obj.adjust_block_end(block_end, seq_length) 176 | 177 | block_kmers = self.create_kmers_for_block( 178 | block_start, block_end, read_kmer_hits) 179 | is_read_matching = self.apply_kmers_to_genes( 180 | self.fasta_obj, block_kmers, candidate_gene_names) 181 | 182 | if self.filtered_reads_file: 183 | self.append_read_to_fastq_file(read, block_start, block_end) 184 | 185 | return is_read_matching 186 | 187 | '''optional method to output only reads matching the genes allowing for 188 | offline assembly''' 189 | 190 | def append_read_to_fastq_file(self, read, block_start, block_end): 191 | with open(self.filtered_reads_file, 'a+') as output_fh: 192 | output_fh.write(str(read)) 193 | 194 | '''given a putative block where a gene may match, get all the kmers''' 195 | 196 | def create_kmers_for_block(self, block_start, block_end, read_kmer_hits): 197 | if block_end == 0: 198 | return {} 199 | 200 | def get_one_x_coverage_of_kmers(self, sequence, k, end): 201 | return [sequence[i:i+k] for i in range(0, end, k)] 202 | block_kmers = {} 203 | 204 | for read_kmer, read_kmer_hit in read_kmer_hits.items(): 205 | found_match_in_block = False 206 | for coordinate in read_kmer_hit: 207 | if coordinate >= block_start and coordinate <= block_end: 208 | found_match_in_block = True 209 | break 210 | if found_match_in_block: 211 | block_kmers[read_kmer] = 1 212 | 213 | return block_kmers 214 | 215 | '''Get a list of genes which may be in the reads''' 216 | 217 | def genes_containing_first_pass_kmers(self, fasta_obj, first_pass_kmers): 218 | genes = {} 219 | for current_kmer in first_pass_kmers: 220 | if current_kmer in fasta_obj.kmers_to_genes: 221 | for gene_name in fasta_obj.kmers_to_genes[current_kmer]: 222 | if gene_name in self.genes_with_100_percent: 223 | continue 224 | 225 | if gene_name in genes: 226 |
genes[gene_name] += 1 227 | else: 228 | genes[gene_name] = 1 229 | return genes 230 | 231 | '''given some kmer hits, apply the kmers to the genes so theres a count''' 232 | 233 | def apply_kmers_to_genes(self, fasta_obj, hit_kmers, gene_names): 234 | num_genes_applied = 0 235 | min_kmers = self.min_kmers_for_onex_pass * self.k 236 | hit_kmers_set = set(hit_kmers) 237 | 238 | for gene_name in gene_names.keys(): 239 | if gene_names[gene_name] < self.min_kmers_for_onex_pass: 240 | continue 241 | kmers_dict = fasta_obj.sequences_to_kmers[gene_name] 242 | intersection_hit_keys = set(kmers_dict) & hit_kmers_set 243 | 244 | if len(intersection_hit_keys) > min_kmers: 245 | num_genes_applied += 1 246 | for kmer in intersection_hit_keys: 247 | 248 | if fasta_obj.sequences_to_kmers_count[gene_name][kmer] > 0: 249 | fasta_obj.sequences_to_kmers[gene_name][kmer] += 1 / \ 250 | fasta_obj.sequences_to_kmers_count[gene_name][kmer] 251 | 252 | if num_genes_applied > 0: 253 | return True 254 | else: 255 | return False 256 | 257 | '''calculate the coverage of the genes''' 258 | 259 | def full_gene_coverage(self, counter): 260 | self.logger.info("Check the coverage of a sequence") 261 | alleles = [] 262 | s2k = self.fasta_obj.sequences_to_kmers 263 | for (gene_name, kmers_dict) in s2k.items(): 264 | kv = kmers_dict.values() 265 | kv_total_length = len(kmers_dict) 266 | kz = len([x for x in kv if x == 0]) 267 | kl = kv_total_length - kz 268 | 269 | if kl > kz: 270 | alleles.append(Gene(gene_name, kl, kz)) 271 | 272 | self.print_out_alleles(self.filter_contained_alleles(alleles)) 273 | self.identify_alleles_with_100_percent(alleles) 274 | 275 | return alleles 276 | 277 | '''Identify genes which match 100 percent''' 278 | 279 | def identify_alleles_with_100_percent(self, alleles): 280 | for g in alleles: 281 | if(g.percentage_coverage() == 100 282 | and g.name not in self.genes_with_100_percent): 283 | self.genes_with_100_percent[g.name] = 1 284 | 285 | '''Some genes are subsets of other 
genes, so optionally filter out''' 286 | 287 | def filter_contained_alleles(self, alleles): 288 | if self.no_gene_filter: 289 | return alleles 290 | 291 | prefix_to_coverage = {} 292 | filtered_alleles = [] 293 | for g in alleles: 294 | if g.prefix_short_name() in prefix_to_coverage: 295 | prefix_to_coverage[g.prefix_short_name()].append(g) 296 | else: 297 | prefix_to_coverage[g.prefix_short_name()] = [g] 298 | 299 | for genes in prefix_to_coverage.values(): 300 | genes.sort(key=lambda x: x.percentage_coverage(), reverse=True) 301 | for index, gene in enumerate(genes): 302 | if gene.percentage_coverage() == 100 or index == 0: 303 | filtered_alleles.append(gene) 304 | 305 | return filtered_alleles 306 | 307 | '''print alleles to stdout''' 308 | 309 | def print_out_alleles(self, alleles): 310 | found_alleles = False 311 | 312 | for g in alleles: 313 | if g.percentage_coverage() >= self.min_perc_coverage: 314 | found_alleles = True 315 | if self.output_file: 316 | with open(self.output_file, 'a+') as output_fh: 317 | output_fh.write(str(g) + "\n") 318 | else: 319 | print(g) 320 | 321 | if found_alleles and self.print_interval is not None: 322 | print("****") 323 | 324 | '''print out the header''' 325 | 326 | def print_out_header(self): 327 | header = "GENE\tCOMPLETENESS\t%COVERAGE\tACCESSION\tDATABASE\tPRODUCT" 328 | if self.output_file: 329 | with open(self.output_file, 'w+') as output_fh: 330 | output_fh.write(header + "\n") 331 | else: 332 | print(header) 333 | 334 | # Derived from https://github.com/sanger-pathogens/Fastaq 335 | # Author: Martin Hunt 336 | def open_file_read(self): 337 | if self.filename == '-': 338 | f = sys.stdin 339 | elif self.filename.endswith('.gz'): 340 | # first check that the file is OK according to gunzip 341 | retcode = subprocess.call('gunzip -t ' + self.filename, shell=True) 342 | if retcode > 1: 343 | raise Error("Error file may not be gzipped correctly") 344 | 345 | # now open the file 346 | f = os.popen('gunzip -c ' + 
self.filename) 347 | else: 348 | try: 349 | f = open(self.filename) 350 | except IOError: 351 | raise Error("Error opening for reading file '" + 352 | self.filename + "'") 353 | 354 | return f 355 | -------------------------------------------------------------------------------- /tiptoft/Gene.py: -------------------------------------------------------------------------------- 1 | '''Represents a gene or sequence allele''' 2 | import re 3 | 4 | 5 | class Gene: 6 | def __init__(self, name, kmers_with_coverage, kmers_without_coverage): 7 | self.name = name 8 | self.kmers_with_coverage = kmers_with_coverage 9 | self.kmers_without_coverage = kmers_without_coverage 10 | 11 | def __str__(self): 12 | return "\t".join(( 13 | self.short_name(), 14 | self.completeness(), 15 | str(self.percentage_coverage()), 16 | self.accession(), 17 | 'plasmidfinder', 18 | self.name)) 19 | 20 | '''is the gene fully covered''' 21 | 22 | def completeness(self): 23 | if self.is_full_coverage(): 24 | return "Full" 25 | else: 26 | return "Partial" 27 | 28 | def is_full_coverage(self): 29 | if self.kmers_without_coverage == 0: 30 | return True 31 | else: 32 | return False 33 | 34 | '''calculate the coverage of the gene''' 35 | 36 | def percentage_coverage(self): 37 | total_kmers = self.kmers_with_coverage + self.kmers_without_coverage 38 | if total_kmers > 0: 39 | return int((self.kmers_with_coverage*100)/total_kmers) 40 | else: 41 | return 0 42 | 43 | '''construct the short human readable name for the output''' 44 | 45 | def prefix_short_name(self): 46 | regex = r"^([^\.]+)\." 
47 | 48 | m = re.search(regex, self.short_name()) 49 | if m and m.group: 50 | return str(m.group(1)) 51 | else: 52 | return '' 53 | 54 | def short_name(self): 55 | regex = r"^([^_]+)_" 56 | 57 | m = re.search(regex, self.name) 58 | if m and m.group: 59 | return str(m.group(1)) 60 | else: 61 | return '' 62 | 63 | '''extract the accession which is hardcoded in the DB sequence name''' 64 | 65 | def accession(self): 66 | regex = r"^([^_]+)_([^_]*)_(.+)$" 67 | 68 | m = re.search(regex, self.name) 69 | if m and m.group: 70 | return str(m.group(3)) 71 | else: 72 | return '' 73 | -------------------------------------------------------------------------------- /tiptoft/InputTypes.py: -------------------------------------------------------------------------------- 1 | '''check the input types from the command line script''' 2 | import os 3 | import argparse 4 | 5 | 6 | class InputTypes: 7 | 8 | '''The input file should exist''' 9 | def is_fastq_file_valid(filename): 10 | if not os.path.exists(filename) and filename != "-": 11 | raise argparse.ArgumentTypeError('Cannot access input file') 12 | return filename 13 | 14 | '''The input file should exist''' 15 | def is_plasmid_file_valid(filename): 16 | if not os.path.exists(filename): 17 | raise argparse.ArgumentTypeError('Cannot access input file') 18 | return filename 19 | 20 | def is_kmer_valid(value_str): 21 | if value_str.isdigit(): 22 | kmer = int(value_str) 23 | if kmer >= 7 and kmer <= 31: 24 | return kmer 25 | raise argparse.ArgumentTypeError( 26 | """Invalid Kmer value, it must be an integer between 7 and 31. 
27 | If you need values outside of this range, you are probably 28 | trying to use this software for something it was never 29 | designed to do.""") 30 | -------------------------------------------------------------------------------- /tiptoft/Kmers.py: -------------------------------------------------------------------------------- 1 | from homopolymer_compression import homopolymer_compression_of_sequence 2 | '''Given a string of nucleotides and k, return all kmers''' 3 | 4 | # The homopolymer compression is quite intensive so run it with cython 5 | import pyximport 6 | pyximport.install() 7 | 8 | 9 | class Kmers: 10 | def __init__(self, sequence, k, homopolyer_compression): 11 | self.sequence = homopolymer_compression_of_sequence( 12 | sequence) if homopolyer_compression else sequence 13 | self.k = k 14 | self.end = len(self.sequence) - self.k + 1 15 | 16 | '''Get all kmers''' 17 | 18 | def get_all_kmers_counter(self, max_kmer_count=10): 19 | kmers = self.get_all_kmers_filtered(max_kmer_count) 20 | return {x: 0 for x in kmers.keys()} 21 | 22 | '''get all kmers and count them''' 23 | 24 | def get_all_kmers_freq(self, max_kmer_count=10): 25 | kmers = self.get_all_kmers_filtered(max_kmer_count) 26 | return {k: len(v) for k, v in kmers.items()} 27 | 28 | '''old api''' 29 | 30 | def get_all_kmers(self, max_kmer_count=10): 31 | return self.get_all_kmers_filtered(max_kmer_count) 32 | 33 | '''Filter out kmers which are too numerous''' 34 | 35 | def get_all_kmers_filtered(self, max_kmer_count=10): 36 | kmers = {} 37 | 38 | kmer_sequences = [self.sequence[i:i+self.k] 39 | for i in range(0, self.end)] 40 | 41 | for i, k in enumerate(kmer_sequences): 42 | if k in kmers: 43 | kmers[k].append(i) 44 | else: 45 | kmers[k] = [i] 46 | 47 | filtered_kmers = {k: v for k, 48 | v in kmers.items() if len(v) <= max_kmer_count} 49 | return filtered_kmers 50 | 51 | '''get a 1x coverage of kmers for a sequence''' 52 | 53 | def get_one_x_coverage_of_kmers(self): 54 | return
[self.sequence[i:i+self.k] for i in range(0, self.end, self.k)] 55 | -------------------------------------------------------------------------------- /tiptoft/Read.py: -------------------------------------------------------------------------------- 1 | # Derived from https://github.com/sanger-pathogens/Fastaq 2 | # Author: Martin Hunt 3 | # Copyright (c) 2013 - 2017 by Genome Research Ltd. 4 | # GNU GPL version 3 5 | 6 | 7 | class Read: 8 | def __init__(self, id=None, seq=None, qual=None): 9 | self.id = id 10 | self.seq = seq 11 | self.qual = qual 12 | 13 | def subsequence(self, start, end): 14 | return Read( 15 | id=self.id+"_"+str(start)+"_"+str(end), 16 | seq=self.seq[start:end], 17 | qual=self.qual[start:end]) 18 | 19 | def __str__(self): 20 | return '@' + self.id + '\n' + self.seq + '\n+\n' + self.qual + '\n' 21 | 22 | def reverse_complement_sequence(self): 23 | return self.seq.translate(str.maketrans("ATCGatcg", "TAGCtagc"))[::-1] 24 | 25 | def reverse_read(self): 26 | return Read( 27 | id=self.id+"_reverse", 28 | seq=self.reverse_complement_sequence(), 29 | qual=self.qual) 30 | 31 | def get_next_from_file(self, f): 32 | line = f.readline() 33 | 34 | while line == '\n': 35 | line = f.readline() 36 | 37 | if not line: 38 | return False 39 | 40 | self.id = line.rstrip()[1:] 41 | line = f.readline() 42 | 43 | self.seq = line.strip() 44 | line = f.readline() 45 | line = f.readline() 46 | 47 | self.qual = line.rstrip() 48 | return self 49 | -------------------------------------------------------------------------------- /tiptoft/RefGenesGetter.py: -------------------------------------------------------------------------------- 1 | # From ARIBA 2 | # github.com/sanger-pathogens/ariba/blob/master/ariba/ref_genes_getter.py 3 | # bfd6cc9a409828ecc940b554164fabc8e6cc6b9a 4 | # Author: Martin Hunt 5 | # Copyright (c) 2015 - 2018 by Genome Research Ltd. 
6 | # GNU GPL version 3 7 | 8 | import os 9 | import subprocess 10 | import pyfastaq 11 | import shutil 12 | 13 | 14 | class Error (Exception): 15 | pass 16 | 17 | 18 | class RefGenesGetter: 19 | def __init__(self, verbose=False): 20 | self.ref_db = 'plasmidfinder' 21 | self.verbose = verbose 22 | self.genetic_code = 11 23 | self.max_download_attempts = 3 24 | self.sleep_time = 2 25 | 26 | def _get_from_plasmidfinder(self, outprefix): 27 | outprefix = os.path.abspath(outprefix) 28 | final_fasta = outprefix + '.fa' 29 | final_tsv = outprefix + '.tsv' 30 | tmpdir = outprefix + '.tmp.download' 31 | current_dir = os.getcwd() 32 | 33 | try: 34 | os.mkdir(tmpdir) 35 | os.chdir(tmpdir) 36 | except OSError: 37 | raise Error('Error mkdir/chdir ' + tmpdir) 38 | 39 | files_to_download = ['enterobacteriaceae.fsa', 'Inc18.fsa', 'NT_Rep.fsa', 'Rep1.fsa', 'Rep2.fsa', 'Rep3.fsa', 'RepA_N.fsa', 'RepL.fsa', 'Rep_trans.fsa'] 40 | 41 | for f in files_to_download: 42 | cmd = 'curl -o '+str(f)+' https://bitbucket.org/' \ 43 | 'genomicepidemiology/plasmidfinder_db/raw/'\ 44 | 'master/'+str(f) 45 | print('Downloading data with:', cmd, sep='\n') 46 | subprocess.check_call(cmd, shell=True) 47 | 48 | print('Combining downloaded fasta files...') 49 | fout_fa = pyfastaq.utils.open_file_write(final_fasta) 50 | fout_tsv = pyfastaq.utils.open_file_write(final_tsv) 51 | name_count = {} 52 | 53 | for filename in os.listdir(tmpdir): 54 | if filename.endswith('.fsa'): 55 | print(' ', filename) 56 | file_reader = pyfastaq.sequences.file_reader( 57 | os.path.join(tmpdir, filename)) 58 | for seq in file_reader: 59 | original_id = seq.id 60 | seq.id = seq.id.replace('_', '.', 1) 61 | seq.seq = seq.seq.upper() 62 | if seq.id in name_count: 63 | name_count[seq.id] += 1 64 | seq.id = seq.id + '.'
+ str(name_count[seq.id]) 65 | else: 66 | name_count[seq.id] = 1 67 | 68 | print(seq, file=fout_fa) 69 | print(seq.id, '0', '0', '.', '.', 'Original name was ' + 70 | original_id, sep='\t', file=fout_tsv) 71 | 72 | pyfastaq.utils.close(fout_fa) 73 | pyfastaq.utils.close(fout_tsv) 74 | print('\nFinished combining files\n') 75 | os.chdir(current_dir) 76 | if not self.verbose: 77 | shutil.rmtree(tmpdir) 78 | print('Finished. Final files are:', final_fasta, 79 | final_tsv, sep='\n\t', end='\n\n') 80 | print('If you use this downloaded data, please cite:') 81 | print('"PlasmidFinder and pMLST: in silico detection and typing') 82 | print(' of plasmids", Carattoli et al 2014, PMID: 24777092\n') 83 | 84 | def run(self, outprefix): 85 | getattr(self, '_get_from_' + self.ref_db)(outprefix) 86 | -------------------------------------------------------------------------------- /tiptoft/TipToft.py: -------------------------------------------------------------------------------- 1 | '''Driver class for the tiptoft script''' 2 | import logging 3 | import os 4 | import sys 5 | import time 6 | import pkg_resources 7 | from tiptoft.Fasta import Fasta 8 | from tiptoft.Fastq import Fastq 9 | 10 | 11 | class TipToft: 12 | def __init__(self, options): 13 | self.logger = logging.getLogger(__name__) 14 | self.plasmid_data = options.plasmid_data 15 | self.input_fastq = options.input_fastq 16 | self.kmer = options.kmer 17 | self.verbose = options.verbose 18 | self.min_fasta_hits = options.min_fasta_hits 19 | self.print_interval = options.print_interval 20 | self.output_file = options.output_file 21 | self.filtered_reads_file = options.filtered_reads_file 22 | self.max_gap = options.max_gap 23 | self.min_block_size = options.min_block_size 24 | self.margin = options.margin 25 | self.start_time = int(time.time()) 26 | self.min_kmers_for_onex_pass = options.min_kmers_for_onex_pass 27 | self.min_perc_coverage = options.min_perc_coverage 28 | if options.no_hc_compression: 29 |
self.homopolyer_compression = False 30 | else: 31 | self.homopolyer_compression = True 32 | self.max_kmer_count = options.max_kmer_count 33 | self.no_gene_filter = options.no_gene_filter 34 | 35 | if self.plasmid_data is None: 36 | self.plasmid_data = str(pkg_resources.resource_filename( 37 | __name__, 'data/plasmid_data.fa')) 38 | 39 | #if self.output_file and os.path.exists(self.output_file): 40 | # self.logger.error( 41 | # "The output file already exists, " 42 | # "please choose another filename: " 43 | # + self.output_file) 44 | # sys.exit(1) 45 | 46 | if(self.filtered_reads_file and 47 | os.path.exists(self.filtered_reads_file)): 48 | self.logger.error( 49 | "The output filtered reads file already exists," 50 | " please choose another filename: " + self.filtered_reads_file) 51 | sys.exit(1) 52 | 53 | if self.verbose: 54 | self.logger.setLevel(logging.DEBUG) 55 | else: 56 | self.logger.setLevel(logging.ERROR) 57 | 58 | '''pass everything over to other classes''' 59 | 60 | def run(self): 61 | self.logger.info("Starting analysis") 62 | fasta = Fasta(self.logger, 63 | self.plasmid_data, 64 | self.kmer, 65 | self.homopolyer_compression, 66 | max_kmer_count=self.max_kmer_count) 67 | fastq = Fastq(self.logger, 68 | self.input_fastq, 69 | self.kmer, 70 | fasta.all_kmers_in_file(), 71 | self.min_fasta_hits, 72 | self.print_interval, 73 | self.output_file, 74 | self.filtered_reads_file, 75 | fasta, 76 | self.homopolyer_compression, 77 | max_gap=self.max_gap, 78 | min_block_size=self.min_block_size, 79 | margin=self.margin, 80 | start_time=self.start_time, 81 | min_kmers_for_onex_pass=self.min_kmers_for_onex_pass, 82 | min_perc_coverage=self.min_perc_coverage, 83 | max_kmer_count=self.max_kmer_count, 84 | no_gene_filter=self.no_gene_filter) 85 | fastq.read_filter_and_map() 86 | -------------------------------------------------------------------------------- /tiptoft/TipToftDatabaseDownloader.py: 
-------------------------------------------------------------------------------- 1 | '''driver class for downloading database''' 2 | import logging 3 | from tiptoft.RefGenesGetter import RefGenesGetter 4 | 5 | 6 | class TipToftDatabaseDownloader: 7 | def __init__(self, options): 8 | self.logger = logging.getLogger(__name__) 9 | self.output_prefix = options.output_prefix 10 | self.verbose = options.verbose 11 | 12 | if self.verbose: 13 | self.logger.setLevel(logging.DEBUG) 14 | else: 15 | self.logger.setLevel(logging.ERROR) 16 | 17 | '''pass all over to other classes''' 18 | 19 | def run(self): 20 | refgenes = RefGenesGetter(verbose=self.verbose) 21 | 22 | if self.output_prefix: 23 | refgenes.run(self.output_prefix) 24 | else: 25 | self.logger.error("Please check the input parameters") 26 | -------------------------------------------------------------------------------- /tiptoft/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewjpage/tiptoft/e71b002bef09f97d6881fbc69bedc7ad65fe6b48/tiptoft/__init__.py -------------------------------------------------------------------------------- /tiptoft/data/20180928.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewjpage/tiptoft/e71b002bef09f97d6881fbc69bedc7ad65fe6b48/tiptoft/data/20180928.txt -------------------------------------------------------------------------------- /tiptoft/data/plasmid_data.tsv: -------------------------------------------------------------------------------- 1 | rep1.1_repE(pAMbeta1)_AF007787 0 0 . . Original name was rep1_1_repE(pAMbeta1)_AF007787 2 | rep1.2_repS(pBT233)_X64695 0 0 . . Original name was rep1_2_repS(pBT233)_X64695 3 | rep1.3_repR(pGB354)_U83488 0 0 . . Original name was rep1_3_repR(pGB354)_U83488 4 | rep1.4_repE(orf17)_NC011140 0 0 . . Original name was rep1_4_repE(orf17)_NC011140 5 | rep1.5_CDS6(pRE25)_X92945 0 0 . . 
Original name was rep1_5_CDS6(pRE25)_X92945 6 | rep1.6_repE(pTEF1)_AE016833 0 0 . . Original name was rep1_6_repE(pTEF1)_AE016833 7 | rep2.1_ORF(E.faeciumContig1183)_JDOE 0 0 . . Original name was rep2_1_ORF(E.faeciumContig1183)_JDOE 8 | rep2.2_repR(pEF1)_DQ198088 0 0 . . Original name was rep2_2_repR(pEF1)_DQ198088 9 | rep3.1_rep63A(pAW63)_BTH011655 0 0 . . Original name was rep3_1_rep63A(pAW63)_BTH011655 10 | rep3.2_rep165(pBMB165)_DQ242517 0 0 . . Original name was rep3_2_rep165(pBMB165)_DQ242517 11 | rep3.3_rep(pBT9727)_CP000047 0 0 . . Original name was rep3_3_rep(pBT9727)_CP000047 12 | rep3.4_CDS38(pOX2)_NC002146 0 0 . . Original name was rep3_4_CDS38(pOX2)_NC002146 13 | rep4.1_CDS2(pCRL291.1)_NC002799 0 0 . . Original name was rep4_1_CDS2(pCRL291.1)_NC002799 14 | rep4.2_CDS3(pKC5b)_AF378372 0 0 . . Original name was rep4_2_CDS3(pKC5b)_AF378372 15 | rep4.3_ORF(pMBB1)_EFU26268 0 0 . . Original name was rep4_3_ORF(pMBB1)_EFU26268 16 | rep5.1_rep(pMW2)_NC005011 0 0 . . Original name was rep5_1_rep(pMW2)_NC005011 17 | rep5.2_rep(pN315)_AP003139 0 0 . . Original name was rep5_2_rep(pN315)_AP003139 18 | rep5.3_rep(pWBG752)_GQ900394.1 0 0 . . Original name was rep5_3_rep(pWBG752)_GQ900394.1 19 | rep5.4_rep(SAP047A)_GQ900405.1 0 0 . . Original name was rep5_4_rep(SAP047A)_GQ900405.1 20 | rep5.5_rep(pRJ6)_AF241888.2 0 0 . . Original name was rep5_5_rep(pRJ6)_AF241888.2 21 | rep5.6_rep(pRJ9)_AF447813.1 0 0 . . Original name was rep5_6_rep(pRJ9)_AF447813.1 22 | rep5.7_rep(pJE1)_AF051916.1 0 0 . . Original name was rep5_7_rep(pJE1)_AF051916.1 23 | rep6.1_repA(p703/5)_AF109375 0 0 . . Original name was rep6_1_repA(p703/5)_AF109375 24 | rep7.1_repC(Cassette)_AB037671 0 0 . . Original name was rep7_1_repC(Cassette)_AB037671 25 | rep7.3_CDS1(pCW7)_J03323 0 0 . . Original name was rep7_3_CDS1(pCW7)_J03323 26 | rep7.4_repD(pK214)_NC009751 0 0 . . Original name was rep7_4_repD(pK214)_NC009751 27 | rep7.5_CDS1(pKC5b)_AF378372 0 0 . . 
Original name was rep7_5_CDS1(pKC5b)_AF378372 28 | rep7.6_ORF(pKH1)_SAU38656 0 0 . . Original name was rep7_6_ORF(pKH1)_SAU38656 29 | rep7.7_rep(pKH7)_NC002096 0 0 . . Original name was rep7_7_rep(pKH7)_NC002096 30 | rep7.8_CDS4(pS194)_NC005564 0 0 . . Original name was rep7_8_CDS4(pS194)_NC005564 31 | rep7.9_CDS3(pUSA02)_NC007791 0 0 . . Original name was rep7_9_CDS3(pUSA02)_NC007791 32 | rep7.10_repC(pKH17)_NC_010284.1 0 0 . . Original name was rep7_10_repC(pKH17)_NC_010284.1 33 | rep7.11_repD(pTZ4)_NC010111.1 0 0 . . Original name was rep7_11_repD(pTZ4)_NC010111.1 34 | rep7.12_rep(SAP060B)_GQ900417.1 0 0 . . Original name was rep7_12_rep(SAP060B)_GQ900417.1 35 | rep7.13_rep(pSBK203)_U35036.1 0 0 . . Original name was rep7_13_rep(pSBK203)_U35036.1 36 | rep7.14_rep(MSSA476)_BX571857.1 0 0 . . Original name was rep7_14_rep(MSSA476)_BX571857.1 37 | rep7.15_rep(pS0385-2)_AM990994.1 0 0 . . Original name was rep7_15_rep(pS0385-2)_AM990994.1 38 | rep7.16_repD_AY939911 0 0 . . Original name was rep7_16_repD_AY939911 39 | rep7.17_repC(pS0385-1)_AM990993.1 0 0 . . Original name was rep7_17_repC(pS0385-1)_AM990993.1 40 | rep7b.1_repJ(pC223)_NC_005243.1 0 0 . . Original name was rep7b_1_repJ(pC223)_NC_005243.1 41 | rep7b.2_pVGA_NC_011605.1 0 0 . . Original name was rep7b_2_pVGA_NC_011605.1 42 | rep7b.3_SAP089A_GQ900440.1 0 0 . . Original name was rep7b_3_SAP089A_GQ900440.1 43 | rep8.1_EP002*(pAM373)_NC002630 0 0 . . Original name was rep8_1_EP002*(pAM373)_NC002630 44 | rep8.2_repA(pEJ97-1)_AJ49170 0 0 . . Original name was rep8_2_repA(pEJ97-1)_AJ49170 45 | rep9.1_repA(pAD1)_L01794 0 0 . . Original name was rep9_1_repA(pAD1)_L01794 46 | rep9.2_prgW(pCF10)_AY855841 0 0 . . Original name was rep9_2_prgW(pCF10)_AY855841 47 | rep9.3_repA(pPD1)_D78016 0 0 . . Original name was rep9_3_repA(pPD1)_D78016 48 | rep9.4_repA2(pTEF2)_AE016831 0 0 . . Original name was rep9_4_repA2(pTEF2)_AE016831 49 | rep10.2_CDS1(pIM13)_M13761 0 0 . . 
Original name was rep10_2_CDS1(pIM13)_M13761 50 | rep10.3_ORF(pNE131)_NC001390 0 0 . . Original name was rep10_3_ORF(pNE131)_NC001390 51 | rep10.4_repL(pDLK1)_GU562624.1 0 0 . . Original name was rep10_4_repL(pDLK1)_GU562624.1 52 | rep10.5_repL(pWBG738)_DQ088624.1 0 0 . . Original name was rep10_5_repL(pWBG738)_DQ088624.1 53 | rep10.6_rep(SAP093B)_GQ900442.1 0 0 . . Original name was rep10_6_rep(SAP093B)_GQ900442.1 54 | rep10b.1_rep(pSK3)_NC001994 0 0 . . Original name was rep10b_1_rep(pSK3)_NC001994 55 | rep10b.2_rep(pSK6)_SAU96610 0 0 . . Original name was rep10b_2_rep(pSK6)_SAU96610 56 | rep11.1_repA(pB82)_AB178871 0 0 . . Original name was rep11_1_repA(pB82)_AB178871 57 | rep11.2_repA(pEF1071)_AF164559 0 0 . . Original name was rep11_2_repA(pEF1071)_AF164559 58 | rep11.3_rep(pEFR)_AF511037 0 0 . . Original name was rep11_3_rep(pEFR)_AF511037 59 | rep12.1_CDS32(pBMB67)_DQ363750 0 0 . . Original name was rep12_1_CDS32(pBMB67)_DQ363750 60 | rep12.2_CDS1(plasmid)_AY278324 0 0 . . Original name was rep12_2_CDS1(plasmid)_AY278324 61 | rep13.1_ORF(pC194)_NC002013 0 0 . . Original name was rep13_1_ORF(pC194)_NC002013 62 | rep13.2_CDS39(pSSP1)_AP008935 0 0 . . Original name was rep13_2_CDS39(pSSP1)_AP008935 63 | rep13.3_rep(pWBG1773)_EF537646 0 0 . . Original name was rep13_3_rep(pWBG1773)_EF537646 64 | rep13.4_rep(pKH13)_NC010426.1 0 0 . . Original name was rep13_4_rep(pKH13)_NC010426.1 65 | rep13.5_rep(pMC524/MBM)_AJ312056.2 0 0 . . Original name was rep13_5_rep(pMC524/MBM)_AJ312056.2 66 | rep14.1_CDS2(pEFNP1)_AB038522 0 0 . . Original name was rep14_1_CDS2(pEFNP1)_AB038522 67 | rep14.2_ORF(pKQ10)_EFU01917 0 0 . . Original name was rep14_2_ORF(pKQ10)_EFU01917 68 | rep14.3_ORF(pRI1)_EU327398 0 0 . . Original name was rep14_3_ORF(pRI1)_EU327398 69 | rep15.1_repA(pLW043)_AE017171 0 0 . . Original name was rep15_1_repA(pLW043)_AE017171 70 | rep16.1_CDS8(pSAS)_BX571858 0 0 . . Original name was rep16_1_CDS8(pSAS)_BX571858 71 | rep16.2_CDS6(pSJH101)_CP000737 0 0 . . 
Original name was rep16_2_CDS6(pSJH101)_CP000737 72 | rep16.3_rep(Saa6159)_CP002115.1 0 0 . . Original name was rep16_3_rep(Saa6159)_CP002115.1 73 | rep16.4_unknown(SAP056A)_GQ900478.1 0 0 . . Original name was rep16_4_unknown(SAP056A)_GQ900478.1 74 | rep16.5_unknown(pWBG759)_GQ900401.1 0 0 . . Original name was rep16_5_unknown(pWBG759)_GQ900401.1 75 | rep16.6_unknown(SAP071A)_GQ900485.1 0 0 . . Original name was rep16_6_unknown(SAP071A)_GQ900485.1 76 | rep16.7_rep(pBORa53)_AY917098.1 0 0 . . Original name was rep16_7_rep(pBORa53)_AY917098.1 77 | rep17.1_CDS29(pRUM)_AF507977 0 0 . . Original name was rep17_1_CDS29(pRUM)_AF507977 78 | rep18.1_repA(p200B)_AB158402 0 0 . . Original name was rep18_1_repA(p200B)_AB158402 79 | rep18.2_repA(pEF418)_AF408195 0 0 . . Original name was rep18_2_repA(pEF418)_AF408195 80 | rep19.1_CDS45(pLEW6932)_NC009130 0 0 . . Original name was rep19_1_CDS45(pLEW6932)_NC009130 81 | rep19.2_rep(pSA1379)_NC007931 0 0 . . Original name was rep19_2_rep(pSA1379)_NC007931 82 | rep19.3_CDS20(pSJH901)_CP000704 0 0 . . Original name was rep19_3_CDS20(pSJH901)_CP000704 83 | rep19.4_repA(pUB101)_NC005127 0 0 . . Original name was rep19_4_repA(pUB101)_NC005127 84 | rep19.5_rep(pWBG747)_GQ900399.1 0 0 . . Original name was rep19_5_rep(pWBG747)_GQ900399.1 85 | rep19.6_ORF34(EDINA)_AP003089.1 0 0 . . Original name was rep19_6_ORF34(EDINA)_AP003089.1 86 | rep19.7_repA(SAP019A)_GQ900385.1 0 0 . . Original name was rep19_7_repA(SAP019A)_GQ900385.1 87 | rep19.8_repA(pWBG759)_GQ900401.1 0 0 . . Original name was rep19_8_repA(pWBG759)_GQ900401.1 88 | rep19.9_repA(AP071A)_GQ900485.1 0 0 . . Original name was rep19_9_repA(AP071A)_GQ900485.1 89 | rep19.10_rep(pWBG746)_GQ900390.1 0 0 . . Original name was rep19_10_rep(pWBG746)_GQ900390.1 90 | rep20.1_ORF1(EDINA)_AP003089 0 0 . . Original name was rep20_1_ORF1(EDINA)_AP003089 91 | rep20.2_repA(pWBG753)_GQ900395.1 0 0 . . Original name was rep20_2_repA(pWBG753)_GQ900395.1 92 | rep20.3_rep(pTW20)_FN433597.1 0 0 . . 
Original name was rep20_3_rep(pTW20)_FN433597.1 93 | rep20.4_repA(SAP102A)_GQ900496.1 0 0 . . Original name was rep20_4_repA(SAP102A)_GQ900496.1 94 | rep20.5_repA(pSK23)_GQ900491.1 0 0 . . Original name was rep20_5_repA(pSK23)_GQ900491.1 95 | rep20.6_repA(SAP059A)_GQ900480.1 0 0 . . Original name was rep20_6_repA(SAP059A)_GQ900480.1 96 | rep20.7_repA(SAP074A)_GQ900426.1 0 0 . . Original name was rep20_7_repA(SAP074A)_GQ900426.1 97 | rep20.8_repA(SAP063A)_GQ900418.1 0 0 . . Original name was rep20_8_repA(SAP063A)_GQ900418.1 98 | rep20.9_repA(SAP055A)_GQ900414.1 0 0 . . Original name was rep20_9_repA(SAP055A)_GQ900414.1 99 | rep20.10_repA(SAP057A)_NC_013334.1 0 0 . . Original name was rep20_10_repA(SAP057A)_NC_013334.1 100 | rep21.1_rep(pWBG754)_GQ900396.1 0 0 . . Original name was rep21_1_rep(pWBG754)_GQ900396.1 101 | rep21.2_rep(pNVH01)_AJ512814.1 0 0 . . Original name was rep21_2_rep(pNVH01)_AJ512814.1 102 | rep21.3_pS0385-3_AM990995.1 0 0 . . Original name was rep21_3_pS0385-3_AM990995.1 103 | rep21.4_repRC(pSK41)_AF051917.1 0 0 . . Original name was rep21_4_repRC(pSK41)_AF051917.1 104 | rep21.5_repRC(pGO1)_FM207042.1 0 0 . . Original name was rep21_5_repRC(pGO1)_FM207042.1 105 | rep21.6_rep(pBMSa1)_AY541446.1 0 0 . . Original name was rep21_6_rep(pBMSa1)_AY541446.1 106 | rep21.7_rep(pWBG760)_GQ900473.1 0 0 . . Original name was rep21_7_rep(pWBG760)_GQ900473.1 107 | rep21.8_rep(SAP071A)_GQ900485.1 0 0 . . Original name was rep21_8_rep(SAP071A)_GQ900485.1 108 | rep21.9_rep(pKH12)_EU168704.2 0 0 . . Original name was rep21_9_rep(pKH12)_EU168704.2 109 | rep21.11_rep(pSA1308)_AB254848.1 0 0 . . Original name was rep21_11_rep(pSA1308)_AB254848.1 110 | rep21.12_rep(pKH3)_AF151117.1 0 0 . . Original name was rep21_12_rep(pKH3)_AF151117.1 111 | rep21.13_rep(SAP101A)_GQ900495.1 0 0 . . Original name was rep21_13_rep(SAP101A)_GQ900495.1 112 | rep21.14_rep(pKH21)_EU350088.1 0 0 . . Original name was rep21_14_rep(pKH21)_EU350088.1 113 | rep22.1_repB(pUB110)_X03408.1 0 0 . . 
Original name was rep22_1_repB(pUB110)_X03408.1 114 | rep22.2_repU(pKKS825)_FN377602.2 0 0 . . Original name was rep22_2_repU(pKKS825)_FN377602.2 115 | rep23.1_rep(pPR9)_GU237136.1 0 0 . . Original name was rep23_1_rep(pPR9)_GU237136.1 116 | rep23.2_rep(pV030-8)_EU366902.1 0 0 . . Original name was rep23_2_rep(pV030-8)_EU366902.1 117 | rep24.1_rep(pWBG745)_GQ900389.1 0 0 . . Original name was rep24_1_rep(pWBG745)_GQ900389.1 118 | rep24.2_rep(pWBG749)_GQ900391.1 0 0 . . Original name was rep24_2_rep(pWBG749)_GQ900391.1 119 | repUS1._ORF(E.faeciumContig1258)_JDOE 0 0 . . Original name was repUS1__ORF(E.faeciumContig1258)_JDOE 120 | repUS2._repA(pBI143)_BFU30316 0 0 . . Original name was repUS2__repA(pBI143)_BFU30316 121 | repUS3._repM200(pBM200)_DQ437521 0 0 . . Original name was repUS3__repM200(pBM200)_DQ437521 122 | repUS4._repA(pCI2000)_AF178424 0 0 . . Original name was repUS4__repA(pCI2000)_AF178424 123 | repUS5._CDS20(pETB)_NC003265 0 0 . . Original name was repUS5__CDS20(pETB)_NC003265 124 | repUS6._rep(pFL1)_NC002132 0 0 . . Original name was repUS6__rep(pFL1)_NC002132 125 | repUS7._rep(pHTbeta)_AB183714 0 0 . . Original name was repUS7__rep(pHTbeta)_AB183714 126 | repUS8._ORF1(pOg32)_LOPPOG32 0 0 . . Original name was repUS8__ORF1(pOg32)_LOPPOG32 127 | repUS9._rep(pSK1)_AF203376 0 0 . . Original name was repUS9__rep(pSK1)_AF203376 128 | repUS10._CDS25(pSSP1)_AP008935 0 0 . . Original name was repUS10__CDS25(pSSP1)_AP008935 129 | repUS11._CDS16(pTEF3)_AE016832 0 0 . . Original name was repUS11__CDS16(pTEF3)_AE016832 130 | repUS12._rep(pUB110)_AF181950 0 0 . . Original name was repUS12__rep(pUB110)_AF181950 131 | repUS13._CDS1(pUSA01)_NC007790 0 0 . . Original name was repUS13__CDS1(pUSA01)_NC007790 132 | repUS14._repA(VRSAp)_AP003367 0 0 . . Original name was repUS14__repA(VRSAp)_AP003367 133 | repUS15._ORF(E.faecium287)_NZAAAK010000287 0 0 . . Original name was repUS15__ORF(E.faecium287)_NZAAAK010000287 134 | repUS16._repF(pE194)_M17811.1 0 0 . . 
Original name was repUS16__repF(pE194)_M17811.1 135 | repUS17._rep(pKKS825)_FN377602.2 0 0 . . Original name was repUS17__rep(pKKS825)_FN377602.2 136 | repUS18._rep(pKKS825)_FN377602.2 0 0 . . Original name was repUS18__rep(pKKS825)_FN377602.2 137 | repUS19._rep(pDLK3)_GU562626.1 0 0 . . Original name was repUS19__rep(pDLK3)_GU562626.1 138 | repUS20._rep(pAVX)_CP001784.1 0 0 . . Original name was repUS20__rep(pAVX)_CP001784.1 139 | repUS21._rep(pWBG764)_GQ900468.1 0 0 . . Original name was repUS21__rep(pWBG764)_GQ900468.1 140 | repUS22._rep(SAP015B)_GQ900502.1 0 0 . . Original name was repUS22__rep(SAP015B)_GQ900502.1 141 | repUS23._repA(SAP099B)_GQ900449.1 0 0 . . Original name was repUS23__repA(SAP099B)_GQ900449.1 142 | RepA.1_pKPC-CAV1321_CP011611 0 0 . . Original name was RepA_1_pKPC-CAV1321_CP011611 143 | IncHI1B(R27).1_R27_AF250878 0 0 . . Original name was IncHI1B(R27)_1_R27_AF250878 144 | IncHI1B(CIT).1_pNDM-CIT_JX182975 0 0 . . Original name was IncHI1B(CIT)_1_pNDM-CIT_JX182975 145 | IncHI2.1__BX664015 0 0 . . Original name was IncHI2_1__BX664015 146 | IncI1.1_Alpha_AP005147 0 0 . . Original name was IncI1_1_Alpha_AP005147 147 | IncB/O/K/Z.1__CU928147 0 0 . . Original name was IncB/O/K/Z_1__CU928147 148 | IncB/O/K/Z.3__GQ259888 0 0 . . Original name was IncB/O/K/Z_3__GQ259888 149 | IncB/O/K/Z.2__GU256641 0 0 . . Original name was IncB/O/K/Z_2__GU256641 150 | IncB/O/K/Z.4__FN868832 0 0 . . Original name was IncB/O/K/Z_4__FN868832 151 | IncL/M(pOXA-48).1_pOXA-48_JN626286 0 0 . . Original name was IncL/M(pOXA-48)_1_pOXA-48_JN626286 152 | IncL/M.1__AF550415 0 0 . . Original name was IncL/M_1__AF550415 153 | IncI2.1_Delta_AP002527 0 0 . . Original name was IncI2_1_Delta_AP002527 154 | IncN.1__AY046276 0 0 . . Original name was IncN_1__AY046276 155 | IncFIA.1__AP001918 0 0 . . Original name was IncFIA_1__AP001918 156 | IncW.1__EF633507 0 0 . . Original name was IncW_1__EF633507 157 | IncA/C2.1__JN157804 0 0 . . 
Original name was IncA/C2_1__JN157804 158 | IncP1.1__BN000925 0 0 . . Original name was IncP1_1__BN000925 159 | IncP1.3__KX377410 0 0 . . Original name was IncP1_3__KX377410 160 | IncT.1__AP004237 0 0 . . Original name was IncT_1__AP004237 161 | IncR.1__DQ449578 0 0 . . Original name was IncR_1__DQ449578 162 | IncFII(S).1__CP000858 0 0 . . Original name was IncFII(S)_1__CP000858 163 | IncFII(K).1__CP000648 0 0 . . Original name was IncFII(K)_1__CP000648 164 | IncU.1__DQ401103 0 0 . . Original name was IncU_1__DQ401103 165 | IncX1.1__EU370913 0 0 . . Original name was IncX1_1__EU370913 166 | IncX2.1__JQ269335 0 0 . . Original name was IncX2_1__JQ269335 167 | IncY.1__K02380 0 0 . . Original name was IncY_1__K02380 168 | IncFIB(AP001918).1__AP001918 0 0 . . Original name was IncFIB(AP001918)_1__AP001918 169 | IncFII.1__AY458016 0 0 . . Original name was IncFII_1__AY458016 170 | IncFIC(FII).1__AP001918 0 0 . . Original name was IncFIC(FII)_1__AP001918 171 | IncFII(pKPX1)._AP012055 0 0 . . Original name was IncFII(pKPX1)__AP012055 172 | IncFII(Yp).1_Yersenia_CP000670 0 0 . . Original name was IncFII(Yp)_1_Yersenia_CP000670 173 | IncHI1A.1__AF250878 0 0 . . Original name was IncHI1A_1__AF250878 174 | IncHI2A.1__BX664015 0 0 . . Original name was IncHI2A_1__BX664015 175 | IncHI1B.1_pNDM-MAR_JN420336 0 0 . . Original name was IncHI1B_1_pNDM-MAR_JN420336 176 | IncFIA(HI1).1_HI1_AF250878 0 0 . . Original name was IncFIA(HI1)_1_HI1_AF250878 177 | IncHI1A(CIT).1_pNDM-CIT_JX182975 0 0 . . Original name was IncHI1A(CIT)_1_pNDM-CIT_JX182975 178 | IncN2.1__JF785549 0 0 . . Original name was IncN2_1__JF785549 179 | IncN3.1__EF219134 0 0 . . Original name was IncN3_1__EF219134 180 | IncX3.1__JN247852 0 0 . . Original name was IncX3_1__JN247852 181 | IncX3(pEC14).1_pEC14_JN935899 0 0 . . Original name was IncX3(pEC14)_1_pEC14_JN935899 182 | IncX4.1__CP002895 0 0 . . Original name was IncX4_1__CP002895 183 | IncFIB(S).1__FN432031 0 0 . . 
Original name was IncFIB(S)_1__FN432031 184 | IncFIB(Mar).1_pNDM-Mar_JN420336 0 0 . . Original name was IncFIB(Mar)_1_pNDM-Mar_JN420336 185 | IncFIB(pHCM2).1_pHCM2_AL513384 0 0 . . Original name was IncFIB(pHCM2)_1_pHCM2_AL513384 186 | IncFIB(pQil).1_pQil_JN233705 0 0 . . Original name was IncFIB(pQil)_1_pQil_JN233705 187 | IncFIB(K).1_Kpn3_JN233704 0 0 . . Original name was IncFIB(K)_1_Kpn3_JN233704 188 | IncFII(Y).1_ps_CP001049 0 0 . . Original name was IncFII(Y)_1_ps_CP001049 189 | IncL/M(pMU407).1_pMU407_U27345 0 0 . . Original name was IncL/M(pMU407)_1_pMU407_U27345 190 | IncP1.2__U67194 0 0 . . Original name was IncP1_2__U67194 191 | IncP6.1__JF785550 0 0 . . Original name was IncP6_1__JF785550 192 | IncA/C.1__FJ705807 0 0 . . Original name was IncA/C_1__FJ705807 193 | IncX5.1__JX193302 0 0 . . Original name was IncX5_1__JX193302 194 | IncX5.2__MF062700.1 0 0 . . Original name was IncX5_2__MF062700.1 195 | IncX6.1__KU302800 0 0 . . Original name was IncX6_1__KU302800 196 | IncX7.1__NC_015054 0 0 . . Original name was IncX7_1__NC_015054 197 | IncX8.1__AM942760 0 0 . . Original name was IncX8_1__AM942760 198 | IncX9.1__NC_010257 0 0 . . Original name was IncX9_1__NC_010257 199 | p0111.1__AP010962 0 0 . . Original name was p0111_1__AP010962 200 | IncFIB(pECLA).1_pECLA_CP001919 0 0 . . Original name was IncFIB(pECLA)_1_pECLA_CP001919 201 | IncFIB(pLF82).1_pLF82_CU638872 0 0 . . Original name was IncFIB(pLF82)_1_pLF82_CU638872 202 | IncFIB(pKPHS1).1_pKPHS1_CP003223 0 0 . . Original name was IncFIB(pKPHS1)_1_pKPHS1_CP003223 203 | IncFIB(pCTU1).1_pCTU1_FN543094 0 0 . . Original name was IncFIB(pCTU1)_1_pCTU1_FN543094 204 | IncFIB(pCTU3).1_pCTU3_FN543096 0 0 . . Original name was IncFIB(pCTU3)_1_pCTU3_FN543096 205 | IncFIB(pENTE01).1_pENTE01_CP000654 0 0 . . Original name was IncFIB(pENTE01)_1_pENTE01_CP000654 206 | IncFIB(pB171).1_pB171_AB024946 0 0 . . Original name was IncFIB(pB171)_1_pB171_AB024946 207 | IncFIB(pENTAS01).1_pENTAS01_CP003027 0 0 . . 
Original name was IncFIB(pENTAS01)_1_pENTAS01_CP003027 208 | IncFII(pHN7A8).1_pHN7A8_JN232517 0 0 . . Original name was IncFII(pHN7A8)_1_pHN7A8_JN232517 209 | IncFII(pRSB107).1_pRSB107_AJ851089 0 0 . . Original name was IncFII(pRSB107)_1_pRSB107_AJ851089 210 | IncFII(pSE11).1_pSE11_AP009242 0 0 . . Original name was IncFII(pSE11)_1_pSE11_AP009242 211 | IncFII(pENTA).1_pENTA_CP003027 0 0 . . Original name was IncFII(pENTA)_1_pENTA_CP003027 212 | IncFII(pECLA).1_pECLA_CP001919 0 0 . . Original name was IncFII(pECLA)_1_pECLA_CP001919 213 | IncFII(pseudo).1_pseudo_NC_011759 0 0 . . Original name was IncFII(pseudo)_1_pseudo_NC_011759 214 | IncFII(pCTU2).1_pCTU2_FN543095 0 0 . . Original name was IncFII(pCTU2)_1_pCTU2_FN543095 215 | IncFII(pYVa12790).1_pYVa12790_AY150843 0 0 . . Original name was IncFII(pYVa12790)_1_pYVa12790_AY150843 216 | IncFII(pMET).1_pMET1_EU383016 0 0 . . Original name was IncFII(pMET)_1_pMET1_EU383016 217 | IncFII(pCRY).1_pCRY_NC_005814 0 0 . . Original name was IncFII(pCRY)_1_pCRY_NC_005814 218 | IncFII(Serratia).1_Serratia_NC_009829 0 0 . . Original name was IncFII(Serratia)_1_Serratia_NC_009829 219 | IncFII(pCoo).1_pCoo_CR942285 0 0 . . Original name was IncFII(pCoo)_1_pCoo_CR942285 220 | IncFII(p96A).1_p96A_JQ418521 0 0 . . Original name was IncFII(p96A)_1_p96A_JQ418521 221 | IncFII(SARC14).1_SARC14_JQ418540 0 0 . . Original name was IncFII(SARC14)_1_SARC14_JQ418540 222 | IncFII(p14).1_p14_JQ418538 0 0 . . Original name was IncFII(p14)_1_p14_JQ418538 223 | IncFII(29).1_pUTI89_CP003035 0 0 . . Original name was IncFII(29)_1_pUTI89_CP003035 224 | IncFII.1_pSFO_AF401292 0 0 . . Original name was IncFII_1_pSFO_AF401292 225 | IncFII.1_pKP91_CP000966 0 0 . . Original name was IncFII_1_pKP91_CP000966 226 | IncFII(pAMA1167-NDM-5).1_pAMA1167-NDM-5_CP024805.1 0 0 . . Original name was IncFII(pAMA1167-NDM-5)_1_pAMA1167-NDM-5_CP024805.1 227 | IncX4.2__FN543504 0 0 . . Original name was IncX4_2__FN543504 228 | pADAP.1__AF135182 0 0 . . 
Original name was pADAP_1__AF135182 229 | pXuzhou21.1__CP001927 0 0 . . Original name was pXuzhou21_1__CP001927 230 | pSL483.1__CP001137 0 0 . . Original name was pSL483_1__CP001137 231 | pYE854.1__AM905950 0 0 . . Original name was pYE854_1__AM905950 232 | pIP32953.1__BX936400 0 0 . . Original name was pIP32953_1__BX936400 233 | pIP31758(p153).1_p153_CP000719 0 0 . . Original name was pIP31758(p153)_1_p153_CP000719 234 | pIP31758(p59).1_p59_CP000718 0 0 . . Original name was pIP31758(p59)_1_p59_CP000718 235 | pESA2.1__CP000784 0 0 . . Original name was pESA2_1__CP000784 236 | pENTAS02.1__CP003028 0 0 . . Original name was pENTAS02_1__CP003028 237 | pEC4115.1__NC_011351 0 0 . . Original name was pEC4115_1__NC_011351 238 | pJARS36.1__NC_015068 0 0 . . Original name was pJARS36_1__NC_015068 239 | pSM22.1__NC_015972 0 0 . . Original name was pSM22_1__NC_015972 240 | IncQ2.1__FJ696404 0 0 . . Original name was IncQ2_1__FJ696404 241 | IncQ1.1__M28829.1 0 0 . . Original name was IncQ1_1__M28829.1 242 | IncX1.3__CP001123 0 0 . . Original name was IncX1_3__CP001123 243 | IncX1.2__CP003417 0 0 . . Original name was IncX1_2__CP003417 244 | IncX1.4__JN935898 0 0 . . Original name was IncX1_4__JN935898 245 | ColRNAI.1__DQ298019 0 0 . . Original name was ColRNAI_1__DQ298019 246 | Col3M.1__JX514065 0 0 . . Original name was Col3M_1__JX514065 247 | Col156.1__NC_009781 0 0 . . Original name was Col156_1__NC_009781 248 | Col8282.1__DQ995352 0 0 . . Original name was Col8282_1__DQ995352 249 | Col(MG828).1__NC_008486 0 0 . . Original name was Col(MG828)_1__NC_008486 250 | ColpVC.1__JX133088 0 0 . . Original name was ColpVC_1__JX133088 251 | ColKP3.1__JN205800 0 0 . . Original name was ColKP3_1__JN205800 252 | Col(IMGS31).1__NC_011406 0 0 . . Original name was Col(IMGS31)_1__NC_011406 253 | Col(BS512).1__NC_010656 0 0 . . Original name was Col(BS512)_1__NC_010656 254 | ColE10.1__X01654 0 0 . . Original name was ColE10_1__X01654 255 | Col(IRGK).1__AY543071 0 0 . . 
Original name was Col(IRGK)_1__AY543071 256 | Col(SD853).1__NC_015392 0 0 . . Original name was Col(SD853)_1__NC_015392 257 | Col(YC).1__NC_002144 0 0 . . Original name was Col(YC)_1__NC_002144 258 | Col(pWES).1__DQ268764 0 0 . . Original name was Col(pWES)_1__DQ268764 259 | Col(Ye4449).1__FJ696405 0 0 . . Original name was Col(Ye4449)_1__FJ696405 260 | Col(YF27601).1__JF937655 0 0 . . Original name was Col(YF27601)_1__JF937655 261 | Col(MP18).1__NC_013652 0 0 . . Original name was Col(MP18)_1__NC_013652 262 | Col(KPHS6).1__NC_016841 0 0 . . Original name was Col(KPHS6)_1__NC_016841 263 | Col(VCM04).1__HM231165 0 0 . . Original name was Col(VCM04)_1__HM231165 264 | Col(MGD2).1__NC_003789 0 0 . . Original name was Col(MGD2)_1__NC_003789 265 | IncI2.1__KP347127 0 0 . . Original name was IncI2_1__KP347127 266 | repA.1_pKPC-2_CP013325 0 0 . . Original name was repA_1_pKPC-2_CP013325 267 | repA.2_pKPC-2_JX397875 0 0 . . Original name was repA_2_pKPC-2_JX397875 268 | Rep.1_pKPC-2_CP011573 0 0 . . Original name was Rep_1_pKPC-2_CP011573 269 | FII(pBK30683).1__KF954760 0 0 . . Original name was FII(pBK30683)_1__KF954760 270 | FIA(pBK30683).1__KF954760 0 0 . . Original name was FIA(pBK30683)_1__KF954760 271 | Col440I.1__CP023920.1 0 0 . . Original name was Col440I_1__CP023920.1 272 | Col440II.1__CP023921.1 0 0 . . 
Original name was Col440II_1__CP023921.1 273 | -------------------------------------------------------------------------------- /tiptoft/tests/Blocks_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | from tiptoft.Blocks import Blocks 3 | 4 | 5 | class TestBlocks(unittest.TestCase): 6 | 7 | def test_two_blocks(self): 8 | b = Blocks(7, 7, 2, 5) 9 | hits = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 10 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1] 11 | self.assertEqual(b.find_all_blocks(hits), [[3, 14], [27, 36]]) 12 | 13 | def test_merging_blocks(self): 14 | b = Blocks(7, 7, 3, 5) 15 | blocks = [[4, 8], [10, 14], [20, 24], [26, 30]] 16 | self.assertEqual(b.merge_blocks(blocks), [ 17 | [4, 14], [4, 14], [20, 30], [20, 30]]) 18 | 19 | def test_largest_block(self): 20 | b = Blocks(7, 7, 2, 5) 21 | hits = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 22 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1] 23 | self.assertEqual(b.find_largest_block(hits), (3, 14)) 24 | 25 | def test_adjust_block_start(self): 26 | b = Blocks(7, 0, 0, 10) 27 | self.assertEqual(b.adjust_block_start(10), 60) 28 | self.assertEqual(b.adjust_block_start(1), 0) 29 | 30 | def test_adjust_block_end(self): 31 | b = Blocks(7, 0, 0, 10) 32 | self.assertEqual(b.adjust_block_end(10, 90), 80) 33 | self.assertEqual(b.adjust_block_end(10, 80), 80) 34 | self.assertEqual(b.adjust_block_end(10, 50), 50) 35 | -------------------------------------------------------------------------------- /tiptoft/tests/Fasta_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import os 3 | import logging 4 | from tiptoft.Fasta import Fasta 5 | 6 | test_modules_dir = os.path.dirname(os.path.realpath(__file__)) 7 | data_dir = os.path.join(test_modules_dir, 8 | 'data', 9 | 'fasta') 10 | 11 | 12 | class TestFasta(unittest.TestCase): 13 | 14 | def test_four_kmers(self): 15 | 
logger = logging.getLogger(__name__) 16 | f = Fasta(logger, 17 | os.path.join(data_dir, 18 | 'sample1.fa'), 19 | 4, 20 | False) 21 | 22 | self.assertEqual(f.sequence_kmers(), 23 | {'gene1': { 24 | 'GCAA': 0, 25 | 'CAAC': 0, 26 | 'AACG': 0, 27 | 'ACGC': 0, 28 | 'CGCT': 0, 29 | 'GCTG': 0, 30 | 'CTGA': 0, 31 | 'TGAC': 0, 32 | 'GACG': 0, 33 | 'ACGG': 0, 34 | 'CGGA': 0, 35 | 'GGAA': 0, 36 | 'GAAA': 0, 37 | 'AAAA': 0, 38 | 'AAAC': 0, 39 | 'ACGA': 0, 40 | 'CGAT': 0, 41 | 'GATC': 0, 42 | 'ATCT': 0, 43 | 'TCTG': 0, 44 | 'CTGG': 0, 45 | 'TGGT': 0, 46 | 'GGTT': 0, 47 | 'GTTT': 0, 48 | 'TTTT': 0, 49 | 'TTTG': 0, 50 | 'TTGC': 0, 51 | 'TGCC': 0, 52 | 'GCCC': 0, 53 | 'CCCT': 0, 54 | 'CCTT': 0, 55 | 'CTTT': 0, 56 | 'TTTC': 0, 57 | 'TTCA': 0, 58 | 'TCAC': 0, 59 | 'CACA': 0, 60 | 'ACAG': 0, 61 | 'CAGC': 0}, 62 | 'gene2': { 63 | 'GGCC': 0, 64 | 'GCCG': 0, 65 | 'CCGC': 0, 66 | 'CGCA': 0, 67 | 'GCAC': 0, 68 | 'CACC': 0, 69 | 'ACCA': 0, 70 | 'CCAC': 0, 71 | 'CACG': 0, 72 | 'ACGG': 0, 73 | 'CGGC': 0, 74 | 'GGCG': 0, 75 | 'GCGC': 0, 76 | 'CGCT': 0, 77 | 'GCTC': 0, 78 | 'CTCG': 0, 79 | 'TCGC': 0, 80 | 'CGCC': 0, 81 | 'GCCC': 0, 82 | 'CCCT': 0, 83 | 'CCTT': 0, 84 | 'CTTC': 0, 85 | 'TTCA': 0}, 86 | 'gene3': { 87 | 'CGCG': 0, 88 | 'GCGC': 0, 89 | 'CGCT': 0, 90 | 'GCTG': 0, 91 | 'CTGA': 0, 92 | 'TGAT': 0, 93 | 'GATT': 0, 94 | 'ATTT': 0, 95 | 'TTTT': 0, 96 | 'TTTG': 0, 97 | 'TTGC': 0, 98 | 'TGCG': 0, 99 | 'GCGT': 0, 100 | 'CGTG': 0, 101 | 'GTGG': 0, 102 | 'TGGC': 0, 103 | 'GGCA': 0, 104 | 'GCAA': 0, 105 | 'CAAT': 0, 106 | 'AATG': 0, 107 | 'ATGG': 0, 108 | 'GGCG': 0, 109 | 'GCGG': 0, 110 | 'CGGC': 0}}) 111 | self.assertEqual(f.all_kmers_in_file(), 112 | {'GCAA': 2, 113 | 'CAAC': 1, 114 | 'AACG': 1, 115 | 'ACGC': 1, 116 | 'CGCT': 3, 117 | 'GCTG': 2, 118 | 'CTGA': 2, 119 | 'TGAC': 1, 120 | 'GACG': 1, 121 | 'ACGG': 2, 122 | 'CGGA': 1, 123 | 'GGAA': 1, 124 | 'GAAA': 1, 125 | 'AAAA': 1, 126 | 'AAAC': 1, 127 | 'ACGA': 1, 128 | 'CGAT': 1, 129 | 'GATC': 1, 130 | 'ATCT': 1, 131 | 'TCTG': 1, 132 | 'CTGG': 
1, 133 | 'TGGT': 1, 134 | 'GGTT': 1, 135 | 'GTTT': 1, 136 | 'TTTT': 2, 137 | 'TTTG': 2, 138 | 'TTGC': 2, 139 | 'TGCC': 1, 140 | 'GCCC': 2, 141 | 'CCCT': 2, 142 | 'CCTT': 2, 143 | 'CTTT': 1, 144 | 'TTTC': 1, 145 | 'TTCA': 2, 146 | 'TCAC': 1, 147 | 'CACA': 1, 148 | 'ACAG': 1, 149 | 'CAGC': 1, 150 | 'GGCC': 1, 151 | 'GCCG': 1, 152 | 'CCGC': 1, 153 | 'CGCA': 1, 154 | 'GCAC': 1, 155 | 'CACC': 1, 156 | 'ACCA': 1, 157 | 'CCAC': 1, 158 | 'CACG': 1, 159 | 'CGGC': 2, 160 | 'GGCG': 2, 161 | 'GCGC': 2, 162 | 'GCTC': 1, 163 | 'CTCG': 1, 164 | 'TCGC': 1, 165 | 'CGCC': 1, 166 | 'CTTC': 1, 167 | 'CGCG': 1, 168 | 'TGAT': 1, 169 | 'GATT': 1, 170 | 'ATTT': 1, 171 | 'TGCG': 1, 172 | 'GCGT': 1, 173 | 'CGTG': 1, 174 | 'GTGG': 1, 175 | 'TGGC': 1, 176 | 'GGCA': 1, 177 | 'CAAT': 1, 178 | 'AATG': 1, 179 | 'ATGG': 1, 180 | 'GCGG': 1}) 181 | 182 | def test_four_kmers_with_hc(self): 183 | logger = logging.getLogger(__name__) 184 | f = Fasta(logger, 185 | os.path.join(data_dir, 186 | 'sample1.fa'), 187 | 4, 188 | True) 189 | 190 | self.assertEqual(f.sequence_kmers(), 191 | {'gene1': { 192 | 'GCAC': 0, 193 | 'CACG': 0, 194 | 'ACGC': 0, 195 | 'CGCT': 0, 196 | 'GCTG': 0, 197 | 'CTGA': 0, 198 | 'TGAC': 0, 199 | 'GACG': 0, 200 | 'ACGA': 0, 201 | 'CGAC': 0, 202 | 'CGAT': 0, 203 | 'GATC': 0, 204 | 'ATCT': 0, 205 | 'TCTG': 0, 206 | 'CTGT': 0, 207 | 'TGTG': 0, 208 | 'GTGC': 0, 209 | 'TGCT': 0, 210 | 'GCTC': 0, 211 | 'CTCA': 0, 212 | 'TCAC': 0, 213 | 'CACA': 0, 214 | 'ACAG': 0, 215 | 'CAGC': 0}, 216 | 'gene2': { 217 | 'GCGC': 0, 218 | 'CGCA': 0, 219 | 'GCAC': 0, 220 | 'CACA': 0, 221 | 'ACAC': 0, 222 | 'CACG': 0, 223 | 'ACGC': 0, 224 | 'CGCG': 0, 225 | 'CGCT': 0, 226 | 'GCTC': 0, 227 | 'CTCG': 0, 228 | 'TCGC': 0, 229 | 'CTCA': 0}, 230 | 'gene3': { 231 | 'CGCG': 0, 232 | 'GCGC': 0, 233 | 'CGCT': 0, 234 | 'GCTG': 0, 235 | 'CTGA': 0, 236 | 'TGAT': 0, 237 | 'GATG': 0, 238 | 'ATGC': 0, 239 | 'TGCG': 0, 240 | 'GCGT': 0, 241 | 'CGTG': 0, 242 | 'GTGC': 0, 243 | 'TGCA': 0, 244 | 'GCAT': 0, 245 | 'CATG': 0}}) 246 | 
247 | self.assertEqual(f.all_kmers_in_file(), 248 | {'GCAC': 2, 249 | 'CACG': 2, 250 | 'ACGC': 2, 251 | 'CGCT': 3, 252 | 'GCTG': 2, 253 | 'CTGA': 2, 254 | 'TGAC': 1, 255 | 'GACG': 1, 256 | 'ACGA': 1, 257 | 'CGAC': 1, 258 | 'CGAT': 1, 259 | 'GATC': 1, 260 | 'ATCT': 1, 261 | 'TCTG': 1, 262 | 'CTGT': 1, 263 | 'TGTG': 1, 264 | 'GTGC': 2, 265 | 'TGCT': 1, 266 | 'GCTC': 2, 267 | 'CTCA': 2, 268 | 'TCAC': 1, 269 | 'CACA': 2, 270 | 'ACAG': 1, 271 | 'CAGC': 1, 272 | 'GCGC': 2, 273 | 'CGCA': 1, 274 | 'ACAC': 1, 275 | 'CGCG': 2, 276 | 'CTCG': 1, 277 | 'TCGC': 1, 278 | 'TGAT': 1, 279 | 'GATG': 1, 280 | 'ATGC': 1, 281 | 'TGCG': 1, 282 | 'GCGT': 1, 283 | 'CGTG': 1, 284 | 'TGCA': 1, 285 | 'GCAT': 1, 286 | 'CATG': 1}) 287 | -------------------------------------------------------------------------------- /tiptoft/tests/Fastq_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import os 3 | import logging 4 | import filecmp 5 | from tiptoft.Fasta import Fasta 6 | from tiptoft.Fastq import Fastq 7 | from tiptoft.Fastq import Gene 8 | 9 | test_modules_dir = os.path.dirname(os.path.realpath(__file__)) 10 | data_dir = os.path.join(test_modules_dir, 11 | 'data', 12 | 'fastq') 13 | 14 | 15 | class TestFastq(unittest.TestCase): 16 | 17 | def test_four_kmers(self): 18 | logger = logging.getLogger(__name__) 19 | logger.setLevel(logging.ERROR) 20 | fasta = Fasta(logger, 21 | os.path.join( 22 | data_dir, 23 | 'plasmid_data.fa'), 24 | 4, 25 | True) 26 | 27 | fastq = Fastq(logger, 28 | os.path.join(data_dir, 29 | 'query.fastq'), 30 | 4, 31 | fasta.all_kmers_in_file(), 32 | 1, 33 | 50, 34 | None, 35 | None, 36 | fasta, 37 | True) 38 | 39 | self.assertTrue(fastq.read_filter_and_map()) 40 | 41 | def test_reverse(self): 42 | logger = logging.getLogger(__name__) 43 | logger.setLevel(logging.ERROR) 44 | fasta = Fasta(logger, 45 | os.path.join( 46 | data_dir, 47 | 'plasmid_data.fa'), 48 | 4, 49 | True) 50 | 51 | fastq = Fastq(logger, 52 | 
os.path.join(data_dir, 53 | 'reverse.fastq'), 54 | 4, 55 | fasta.all_kmers_in_file(), 56 | 1, 57 | 50, 58 | None, 59 | None, 60 | fasta, 61 | True) 62 | 63 | self.assertTrue(fastq.read_filter_and_map()) 64 | 65 | def test_gzipped_input(self): 66 | logger = logging.getLogger(__name__) 67 | logger.setLevel(logging.ERROR) 68 | fasta = Fasta(logger, 69 | os.path.join( 70 | data_dir, 71 | 'plasmid_data.fa'), 72 | 4, 73 | True) 74 | 75 | fastq = Fastq(logger, 76 | os.path.join(data_dir, 77 | 'query_gz.fastq.gz'), 78 | 4, 79 | fasta.all_kmers_in_file(), 80 | 1, 81 | 50, 82 | None, 83 | None, 84 | fasta, 85 | True) 86 | self.assertTrue(fastq.read_filter_and_map()) 87 | 88 | def test_writing_to_output_file(self): 89 | logger = logging.getLogger(__name__) 90 | logger.setLevel(logging.ERROR) 91 | fasta = Fasta(logger, 92 | os.path.join( 93 | data_dir, 94 | 'plasmid_data.fa'), 95 | 4, 96 | True) 97 | 98 | fastq = Fastq(logger, 99 | os.path.join(data_dir, 100 | 'query_gz.fastq.gz'), 101 | 4, 102 | fasta.all_kmers_in_file(), 103 | 1, 104 | 50, 105 | 'outputfile', 106 | None, 107 | fasta, 108 | True) 109 | fastq.read_filter_and_map() 110 | self.assertTrue(os.path.exists('outputfile')) 111 | self.assertTrue(filecmp.cmp(os.path.join( 112 | data_dir, 113 | 'expected_outputfile'), 114 | 'outputfile')) 115 | os.remove('outputfile') 116 | 117 | def test_with_nonmatching_read(self): 118 | logger = logging.getLogger(__name__) 119 | logger.setLevel(logging.ERROR) 120 | fasta = Fasta(logger, 121 | os.path.join( 122 | data_dir, 123 | 'plasmid_data.fa'), 124 | 4, 125 | True) 126 | 127 | fastq = Fastq(logger, 128 | os.path.join(data_dir, 129 | 'query.fastq'), 130 | 4, 131 | fasta.all_kmers_in_file(), 132 | 1, 133 | 50, 134 | None, 135 | None, 136 | fasta, 137 | True) 138 | 139 | self.assertFalse( 140 | fastq.does_read_contain_quick_pass_kmers("AAAAAAAAAAAAAAAA")) 141 | 142 | def test_with_matching_read(self): 143 | logger = logging.getLogger(__name__) 144 | logger.setLevel(logging.ERROR) 145
| fasta = Fasta(logger, 146 | os.path.join( 147 | data_dir, 148 | 'plasmid_data.fa'), 149 | 11, 150 | True) 151 | 152 | fastq = Fastq(logger, 153 | os.path.join(data_dir, 154 | 'query.fastq'), 155 | 11, 156 | fasta.all_kmers_in_file(), 157 | 1, 158 | 50, 159 | None, 160 | None, 161 | fasta, 162 | True) 163 | 164 | self.assertTrue(fastq.does_read_contain_quick_pass_kmers( 165 | "ATCAATACCTTCTTTATTGATTTTGATATTCACACGGCAAAAGAAACTATTTCAGCAAGCGATA" 166 | "ATTTTAACAACCGCTATTGATTTAGGTTTTATGCCTACTATGATTATCAAATCTGATAAAGGTT" 167 | "ATCAAGCATATTTTGTTTTAGAAACGCCAGTCTATGTGACTTCAAAATCAGAATTTAAATCTGT" 168 | "CAAAGCAGCCAAAATAATTTCGCAAAATATCCGAGAATATTTTGGAAAGTCTTTGCCAGTTGAT" 169 | "CTAACGTGTAATCATTTTGGTATTGCTCGCATACCAAGAACGGACAATGTAGAATTTTTTGATC" 170 | "CTAATTACCGTTATTCTTTCAAAGAATGGCAAGATTGGTCTTTCAAACAAACAGATAATAAGGG" 171 | "CTTTACTCGTTCAAGTCTAACGGTTTTAAGCGGTACAGAAGGCAAAAAACAAGTAGATGAACCC" 172 | "TGGTTTAATCTCTTATTGCACGAAACGAAATTTTCAGGAGAAAAGGGTTTAATAGGGCGTAATA" 173 | "ACGTCATGTTTACCCTCTCTTTAGCCTACTTTAGTTCAGGCTATTCAATCGAAACGTGCGAATA" 174 | "TAATATGTTTGAGTTTAATAATCGATTAGATCAACCCTTAGAAGAAAAAGAAGTAATCAAAATT" 175 | "GTTAGAAGTGCCTATTCAGAAAACTATCAAGGGGCTAATAGGGAATACATTACCATTCTTTGCA" 176 | "AAGCTTGGGTATCAAGTGATTTAACCAGTAAAGATTTATTTGTCCGTCAAGGGTGGTTTAAATT" 177 | "CAAGAAAAAAAGAAGCGAACGTCAACGTGTTCATTTGTCAGAATGGAAAGAAGATTTAATGGCT" 178 | "TATATTAGCGAAAAAAGCGATGTATACAAGCCTTATTTAGTGACGACCAAAAAAGAGATTAGAG" 179 | "AAGTG")) 180 | 181 | def test_filtering_alleles_one_complete(self): 182 | logger = logging.getLogger(__name__) 183 | logger.setLevel(logging.ERROR) 184 | fastq = Fastq(logger, 185 | os.path.join(data_dir, 186 | 'query.fastq'), 187 | 188 | 11, 189 | None, 190 | 1, 191 | 50, 192 | None, 193 | None, 194 | None, 195 | True) 196 | 197 | input_alleles = [Gene('rep7.1_repC(Cassette)_AB037671', 198 | 9, 199 | 1), 200 | Gene('rep7.5_CDS1(pKC5b)_AF378372', 201 | 8, 202 | 2), 203 | Gene( 204 | 'rep7.6_ORF(pKH1)_SAU38656', 205 | 10, 206 | 0), 207 | Gene('repUS14.1_repA(VRSAp)_AP003367', 208 | 10, 209 
| 0)] 210 | expected_allele_names = ['rep7.6', 211 | 'repUS14.1'] 212 | filtered_alleles = fastq.filter_contained_alleles(input_alleles) 213 | self.assertEqual(expected_allele_names, 214 | sorted( 215 | list(map(lambda x: x.short_name(), 216 | filtered_alleles)))) 217 | 218 | def test_filtering_alleles_all_partial(self): 219 | logger = logging.getLogger(__name__) 220 | logger.setLevel(logging.ERROR) 221 | fastq = Fastq(logger, 222 | os.path.join(data_dir, 223 | 'query.fastq'), 224 | 225 | 11, 226 | None, 227 | 1, 228 | 50, 229 | None, 230 | None, 231 | None, 232 | True) 233 | 234 | input_alleles = [Gene('rep7.1_repC(Cassette)_AB037671', 235 | 9, 236 | 1), 237 | Gene('rep7.5_CDS1(pKC5b)_AF378372', 238 | 8, 239 | 2), 240 | Gene( 241 | 'rep7.6_ORF(pKH1)_SAU38656', 242 | 7, 243 | 3), 244 | Gene('repUS14.1_repA(VRSAp)_AP003367', 245 | 10, 246 | 0)] 247 | expected_allele_names = ['rep7.1', 248 | 'repUS14.1'] 249 | filtered_alleles = fastq.filter_contained_alleles(input_alleles) 250 | self.assertEqual(expected_allele_names, 251 | sorted( 252 | list(map(lambda x: x.short_name(), 253 | filtered_alleles)))) 254 | 255 | def test_filtering_alleles_partial_equal_values(self): 256 | logger = logging.getLogger(__name__) 257 | logger.setLevel(logging.ERROR) 258 | fastq = Fastq(logger, 259 | os.path.join(data_dir, 260 | 'query.fastq'), 261 | 262 | 11, 263 | None, 264 | 1, 265 | 50, 266 | None, 267 | None, 268 | None, 269 | True) 270 | 271 | input_alleles = [Gene('rep7.1_repC(Cassette)_AB037671', 272 | 9, 273 | 1), 274 | Gene('rep7.5_CDS1(pKC5b)_AF378372', 275 | 9, 276 | 1), 277 | Gene( 278 | 'rep7.6_ORF(pKH1)_SAU38656', 279 | 9, 280 | 1), 281 | Gene('repUS14.1_repA(VRSAp)_AP003367', 282 | 10, 283 | 0)] 284 | expected_allele_names = ['rep7.1', 285 | 'repUS14.1'] 286 | filtered_alleles = fastq.filter_contained_alleles(input_alleles) 287 | self.assertEqual(expected_allele_names, 288 | sorted( 289 | list(map(lambda x: x.short_name(), 290 | filtered_alleles)))) 291 | 292 | def 
test_filtering_alleles_all_complete(self): 293 | logger = logging.getLogger(__name__) 294 | logger.setLevel(logging.ERROR) 295 | fastq = Fastq(logger, 296 | os.path.join(data_dir, 297 | 'query.fastq'), 298 | 299 | 11, 300 | None, 301 | 1, 302 | 50, 303 | None, 304 | None, 305 | None, 306 | True) 307 | 308 | input_alleles = [Gene('rep7.1_repC(Cassette)_AB037671', 309 | 10, 310 | 0), 311 | Gene('rep7.5_CDS1(pKC5b)_AF378372', 312 | 10, 313 | 0), 314 | Gene( 315 | 'rep7.6_ORF(pKH1)_SAU38656', 316 | 10, 317 | 0), 318 | Gene('repUS14.1_repA(VRSAp)_AP003367', 319 | 10, 320 | 0)] 321 | expected_allele_names = ['rep7.1', 322 | 'rep7.5', 323 | 'rep7.6', 324 | 'repUS14.1'] 325 | filtered_alleles = fastq.filter_contained_alleles(input_alleles) 326 | self.assertEqual(expected_allele_names, 327 | sorted( 328 | list(map(lambda x: x.short_name(), 329 | filtered_alleles)))) 330 | -------------------------------------------------------------------------------- /tiptoft/tests/Gene_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | from tiptoft.Gene import Gene 3 | 4 | 5 | class TestGene(unittest.TestCase): 6 | 7 | def test_full_coverage(self): 8 | g = Gene('rep5.1_rep(pMW2)_NC005011', 10, 0) 9 | self.assertTrue(g.is_full_coverage()) 10 | self.assertEqual(str( 11 | g), 12 | 'rep5.1 Full 100 NC005011 plasmidfinder' 13 | ' rep5.1_rep(pMW2)_NC005011') 14 | 15 | def test_no_coverage(self): 16 | g = Gene('rep5.1_rep(pMW2)_NC005011', 0, 10) 17 | self.assertFalse(g.is_full_coverage()) 18 | self.assertEqual(str( 19 | g), 'rep5.1 Partial 0 NC005011 plasmidfinder' 20 | ' rep5.1_rep(pMW2)_NC005011') 21 | 22 | def test_medium_coverage(self): 23 | g = Gene('rep5.1_rep(pMW2)_NC005011', 5, 5) 24 | self.assertFalse(g.is_full_coverage()) 25 | self.assertEqual(str( 26 | g), 'rep5.1 Partial 50 NC005011 plasmidfinder' 27 | ' rep5.1_rep(pMW2)_NC005011') 28 | 29 | def test_nc_name(self): 30 | g = Gene('Col(BS512).1__NC_010656.2', 5, 5) 31 | 
self.assertEqual(g.accession(), "NC_010656.2") 32 | self.assertEqual(g.short_name(), "Col(BS512).1") 33 | 34 | def test_rep_name(self): 35 | g = Gene('rep5.1_rep(pMW2)_NC005011', 5, 5) 36 | self.assertEqual(g.short_name(), "rep5.1") 37 | self.assertEqual(g.accession(), "NC005011") 38 | 39 | def test_rep_with_slash_name(self): 40 | g = Gene('rep6.1_repA(p703/5)_AF109375', 5, 5) 41 | self.assertEqual(g.short_name(), "rep6.1") 42 | self.assertEqual(g.accession(), "AF109375") 43 | 44 | def test_inc_double_underscore_name(self): 45 | g = Gene('IncFII(S).1__CP000851', 5, 5) 46 | self.assertEqual(g.short_name(), "IncFII(S).1") 47 | self.assertEqual(g.accession(), "CP000851") 48 | -------------------------------------------------------------------------------- /tiptoft/tests/Kmers_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import os 3 | from tiptoft.Kmers import Kmers 4 | 5 | test_modules_dir = os.path.dirname(os.path.realpath(__file__)) 6 | data_dir = os.path.join(test_modules_dir, 'data', 'kmers') 7 | 8 | 9 | class TestKmers(unittest.TestCase): 10 | 11 | def test_four_kmers(self): 12 | k = Kmers('AAAAATTTTT', 4, False) 13 | self.assertEqual(k.get_all_kmers_counter(), { 14 | 'AAAA': 0, 15 | 'AAAT': 0, 16 | 'AATT': 0, 17 | 'ATTT': 0, 18 | 'TTTT': 0}) 19 | 20 | def test_four_kmers_compression_all_repeats(self): 21 | k = Kmers('AAAAATTTTTGGGGGCCCCC', 4, True) 22 | self.assertEqual(k.get_all_kmers_counter(), {'ATGC': 0}) 23 | 24 | def test_four_kmers_compression(self): 25 | k = Kmers('GCAAAAATTTTTGC', 4, True) 26 | self.assertEqual(k.get_all_kmers_counter(), { 27 | 'GCAT': 0, 'CATG': 0, 'ATGC': 0}) 28 | 29 | def test_short_sequence(self): 30 | k = Kmers('A', 10, False) 31 | self.assertEqual(k.get_all_kmers_counter(), {}) 32 | -------------------------------------------------------------------------------- /tiptoft/tests/Read_test.py:
-------------------------------------------------------------------------------- 1 | import unittest 2 | import os 3 | from tiptoft.Read import Read 4 | 5 | test_modules_dir = os.path.dirname(os.path.realpath(__file__)) 6 | data_dir = os.path.join(test_modules_dir, 'data', 'read') 7 | 8 | 9 | class TestRead(unittest.TestCase): 10 | 11 | def test_initialise(self): 12 | read = Read() 13 | self.assertEqual(read.id, None) 14 | self.assertEqual(read.seq, None) 15 | self.assertEqual(read.qual, None) 16 | 17 | def test_one_read(self): 18 | f = open(os.path.join(data_dir, 'sample.fastq')) 19 | read = Read() 20 | read.get_next_from_file(f) 21 | self.assertEqual(read.id, 'read1') 22 | self.assertEqual(read.seq, 'AAAAAAAAAAAAGGGGGGGGGGGGGGAAAAAA') 23 | self.assertEqual(read.qual, '77777777777788888888888888777777') 24 | f.close() 25 | 26 | def test_subsequence(self): 27 | f = open(os.path.join(data_dir, 'sample.fastq')) 28 | read = Read() 29 | read.get_next_from_file(f) 30 | self.assertEqual(read.id, 'read1') 31 | self.assertEqual(read.seq, 'AAAAAAAAAAAAGGGGGGGGGGGGGGAAAAAA') 32 | self.assertEqual(read.qual, '77777777777788888888888888777777') 33 | 34 | sub_read = read.subsequence(12, 26) 35 | self.assertEqual(sub_read.id, 'read1_12_26') 36 | self.assertEqual(sub_read.seq, 'GGGGGGGGGGGGGG') 37 | self.assertEqual(sub_read.qual, '88888888888888') 38 | f.close() 39 | -------------------------------------------------------------------------------- /tiptoft/tests/data/fasta/sample1.fa: -------------------------------------------------------------------------------- 1 | >gene1 2 | GCAACGCTGACGGAAAACGATCTGGTTTTTGCCCTTTCACAGC 3 | >gene2 4 | GGCCGCACCACGGCGCTCGCCCTTCA 5 | >gene3 6 | CGCGCGCTGATTTTGCGTGGCAATGGCGGC -------------------------------------------------------------------------------- /tiptoft/tests/data/fastq/expected_outputfile: -------------------------------------------------------------------------------- 1 | GENE COMPLETENESS %COVERAGE ACCESSION DATABASE PRODUCT 2 | 
rep1.1 Partial 98 AF007787 plasmidfinder rep1.1_repE(pAMbeta1)_AF007787 3 | -------------------------------------------------------------------------------- /tiptoft/tests/data/fastq/plasmid_data.fa: -------------------------------------------------------------------------------- 1 | >rep1.1_repE(pAMbeta1)_AF007787 2 | ATGAATATCCCTTTTGTTGTAGAAACTGTGCTTCATGACGGCTTGTTAAAGTACAAATTT 3 | AAAAATAGTAAAATTCGCTCAATCACTACCAAGCCAGGTAAAAGCAAAGGGGCTATTTTT 4 | GCGTATCGCTCAAAATCAAGCATGATTGGCGGTCGTGGTGTTGTTCTGACTTCCGAGGAA 5 | GCGATTCAAGAAAATCAAGATACATTTACACATTGGACACCCAACGTTTATCGTTATGGA 6 | ACGTATGCAGACGAAAACCGTTCATACACGAAAGGACATTCTGAAAACAATTTAAGACAA 7 | ATCAATACCTTCTTTATTGATTTTGATATTCACACGGCAAAAGAAACTATTTCAGCAAGC 8 | GATATTTTAACAACCGCTATTGATTTAGGTTTTATGCCTACTATGATTATCAAATCTGAT 9 | AAAGGTTATCAAGCATATTTTGTTTTAGAAACGCCAGTCTATGTGACTTCAAAATCAGAA 10 | TTTAAATCTGTCAAAGCAGCCAAAATAATTTCGCAAAATATCCGAGAATATTTTGGAAAG 11 | TCTTTGCCAGTTGATCTAACGTGTAATCATTTTGGTATTGCTCGCATACCAAGAACGGAC 12 | AATGTAGAATTTTTTGATCCTAATTACCGTTATTCTTTCAAAGAATGGCAAGATTGGTCT 13 | TTCAAACAAACAGATAATAAGGGCTTTACTCGTTCAAGTCTAACGGTTTTAAGCGGTACA 14 | GAAGGCAAAAAACAAGTAGATGAACCCTGGTTTAATCTCTTATTGCACGAAACGAAATTT 15 | TCAGGAGAAAAGGGTTTAATAGGGCGTAATAACGTCATGTTTACCCTCTCTTTAGCCTAC 16 | TTTAGTTCAGGCTATTCAATCGAAACGTGCGAATATAATATGTTTGAGTTTAATAATCGA 17 | TTAGATCAACCCTTAGAAGAAAAAGAAGTAATCAAAATTGTTAGAAGTGCCTATTCAGAA 18 | AACTATCAAGGGGCTAATAGGGAATACATTACCATTCTTTGCAAAGCTTGGGTATCAAGT 19 | GATTTAACCAGTAAAGATTTATTTGTCCGTCAAGGGTGGTTTAAATTCAAGAAAAAAAGA 20 | AGCGAACGTCAACGTGTTCATTTGTCAGAATGGAAAGAAGATTTAATGGCTTATATTAGC 21 | GAAAAAAGCGATGTATACAAGCCTTATTTAGTGACGACCAAAAAAGAGATTAGAGAAGTG 22 | CTAGGCATTCCTGAACGGACATTAGATAAATTGCTGAAGGTACTGAAGGCGAATCAGGAA 23 | ATTTTCTTTAAGATTAAACCAGGAAGAAATGGTGGCATTCAACTTGCTAGTGTTAAATCA 24 | TTGTTGCTATCGATCATTAAAGTAAAAAAAGAAGAAAAAGAAAGCTATATAAAGGCGCTG 25 | ACAAATTCTTTTGACTTAGAGCATACATTCATTCAAGAGACTTTAAACAAGCTAGCAGAA 26 | CGCCCTAAAACGGACACACAACTCGATTTGTTTAGCTATGATACAGGCTGA 27 | >rep1.2_repS(pBT233)_X64695 28 | 
ATGAATATCCCTTTTGTTGTAGAAACTGTGCTTCATGACGGCTTGTTAAAGTACAAATTT 29 | AAAAATAGTAAAATTCGCTCAATCACTACCAAGCCAGGTAAAAGCAAAGGGGCTATTTTT 30 | GCGTATCGCTCAAAAAAAAGCATGATTGGCGGACGTGGCGTTGTTCTAACTTCCGAAGAA 31 | GCGATTCACGAAAATCAAGATACATTTACGCATTGGACACCAAACGTTTATCGTTATGGT 32 | ACGTATGCAGACGAAAACCGTTCATACACTAAAGGACATTCTGAAAACAATTTAAGACAA 33 | ATCAATACCTTCTTTATTGATTTTGATATTCACACGGAAAAAGAAACTATTTCAGCAAGC 34 | GATATTTTAACAACAGCTATTGATTTAGGTTTTATGCCTACGTTAATTATCAAATCTGAT 35 | AAAGGTTATCAAGCATATTTTGTTTTAGAAACGCCAGTCTATGTGACTTCAAAATCAGAA 36 | TTTAAATCTGTCAAAGCAGCCAAAATAATCTCGCAAAATATCCGTGAATATTTTGGAAAG 37 | TCTTTGCCAGTTGATCTAACGTGCAATCATTTTGGGATTGCTCGTATACCAAGAACGGAC 38 | AATGTCGAATTTTTTGATCCCAATTACCGTTATTCTTTCAAAGAATGGCAAGATTGGTCT 39 | TTCAAACAAACAGATAATAAGGGCTTTACTCGTTCAAGTCTAATGGTTTTAAGCGGTACA 40 | GAAGGCAAAAAACAAGTAGATGAACCCTGGTTTAATCTCTTATTGCACGAAACGAAATTT 41 | TCAGGAGAAAAGGGTTTAGTAGGGCGTAATAGCGTTATGTTTACCCTCTCTTTAGCCTAC 42 | TTTAGTTCAGGCTATTCAATCGAAACGTGCGAATATAATATGTTTGAGTTTAATAATCGA 43 | TTAGATCAACCCTTAGAAGAAAAAGAAGTGATCAAACTTGTTAGAAGTGCCTACTCAGAA 44 | AACTATCAAGGGGCTAATAGGGAATACATTACCATTCTTTGCAAAGCTTGGGTATCAAGT 45 | GATTTAACCAGTAAAGATTTATTTGTCCGTCAAGGGTGGTTTAAATTCAAGAAAAAAAGA 46 | AGTGAACGTCAACGTGTTCATTTGTCAGAATGGAAAGAAGATTTAATGGCTTATATTAGC 47 | GAAAAAAGCGATGTATACAAGCCTTATTTAGTGACGACCAAAAAAGAGATTAGAGAAGCG 48 | CTAGGCATTCCTGAACGTACGCTAGATAAGCTATTGAAGGTATTAAAAGCGAATCAAGAA 49 | ATCTTCTTTAAGATTAAATCAGGAAGAAATGGTGGCATTCAACTTGCTAGTGGTAAATCA 50 | TTGTTGCTATCGATCATTAAAGTAAAAAAAGAAGAAAAAGAAAGCTATATAAAGGCGCTG 51 | ACAAATTCTTTTGACTTAGAGCATACATTCATTCAAGAGACTTTAAACAAGCTAGCAGAA 52 | CGCCCTAAAACGGACACACAACTCGATTTGTTTAGCTATGATACAGGCTGA -------------------------------------------------------------------------------- /tiptoft/tests/data/fastq/query.fastq: -------------------------------------------------------------------------------- 1 | @read1 2 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 3 | + 4 | 
??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>;=?>?>?==>?=?8 5 | @read2 6 | TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 7 | + 8 | %(%%%%%%%%,%%)%%69;::;7;1;<7<,(*15:<81<<.,.<::<<8576=;;::;7;1;<7<,(*15:<81<<.,.<::<<8576=;6 9 | @read3 10 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 11 | + 12 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>; 13 | @read4 14 | ATATGAAATCTGTCTTTGGGCAGATAAAGGTTTGTTTGGGTTATAAGNNANGTC 15 | + 16 | BBBABCBCBBBB=BBBABA?BABBBBBB@@;BBB@B@@3:>@BABA8%%8%&98 17 | @read5 18 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 19 | + 20 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%% 21 | @read6 22 | AAAAAAAAAAAAATGAATATCCCTTTTGTTGTAGAAACTGTGCTTCATGACGGCTTGTTAAAGTACAAATTTAAAAATAGTAAAATTCGCTCAATCACTACCAAGCCAGGTAAAAGCAAAGGGGCTATTTTTGCGTATCGCTCAAAATCAAGCATGATTGGCGGTCGTGGTGTTGTTCTGACTTCCGAGGAAGCGATTCAAGAAAATCAAGATACATTTACACATTGGACACCCAACGTTTATCGTTATGGAACGTATGCAGACGAAAACCGTTCATACACGAAAGGACATTCTGAAAACAATTTAAGACAAATCAATACCTTCTTTATTGATTTTGATATTCACACGGCAAAAGAAACTATTTCAGCAAGCGATATTTTAACAACCGCTATTGATTTAGGTTTTATGCCTACTATGATTATCAAATCTGATAAAGGTTATCAAGCATATTTTGTTTTAGAAACGCCAGTCTATGTGACTTCAAAATCAGAATTTAAATCTGTCAAAGCAGCCAAAATAATTTCGCAAAATATCCGAGAATATTTTGGAAAGTCTTTGCCAGTTGATCTAACGTGTAATCATTTTGGTATTGCTCGCATACCAAGAACGGACAATGTAGAATTTTTTGATCCTAATTACCGTTATTCTTTCAAAGAATGGCAAGATTGGTCTTTCAAACAAACAGATAATAAGGGCTTTACTCGTTCAAGTCTAACGGTTTTAAGCGGTACAGAAGGCAAAAAACAAGTAGATGAACCCTGGTTTAATCTCTTATTGCACGAAACGAAATTTTCAGGAGAAAAGGGTTTAATAGGGCGTAATAACGTCATGTTTACCCTCTCTTTAGCCTACTTTAGTTCAGGCTATTCAATCGAAACGTGCGAATATAATATGTTTGAGTTTAATAATCGATTAGATCAACCCTTAGAAGAAAAAGAAGTAATCAAAATTGTTAGAAGTGCCTATTCAGAAAACTATCAAGGGGCTAATAGGGAATACATTACCATTCTTTGCAAAGCTTGGGTATCAAGTGATTTAACCAGTAAAGATTTATTTGTCCGTCAAGGGTGGTTTAAATTCAAGAAAAAAAGAAGCGAACGTCAACGTGTTCATTTGTCAGAATGGAAAGAAGATTTAATGGCTTATATTAGCGAAAAAAGCGATGTATACAAGCCTTATTTAGTGACGACCAAAAAAGAGATTAGAGAAGTGCTAGGCATTCCTGAACGGACATTAGATAAATTGCTGAAGGTA
CTGAAGGCGAATCAGGAAATTTTCTTTAAGATTAAACCAGGAAGAAATGGTGGCATTCAACTTGCTAGTGTTAAATCATTGTTGCTATCGATCATTAAAGTAAAAAAAGAAGAAAAAGAAAGCTATATAAAGGCGCTGACAAATTCTTTTGACTTAGAGCATACATTCATTCAAGAGACTTTAAACAAGCTAGCAGAACGCCCTAAAACGGACACACAACTCGATTTGTTTAGCTATGATACAGGCTGAAAAAAAAAAAAA 23 | + 24 | AAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAAAA 25 | @read7 26 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 27 | + 28 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>;=?>?>?= 29 | @read8 30 | 
TGCAGCAATCACTGGTGCTGTTTCTGGTGATACCTATGTCAAATTGGANCACGG 31 | + 32 | ?>>>?>??????=??>?@???>??>=>=>?????=??>==;>>=9-18%7><28 33 | @read9 34 | NCAAGCCATGGTTGGAAGTTAAAGACAATTTCATTGATACCAGCGTAT 35 | + 36 | %0<78<<1<::<644;9==9<<:9<8<:=;99;<<<7=;6796=<<<; 37 | -------------------------------------------------------------------------------- /tiptoft/tests/data/fastq/query_gz.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/andrewjpage/tiptoft/e71b002bef09f97d6881fbc69bedc7ad65fe6b48/tiptoft/tests/data/fastq/query_gz.fastq.gz -------------------------------------------------------------------------------- /tiptoft/tests/data/fastq/reverse.fastq: -------------------------------------------------------------------------------- 1 | @read1 2 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 3 | + 4 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>;=?>?>?==>?=?8 5 | @read2 6 | TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 7 | + 8 | %(%%%%%%%%,%%)%%69;::;7;1;<7<,(*15:<81<<.,.<::<<8576=;;::;7;1;<7<,(*15:<81<<.,.<::<<8576=;6 9 | @read3 10 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 11 | + 12 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>; 13 | @read4 14 | GACNTNNCTTATAACCCAAACAAACCTTTATCTGCCCAAAGACAGATTTCATAT 15 | + 16 | BBBABCBCBBBB=BBBABA?BABBBBBB@@;BBB@B@@3:>@BABA8%%8%&98 17 | @read5 18 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 19 | + 20 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%% 21 | @read6 22 | 
TTTTTTTTTTTTTCAGCCTGTATCATAGCTAAACAAATCGAGTTGTGTGTCCGTTTTAGGGCGTTCTGCTAGCTTGTTTAAAGTCTCTTGAATGAATGTATGCTCTAAGTCAAAAGAATTTGTCAGCGCCTTTATATAGCTTTCTTTTTCTTCTTTTTTTACTTTAATGATCGATAGCAACAATGATTTAACACTAGCAAGTTGAATGCCACCATTTCTTCCTGGTTTAATCTTAAAGAAAATTTCCTGATTCGCCTTCAGTACCTTCAGCAATTTATCTAATGTCCGTTCAGGAATGCCTAGCACTTCTCTAATCTCTTTTTTGGTCGTCACTAAATAAGGCTTGTATACATCGCTTTTTTCGCTAATATAAGCCATTAAATCTTCTTTCCATTCTGACAAATGAACACGTTGACGTTCGCTTCTTTTTTTCTTGAATTTAAACCACCCTTGACGGACAAATAAATCTTTACTGGTTAAATCACTTGATACCCAAGCTTTGCAAAGAATGGTAATGTATTCCCTATTAGCCCCTTGATAGTTTTCTGAATAGGCACTTCTAACAATTTTGATTACTTCTTTTTCTTCTAAGGGTTGATCTAATCGATTATTAAACTCAAACATATTATATTCGCACGTTTCGATTGAATAGCCTGAACTAAAGTAGGCTAAAGAGAGGGTAAACATGACGTTATTACGCCCTATTAAACCCTTTTCTCCTGAAAATTTCGTTTCGTGCAATAAGAGATTAAACCAGGGTTCATCTACTTGTTTTTTGCCTTCTGTACCGCTTAAAACCGTTAGACTTGAACGAGTAAAGCCCTTATTATCTGTTTGTTTGAAAGACCAATCTTGCCATTCTTTGAAAGAATAACGGTAATTAGGATCAAAAAATTCTACATTGTCCGTTCTTGGTATGCGAGCAATACCAAAATGATTACACGTTAGATCAACTGGCAAAGACTTTCCAAAATATTCTCGGATATTTTGCGAAATTATTTTGGCTGCTTTGACAGATTTAAATTCTGATTTTGAAGTCACATAGACTGGCGTTTCTAAAACAAAATATGCTTGATAACCTTTATCAGATTTGATAATCATAGTAGGCATAAAACCTAAATCAATAGCGGTTGTTAAAATATCGCTTGCTGAAATAGTTTCTTTTGCCGTGTGAATATCAAAATCAATAAAGAAGGTATTGATTTGTCTTAAATTGTTTTCAGAATGTCCTTTCGTGTATGAACGGTTTTCGTCTGCATACGTTCCATAACGATAAACGTTGGGTGTCCAATGTGTAAATGTATCTTGATTTTCTTGAATCGCTTCCTCGGAAGTCAGAACAACACCACGACCGCCAATCATGCTTGATTTTGAGCGATACGCAAAAATAGCCCCTTTGCTTTTACCTGGCTTGGTAGTGATTGAGCGAATTTTACTATTTTTAAATTTGTACTTTAACAAGCCGTCATGAAGCACAGTTTCTACAACAAAAGGGATATTCATTTTTTTTTTTTT 23 | + 24 | 
AAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBAAAAAAAAAAAA 25 | @read7 26 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 27 | + 28 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>;=?>?>?= 29 | @read8 30 | TGCAGCAATCACTGGTGCTGTTTCTGGTGATACCTATGTCAAATTGGANCACGG 31 | + 32 | ?>>>?>??????=??>?@???>??>=>=>?????=??>==;>>=9-18%7><28 33 | @read9 34 | NCAAGCCATGGTTGGAAGTTAAAGACAATTTCATTGATACCAGCGTAT 35 | + 36 | %0<78<<1<::<644;9==9<<:9<8<:=;99;<<<7=;6796=<<<; 37 | 
-------------------------------------------------------------------------------- /tiptoft/tests/data/fastq/sample1.fa: -------------------------------------------------------------------------------- 1 | >gene1 2 | AAAAAAAAAAAAA 3 | >gene2 4 | AAAAAAAA 5 | >gene3 6 | GGGGGGGGGGGGGG 7 | -------------------------------------------------------------------------------- /tiptoft/tests/data/read/sample.fastq: -------------------------------------------------------------------------------- 1 | @read1 2 | AAAAAAAAAAAAGGGGGGGGGGGGGGAAAAAA 3 | + 4 | 77777777777788888888888888777777 5 | @read2 6 | TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 7 | + 8 | %(%%%%%%%%,%%)%%69;::;7;1;<7<,(*15:<81<<.,.<::<<8576=;;::;7;1;<7<,(*15:<81<<.,.<::<<8576=;6 9 | @read3 10 | AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 11 | + 12 | ??>==>?=?=>><=>?>;=?>?>?<>8**=>=6%88%%(%,%%%%%%%%62?>==>?=?=>><=>?>; 13 | --------------------------------------------------------------------------------