├── .gitignore ├── LICENSE ├── README.md ├── bpipe.config ├── data └── inc_seq_test_read.fa ├── inc-seq.py ├── pipeline.bpipe └── utils ├── PBDAGCON.LICENSE ├── __init__.py ├── aligners.py ├── blastn2bed.py ├── blastn2blasr.py ├── blastn_wrapper.sh ├── blosum80.mat ├── buildConsensus.py ├── filter_best_match.py ├── findUnit.py ├── graphmap ├── pbdagcon ├── poa ├── sam2blasr.py └── sam2blasr2.py /.gitignore: -------------------------------------------------------------------------------- 1 | # sessions 2 | session.* 3 | 4 | auto-save-list/ 5 | *.dat 6 | 7 | # backups 8 | *pyc 9 | *~ 10 | 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The INC-Seq pipeline is licensed under the MIT License (see below). 2 | Licenses for third party software that is part of the source: 3 | - pbdagcon (see utils/PBDAGCON.LICENSE) 4 | - poa (GNU General Public License version 2.0 (GPLv2)) 5 | ---------------------------------------------------------------------- 6 | The MIT License (MIT) 7 | Copyright (c) 2016 Genome Institute of Singapore 8 | Permission is hereby granted, free of charge, to any person obtaining a copy 9 | of this software and associated documentation files (the "Software"), to deal 10 | in the Software without restriction, including without limitation the rights 11 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 12 | copies of the Software, and to permit persons to whom the Software is 13 | furnished to do so, subject to the following conditions: 14 | The above copyright notice and this permission notice shall be included in 15 | all copies or substantial portions of the Software. 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 19 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 22 | THE SOFTWARE. 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | INC-Seq: Accurate single molecule reads using nanopore sequencing 2 | ====== 3 | Description: 4 | ------ 5 | This repository contains the code for analyzing INC-Seq data (http://biorxiv.org/content/early/2016/01/27/038042). The full datasets have been deposited in ENA (http://www.ebi.ac.uk/ena/data/view/PRJEB12294). 6 | 7 | Note: 8 | ------- 9 | The PBDAGCON binary bundled with this pipeline was compiled on Ubuntu 16.04. If you run into issues during consensus building, please recompile PBDAGCON.
On Debian systems, this can be done by running: 10 | 11 | ```sh 12 | rm -i utils/pbdagcon 13 | sudo apt install pbdagcon 14 | ln -s `which pbdagcon` utils/ 15 | ``` 16 | 17 | Requirements: 18 | -------------- 19 | - Python 2.7 20 | - Biopython 1.65 21 | - BLAST 2.2.28+ 22 | 23 | Usage: 24 | -------------- 25 | ``` 26 | usage: inc-seq.py [-h] -i INFASTA [-o OUTFILE] [-a ALIGNER] [-m MINRL] 27 | [--anchor_seg_step ANCHOR_SEG_STEP] 28 | [--anchor_length ANCHOR_LEN] [--anchor_cov ANCHOR_COV] 29 | [--anchor_seq ANCHOR_SEQ] [--iterative] [--seg_cov SEG_COV] 30 | [--copy_num_thre COPY_NUM_THRE] 31 | [--length_difference_threshold LEN_DIFF_THRE] 32 | 33 | The INC-Seq pipeline 34 | 35 | optional arguments: 36 | -h, --help show this help message and exit 37 | -i INFASTA, --input INFASTA 38 | Input file in fasta format 39 | -o OUTFILE, --outfile OUTFILE 40 | Output file 41 | -a ALIGNER, --aligner ALIGNER 42 | The aligner used (blastn, graphmap, poa) [Default: 43 | blastn] 44 | -m MINRL, --minReadLength MINRL 45 | The reads shorter than this will be discarded 46 | [Default:2000] 47 | --anchor_seg_step ANCHOR_SEG_STEP 48 | Step of sliding window used as anchors [Default: 500] 49 | (eg. -s 500 : start at 0, 500, 1000, ...) 50 | --anchor_length ANCHOR_LEN 51 | The length of the anchor, should be smaller than the 52 | unit length [Default: 500] 53 | --anchor_cov ANCHOR_COV 54 | Anchor coverage required [Default: 0.8] 55 | --anchor_seq ANCHOR_SEQ 56 | A single file containing the sequences used as the 57 | anchor [Default: Use subsequences as anchors] 58 | --iterative Iteratively run pbdagcon on consensus [Default: False] 59 | --seg_cov SEG_COV Segment coverage required [Default: 0.8] 60 | --copy_num_thre COPY_NUM_THRE 61 | Minimal copy number required [Default: 6] 62 | --length_difference_threshold LEN_DIFF_THRE 63 | Segment length deviation from the median to be 64 | considered as concordant [Default: 0.05] 65 | ``` 66 | Examples: 67 | -------------- 68 | * Basic usage 69 | ``` 70 | ./inc-seq.py -i data/inc_seq_test_read.fa -o consensus.fa 71 | ``` 72 | * Use graphmap as segment aligner 73 | ``` 74 | ./inc-seq.py -i data/inc_seq_test_read.fa -o consensus.fa -a graphmap 75 | ``` 76 | * Use bpipe pipeline for pseudo-parallel computing 77 | * Split the reads into multiple files (300 reads per file) and run INC-Seq (4 instances) in parallel. 
78 | ``` 79 | bpipe run -p READ_NUM=300 -n 4 pipeline.bpipe a_lot_of_incseq_reads.fa 80 | ``` 81 | -------------------------------------------------------------------------------- /bpipe.config: -------------------------------------------------------------------------------- 1 | executor="sge" 2 | walltime="4:00:00" 3 | procs='OpenMP 2' 4 | 5 | sge_request_options="-l mem_free=1G" 6 | 7 | commands { 8 | 'main' { 9 | sge_request_options="-l mem_free=50G" 10 | walltime="24:00:00" 11 | } 12 | } 13 | -------------------------------------------------------------------------------- /data/inc_seq_test_read.fa: -------------------------------------------------------------------------------- 1 | >ddfdd3f2-c50b-4843-b1f2-c1669785858a_Basecall_2D_2d GISNB474_default_sample_id_4351_1_ch325_file86_strand_twodirections RCA 3 bacteria/4th run RCA/2D basecalls 191115 4/pass/GISNB474_default_sample_id_4351_1_ch325_file86_strand.fast5 2 | GTTTCATTTATTTCCCAGTACTGGACTAACTTCACTGCTACAACCTGATTCCCACTCCCTCCCTGGCGCACTCAATGGTAGAGACATGTACTCCAAATGGTCATGTCCGGTTAACTCGCCACATAGCGTCGATCAGACTTTTAACCGCCTCGCTCTACGGTCAATAAATCCGGACACGGAGCGTCACGATTAGAAAGTAAGCGTCTTGCTCGAAGGCTGTTGTGTCGCGCGTCAAGATAACTTTAAGTTGTCCAAAGTCTCCATTCTAACCCCTACCTCTGCGGATAACAGCTACGATCGAAAACTCGCGGATGACGGGGCCGCTCGGACCTCCAGGCCATGCACGGATTCCTCGATTGTGTCGCAGCCTCCCGAGGAATAAACTACCCGACGACAACCCGCAGCTAGGGATCTGCGCTGAAGCAAATTCCCGTTGCTTTAGGATCGGCACCCGGGCTTAAGATTGTGAATATTCCGCGTTGCTTCTAAATTTTGAACCACATGCTTGACTGCTCTACCCCCAATTCCTAGAGCATCTTCGTCCGTATTCCCAGTGCTAGCTTAAGCGAGCTCTAAGTGCCGCGAGAAGGACCCAACAGGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGGTTTGCTCCCACGCTTTCCAGCCTCAGTCCAATTACAGACCAGAGAGCCGCTTTTCACAGCGAGTTCCTCGATATAGCATTTCACGCTACCGAAGGCATGCCGCTCCCCTTCTGCACTCAAGTTAACAGTTTCTAAAGCGTACTATGGTTAAGCCACAGCCTATTTAATTTCAGACTTATCTAACGCCTGCGCTCGCTTTACGCTATACGTACCAACACTGGACACGCTCGGGACCTACGTATTACCGCGGCTGCTGAAGATTAGTTAGCGTTGCTTTTCTGGTAAGATATACCGTCACAGGCTTAAACTTTCCACTCTCGCACTCGTTTTCTCTCTTACAACAAGAGCTTTACGGATCCGAAAATACCTTCTTCACTCACGGGGTTTGCTCGTCAGATCGCCGTCCATTGCCGAAGATTCCCTACTGCAGCCTCCCGTCAGAAGATAAAGCAGCTGACGACAACCATGCACCACCTGTCACCGTGGCCCGAAGCAAATCCTATCTCCCCTAACGCGCTACGGGGTTTACAAGACCTGGTAAGAGGTTCTTCGCGTTGCTTCGCCGAATTAAACCACAAGTGCTCCACCGCTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCAGAGTGCTTAATGCGTTAGCTGCGGCACTGAGTCTCTCCGGAAGGACCCAACACCTAGCACTCATCGTTTACGGCGTGAGCCCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTCGAGCCCCCAGCGTCAGAGTTACAGACCAGAGAGCCCGCTTTCACAAGGGGGTCTTCCTCTACATATCTACGCATTTCACCGCTACACATGGAATGCCCTACTCCCCTTCTGCCGGACCAGTGAACAGATTTCCAAAGCCATTAGCTGGTTAACTCGCCACAGCCTTTAACTTCAGACTTATCTAACCGCCTGCTCGCTTTTACGCCCAAATAAATGCGGACACGCTGGGGACCTACGTATTACCCCTATAATCATCGAACCTGTATGGCCCCCGTCCCTTTCTGGTAAGTACCGTCACAGGCTTAAACTTTCCACTCTCACACTCGTTCTTCTCTTACAACAGAGCATAGTTACGATCCGAAAACTCGTCTTCACTCACGCGGGGTTGCTCGGTCAGACTTCCCGTGAGTTGCCGAAGATTCCCTACTGCAGCCTCCCGTAGAAAGCTACACGAGCTGACAAGGCTGCACCATAATTAATCTTCGAAGAGGCAAATCCTATCTCTAGGACGGGCACGGGGTCCACATTGGGTAAGGTTCTTCGCGTTGCTTCGAAGCTTAAACCACATGCTCCACGGCTTGTGCGGGCCCCCGTCAATTCCTTTGGGTTTCAACCTTGCGAGTGCTCCCCAGGCCCAGGTGCTTAATCCCGTTAGCTGCACTGAGTCCTAGAAAGGACCCAACACCTAGCACTCATGGTTTTACGGCGTGGACTACCAGGGTATCAATGCCTGTTTGCTCCCCACGGTTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGGAGCCGCTGTTCTCAGGGTGTTCCTCCATATCTACGCATTTCACCGCTACACATGGGAATTCCACTCCCTTCTGCACTCAAGTTAAACAGTTTCCAAAGGGATTATGATTGCTACACAGCCTTAACTTCACGCTATCTTAAAAGCCGGGCCTCGCTTTACGCCCAAGCTAAAATCCGGACAACGCTCGCCAACCTACGTATTACCGCGGCTTGCTGGCACGTAGTTAGCCGTCCTTTCTGGTAAGATACCGTCACAGTGACAACTTTCCACTCTCAATAGTTCTCTGCTACAACAGAGCTTTACGATCTCGAAAACCTTGCTTCACTCACGCGGCGTTGCTCGGTCAGACTTCCGTCCACTTGCCGAAAGTTCCCTACTGCAGCCTCCCGTAGAAGTAACACGAGCTGACAACCATGCACCACCTGTCACCGG
TGCCGAAGCAAATCCTATCTCTAGGACGGGGCACTCAGGGTTTACATACTAGTAATTACAGTTGCTTCGCGACTTAAAGTGGCAGGATTGCTCCACCGGGTCCCTGGGCCCCCAATTCCTTTTCGCTTTTCAACCTTGCGTGGTCGTACTCCCCGGCTTGGCTTAATGCGTTAGCTGCGGCACTGAGTCCCAGGGACCGACCCACAGCCACTGTCGACGCCGTACCCCACGATATTAGAGATCTAATGTTTGCTCCCCACGCTTTCCGACCTCAGCGTCAGTTACAGACAAGAGCCGCTTCGACAGCGGTGTTCCTCCATATCTACGCAAGTTCACCGCTACACATGGAATTCCACTCTCCCCTTGCGCACTCAAGTTAAACAGTTCCAAAGCGTACTGGTTACGCCAAGGCCTTTGACTTCAGACTTATCTAACCGCCTGCCCTCGCTTACGCCCAATAATCCGGACAACGCTCGCCACATCACGTACTAAGGGCTGCTGGCACACGTAGGTCGCCGTCCTTTTCTGGTAAGATACCGTCTAATGTGAACTTTCCACTCTCACACTCGTTCTTTCTCTGCTACAACAGAGCTTTATAATGGAAAACCTTTCTTGTCACTCACGCGGCGTTGCTCGGTCAGACTTCTATTGCCGTGATTCCTTATTCGCAGGCCTCCTTTAGACATGAAAATAGGGTTACGAGCCGAAAGGTGACAAGGCTGCACCACCTGTCACCGTGTGCCGAAGCAAAATCCTATCTCTAGGACGGGCACCGGGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTGCCGAAATTAAACCACATGCTCCACCGTGCCGGCCCCCGTCAATTCCTTGCGTTCAACCTTGCCGTACTCCCCAGGCGGAGTGCTGATGCGTTAGCTGCGGCGACTGAGTCCCGGAAAGGACCCAACACCTAGCATCGCCGGTAGTTTACGGCGTGACTACCAGGGTATCTAATCCTGTTTGCTCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCAATCGTTCCTCCATATCTACGAATTTCACGCTACATGGGAATTCCACTCTCCCTTCTGCACTCAAGTTAAACAGATTTCCAAAAGGCGTACTATGGTTAAGCAGCGCGTCTAACTTCCGACTTATCTGAACGCCTGCGCTCGCTTTACGCCCAGACTGAATCCGGACAACGCTACGGGACCTACGTATTACGCGGCTGCTGGATCAGGTGTCGCCGTCCCTTTCGGTAAGATACCGTCACAGTGTAAACTTTCCACTCTCACACTCGTTGGTCTCTGCCCCAACAGCTTTACGATCCCGAACATGGTCTTCTACTCACGCGGGGTTTGCTCGCGTTAAACTTCGCGTCCATTGCCGAAGATTCCCTACTGCAGCCTCCCGTAGAAAGTAACACGAGCTGACGACAACCATGCACCACTGTCACCGCTGTATCCGAAGCAAATCCTATCTCTAAGACGGGCAAGAGCGATGTCAATAACTCTAAGGTTCTTCGCGTTTGCTTCGCACCGTAAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTCCCTTGCCATTTTAACCTTCGTCGTACTCCCCAGGCGGAGGTGCTTAATGCGTTAGCTGCGGCACTGAGTTAGCGAAGAGGACCCAACACCTAGCACTCATGGTTTACGGCGTGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGTGCGAGCCTCACCAGCGTCAGTTACAGACCAGAGCCGCTTTCGCCACCCGGTGTTCCTCCATATATCTACGCATTTCACGCGCCACATGGAAATTCCACTCTCCCTTCTGCACTCAAGTTAAACGATTTCCCAAGCGTACTATATGTAAGCCAAGGCCTTTAACTTGCCAGACCTTATCTAACTCGCTCGCTGGCTTTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGTATTACCGCGGCTGCTGGCCACGTGAGTTAGCCGTCCCTTTCTGGTAAGATACCGTCACAGTGAACTTTCCACTCACACTCGTTCTTCTCTGTACAACAGCTTTACGATCTCGAAAACCTTCTTCACTCACGCTTCTAGTTGCTGATAATTCCGTCCATTGCCGAAGATTCCCACTGCCGAGCCTCCCCCGTAGAAGAGTAACACGAGCTGACGACAACTCATGCACCACCTGGTTACCGGTGCTCAGAAGCAAATCCTATCTCTAGGACGGGCACTCGGAGTGTCAAGACCTGGGCTAGGTTCTTCGCGTGTTGCTTCGGAAACCTAAACCAAGCTCCACCGTTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTTGCCGGTCGTACTCCCCAGGCGCTTAATCGATGCGTTAGCTGCGGCACTGAGCGGTGAGAAGGACCCAAACCTAGCACTCATCGTTTACGGCGTGACACGAGGGTACCTAATCCTGTTTGCTCCCCACGCTTTCGAGCCTCAGCGTCAGAGTTACAGACCAGAGGCCAGTTCGCTGTCTCCCATATGGACGAATTTCACCGGCCCACATGGAATTCCACTCTCCCCTTCTGCACTCAAGTTAAACAGTTTCCAAAGCGTACTATGGTTACTGGCCCACAGCGCTTTAATGCCACGGTATCAAAACCGCCTGCGCTCGCTTTGTACGCGGACCAAATCCGGACAACGCTCGGGACCTACGTATTACCGCGGCTGCTGGCACGTAAGTGTTAGCCGTCCCTTTCTGGTAAGATACCGTCACAGTGAACTTTCCACTCACACTCGTTCTTCTGATCAACAGAGCCAGTTGGACGATCCGAAAACCTTCTTCACTCACGCGGCGTGCTCGGTCAATTCCGTCCATTGCCGAGATTCCCTACTGCAGCCTCCCGTAAGAGTAACACGAGCTGACGACAACCATGCACCACCTGTCACCGGTGCTGAAGCAAAATCCTATCTCTAAGGGCAGCGTATTTAGCAAGACCTGGTAAGGTTCTTCGCGTTGCTTGCGGAATTAAACCACATGCTCCACGCTGTGCGGGCCCCCGTCAATTCCTTTAGGTTTCAACCTTGCGTCGTACTCCCCAGCGAGGGTGCTGCTGCGTTAGCTGCGGCATGAGTGCGAAAGGACCCAACAGCGCACAGCCGGGTTTACGGGGTCAAACTACCAGGGTATCTAATCCTGGTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCATGTTTCCTCCATATATCTAAGTTCGCCAAGGCCCACATGGAATTCCACTCTCCCTTCTCGACTAATGCATAACGATTTCCAAAGCGTACTAGCGGGTTAAGCCACAGCCTTTAACTTCACAACTTATCTAACCGCCTGCGCTCGCCGTGACGCCCAAGATAAATCCGGACACGCTCGGGACCTACGTATTACCGCGGCTGCTGAAGTAGTTAGCCGTCCCTTTCTGGTAAGATACCGTCACAGTGAACTTTCCACTCTCAAAGCCGTTCTTCTGGTGAGCAACAGAGG
CTTTACGATCCGAAAACCTTCTTCACTCACGCGGCGTTGCTCGGTCACCACGTCAGTTTGCCGTACGAGATTCCCCTACTGCAGCCTCCCCGTAGAAATAACACGAGCTTAAACGCTGCACGTGTCACCCTGTGCCCGAAGCAAAATCCTATCTCTATAGGACGGGCACGGGGTTGTCAAGACCTGGATTCGCTTTGCTGCGGACTTAAACCACATGCTCCACCGCTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCACAACCTTGCGTTGGTACTCCCCAGCGGAGTGCTTAATGCGTTAGCTGCGGCACTGAGTCCCGAAGGACCCGACCGCGCACTCATCGTTACGGCGTGGACTACCAGGGTACCCTAATCCTGTTTGCTCCCCACGGTTTCGAGCCTCAGCGTCAGTTACCAACCAGAGCCGGTTCGCACCGCGTCCTCCATATCGCAGTTTACCGCTACACATGGGGAATTCCACTCTCCCCTTCTGGCACTCAAGTTAAACAGTTTCCACGGGGTCACTGGTTAAGCCACAGCATTTGGGTTCAGACTTATCGGACCGCCTGCGCTCGGTTACGCCAATAAATCCGGACAACGCTCGCCACTACGTATTACCGCGGCTGCTCTCCAATGAGTGCGCCGTCTCGCTTTCTCCAATACCGTCAGTGGAATTTCCACTCGCACTCGTTCTTCTCTTACAACAGAGCTTACGATCTGAAAACCTTCTTCGCCCGGCGTTGCTCGTCAGACTTCCGTCCATGTCAGAAGATTCCCTAACTGCAGCCTCCCGTAGAAATAACCCAGCTGACGACAACCATGCACCACCTTACCGGTGTGCCCGAAGCAAATCCGCTCTCTAGGACTGGCGCAGGGTTTAAGACCTGGTAAGGTTCTTCGCGTTGCTTCAGAATTAAACCACATGCTCCACCGCTTGTGCGCTCCCCGTCAATGCTTTGGATTTCACATGGTTGCGGTCGTACTCCCCGGCGGAGTGCTTAATGCGTTAGCTGCGGCACACCTGAGTCCCGGAAGGACCCAAAGGGGCCACAGCGCTCGTTGGACGGCGTGGACTACCAGGTACCGTAATGTTTGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGGACCGCTTTCGCCACATGTTCCTCCATATCTACGCATTTCACCGCTACACATGGGAAATTCCACTCTCCCCTTGCGGCCAATCTCAAATAAAGATTTCCGAAGGATCTATGGTTAAGCCACAGCCTTTAACTTCAGACCTTATCTAAAAATGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTCGGAGACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTCCCTTTCTGGTAAGATACCGTCACATCTAACTTTCCACTCTCACTCGTTCTTGCTCTCTTACAACAGAGCTTGATTGAATCCGAAAACCTTCTTCACTCACGCGGCGTTAGCGTTGGGTCAGACTTCCGTCCATTGCCGAAGATTCCCCACTGCAGCCTCCCGTACGAGTAACGCAAATGACAACCATGCACCACCTGTCACCCTGTACGAAGCAAAATCCTATCTCTAGGACGGCGGGGCAGGGGTTTTAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGGAATCACTTACCGGCCTCCACCGCTGTGCGGGCCCCCGTCACATTCTCCTTTGAGTTTCAACCTTGCGGGTCGTACTCCCCAGCGGAGTGCTTAATGCGTCTTACTGCGGCACTGAGATCTTGGAAGGACCCAACACCTAGCACTCATCGTTTACGCGTGGACTACCAGGGTCCTAATCCTGGGTTTGCTCCCCACGCTTTTCGAGCCTCAGCGTCAGTCTACAGACCAGAGAGCCGCTTTTCGCCACTAGTGTTTCCTCCATATCTACGCATTTCAAGCTACATGGAATTCCACTCTCCCCCTTCTGCACTCAAGTTAAACAGTTTGCAAAGCGTACCTGGTTAAGCCACAGCCTTTAACTTCAGACTTATCTAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGCCCTCAAGGCCATCACGTACCGCGGCTGCTGGCACGTAGTGCTCGTCCCTTTCTGGTAAGATACCGTCACATTAGAACTTTTCCACTCTCACTCGTTTATCTCTTACAACAGAGGCTTTACGATCCGAAAACCTTCTTCACTCACGCGGGATTGCTCGTCAGACTTGCGTCCATTGCCGAAGATTCTCCACTGCAGCCCCCGTAGAAATAACACGGCTGACGACAAGCAACCACCTGTCACCGGTGTGCGAAGCAAAATCCTATCTCTAGGACGGCACGGGGGTTGTCAAGCTGGTAAGGTTCTTGCGTTGCTTCGAAAGTCCAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTTCCTGTTGAGTTCAACCTTGCGGGTCGTACTCCCCAGGCGGAGTGCATGCGTTAGCTGCGGCACTGAGTTACCCCGCGGAAAGGACCCCAACACCTAGCACGCCTGTTTACACGGGTGTGGACTACCAGGGTATCTAATGGGTTTGCTCCCTACGGGTCTTCCAGCCTCAGCGTCAGTTACAGCCAACCGCTTTCGCCACCGGTGTTCCTCCATATATCTACGCAGTCTCACCGCTACACATGGAATTCCACTCTCCCCTTCTGCACTCAAGAGTTAAACAGATTTCCAAAAGCGTACTATGGTTAAGCCACAGCCTTTAACTTCAAGTATGGTAAACATGCCTGCTCGGTTGTACGCCCAATAATCCGGACACGCTCGGAGACCTACGTATTACCGCGGCCTGCTGGCACGTAGTGGCCGTCCCTTTCTGGACCATACATGTCACAGTGTGAATTTCCACTCTCACACTCGTTCTTCTCTTGAACAAGAGCTTTACGATCTAGAAACCTTTCTTCACTCACGCGGCGTGCTCGGTTACGACTTCCGTCCATTGCCGAAAGATTCCCTAACTGCAGCCTCCCGTGAAAGTAACACGAGCTGACGACAACCATGCACTATTTAGCAATGTGGCCGAAGCAAATCCTATCTCTAGGACGGGCACCGGATTAGGACCTGGTAAGGTTCTTCGCGTTGCTTCGAAACCTTAAACCACATGCTCCACCCTGTGCGCGGCCCCCGGCGTCAATTCCTTTGAGTTTTCAACCCTTGCGTGTCTGTACTCCCCAGGCGCTCTGCTTAATCCGTGCTGCGGCACTCAGTGCGCGGAAAGGACCCAACACCTAGCACTCACTGTTTACGGCGTGGACTACAGGGTATCTAATCCTGTGCTCCCCACGCTTTCGGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGTCGCTAGCCCGGTGTTCTCCCCCGCTATCTACGCAGTTTCACCGCTACCGCTGGAATTCCACTCTCCCCTTCTGCACTCAAGTTAAACGATTTCCAAAAGCGTACTATGGTTAAGCCACAGCCTTTAACTTCAGACCTTATCTAACGCCTGCTCGACCACGTCGCGCGGACAATCGCTCGGGACCTACGTA
TTACCGCGGCTGCTGGCGGGCAATGTCGCCGTCCCTTTCTGGTAAGATAAGTTAGGGACTTAATTTTCCACTCACACTCGTTCTTCTCTTACAACAGCTTTACGATCCGAAAACCTCTTATTCGACGCGGGGTTCGCTCGGTCAGACTTCCGTCCATTGCCGAAGATTCCCTACTGCAGCCTCCGTACGAGTAAACCCCATGACGACAACCATGCGCACTATTGTCACCCTTGAAGCAAATCCGTTTCTAGACGGGCACTCGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGCAAATTAAACCACATGCTCCACCGCTTGTGCGTGCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGCGAGTGCTGGATGCGTGCTGCGGCAAGGCCGATCAAGAGAGCGGCGACCTAAAGCTCGTTTGAGCGGACTACCAGGGGTATCCAATCCTGTTTGCTCCACGTTCGCGCGCCAGCGTTGAGTTACAGACCAGAGAGCCGCTTTCGCCACTAGTGTCTCCTCCATATCTACGCATTTCACGCTACACATGGGAATCCCACTCTCCCCCTTCTCCCTCAATTAGCAACAGGACTTTCCAAAGCGTACTGTTACTCGCCAAATTAACTTCAGACCTTATCTAACCGCCTGCTACGCTTTACGCCCAATAAATCCGGACACGCTCGGGACCTACGTATTACCGCGGCTGCTGAAGTAGTTAGCGTCGCGCTTTCTGGTCAAGATACCGTCACAGTGGACTTTCCACTCTCACACTCGTTCTTCTCTTACAACAAGAGCTTTACGATCCCGAATTGACCTTCACTCACGCGGGGTGGTCCTCCGTCAGACTTCCGTACTTCAGGAAGATCCCTACTGCAGCCTCCCGTAGAAGTAACACGAGCTGACGACAACCCGGGCACCACCTGTCACTCGAGTGCCGCGAAGCAAATCATCTCGGCTCGCGACGGGGACTCAGTGTCAAGACCTGGTAAGGTTCTTCGCGTTTGCTTCGGACACCTAAAACCAAGACCTCCACCGCTGTGCGGGCCCCCGTCAATTCCTTTAGATTTCAACCTTGCGGTCGTACTCCCCAGGCGGAGCTGTAATGATGCGTTGGCTGCGGCACTGAGTCCCGAGAAGGACCCAACACCCCACTCATCGTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGAGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATCTACGCATTTCACGGCCCACATGGAATTCCACTCTCCCCTTTGGTGCACTCAAGGTTAAACGAGTTTTCCAAAGCGTACTATGGTTAAGCCACAGCCTTTAACTTCAGATGTAGCAAACGCCTGCGCTCGCTTTACCCCAATAAATCCGGACAACGCTCGGGACCTAAGTATCACCGCGGGGTCTCCAGATGCTACGTTAGCCCGTCCCTTTTCGGTAAGATACCGTCTAGAAACCTGTTCCACTCTAACCTCGTTCTTCTCTTACAACAGAGCTTTACGATCCCGAAAACCTTCTTCACTCACGCGGCGTGCTCGGTCAGACTTTCCGTCCATTGCCGAAGATTCCCCACTGCAGCCTCCCGTAGAAGTAACACGAGCTGACGACAACCATGCACCACCTGGTCACTAGTGTGCCCGAAGCAAATCCTATCTCTAGGACGGGACCGGGGCTGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCTAGAAACATTAAACCAAGTGCTCCACCGGGTTGTGCGGGCCCCCGTCAATTCCTTGGGTTCAATCTGCGGTCGTACTCCCCCGGCGCATCTGCTTAATCGCGTTAGCTCGGGCGACCCCGTGGCGGAAGGACCCACCGGCGCACTCAGGGTACCTGTCCTAACAATTAGCCCAGGGTATCCAATCCTGTTTCTACGCGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCGCTTTCACCACTAGGTCTTCCTCCATATCTACGCATTCACCGCTCACATGGAATCCACTCTCCCCTTTCTGCACTCAAGGTTAAACAGTTTCCAAAGCGTGACCTGGTTAAGCCACAGCCTTTAACTTCAGACTGTAAACCGCCTGCGCTCATTTACGCCCAAATAAAATCCGGACAACGCTGGGACCTACGTATTAAGCGGCTGCTGAAGGCCGGTAGCCGTCCTCGTTATACGATACCGTCACAGTGTGAACTTTCCACTCACACTCGTTCTTGCTCTCTTATAAGGCCTTTACGATCTAGAAAACCTTCTTCACTCACGCGGCGTTGCTCGGTCATACTTTCCGTTGGCTTGCCGAAAGATTCCCACTGCAGCCTCCCGTACGAGTAACACGAGCTGACGACAACCATGCACTATCTTTACTAATCTTGCTCGAAGCAAAATCCTATCTCTAGGACGGGCACTACGGGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGAAATCAACCCACATGCTCCACCGCTGTGCGGGCCCCCGTCACTCCTTTGTACCTTCACATTCCTCTGTCCGTACTCCCCAGGCCCAGCTTACTGCCTTGGCGTGCCTGGCACTGAGTCCCGGAAGGACCCAACACAGGGCCTCAATTCATCATGAATTAGGACTACCCCGAGCATATGGGTTTGCCTCCACGCTTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGCCGCTTTCGCCCAATAGCGGTATTCCTCCATATCTACGCAGTTTTCACTGGGACACCGGGAATTCCACTCTCCCCTTCTGCACTCAATGCTAACGATTCCAAACCATTAGCTGGTTAAGCCACAGCCTTTAACTTCAGATCGGCTCTAAATGCCTCTAGCTTTACGCCCAATAAATCCGGACAACGCTCGCCACGCTCACGTATTACGCGGCTGCTGGCGAATTAGTTAGCCGTCCTTTCTGGTAAGATACCGTCACATGTAACAATGTTCCACTCTCACACTCTAGTTCTTCTCTTACAACAAGTTGTACGATCCCAGAAAACCTTCTTCACTACGCGGCGTGCTCGTCAGACTTCCGTCCATTGCCGAAGATTCCCTACTGCAGCCTCCCGATACGAGTAAACACGAGCTGACAAGGCTGCACCACTGTCACCCCTGTGGAGGCGCTGACGACTCCCACTATCCCTGTTAGACGGGGCACTGGGCCATGTCTTTGAATTTGATTGCGGTAAGCTCCCCGCGCTGCTCCGACAGTGTGCAAACCACCGTCAATTCCTTTGGGGTTTCAACCTCTTGCGTCGTACTCCCCAGGCGGAGTGGGTGCATGCGTGCTGCGGCACTGGGGTCTCCGAAAGATGAGACACCTAGCACTCATCGTTTACGGCGTGACTACCAGGTGGGCTCGACTCCTGTTTGCTCCCCACGCTTTCTGGCGAGCCTCAGCGTCAGATTACAGACCAGAGAGCCGCTTTCGGGATGTGTCTCCTCCATATCTACGCAATTTCACCGCTACACATGGAATCCACTCTCCCTTCT
GCACTCAAGTCAAACAGTTTCCAAAGCGTACTATGGTTAAGCCACAGCCTTTGACTTCAGACTTATCTAACCGCTGCGCTCGCTTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGTATTACCGCGGCTGCCTGAAGTGCGTCTAGCCGTCCCTTTCTGGTAAGATACCGTCACAGTGTGAACTTTCCACTCTCACACTCGTTCTTGCTCTCTTACAACAGAGGCTTTACGATTAGAAAACCTTCTTCACTCACGCGGGGCTGCTCGGTCAGACTTCCGTCCGGTGCCGAGATTCCCCACTGCAGCCTCCATTGGAGACATAACACGAGCTGACGACAACCATGCACCACCTGTCACATCTTGGCCGAAGCAAAATCCTATCTCTAGACGGGCACGGGGGTTTAGCACCTGGCTAAGTTCTTCGCGTTGCTTCGACACATTCAAACCACATGCTCCACGACTGTTGTGCGGGCCCCCGTCAATTCTCCTTCGATTTCAACCTTGCGGTCGTACTCCCCGGCGGAGTGCTTAATGCCGTTAGCTGCGGCACTGAGTCCCGGAAAGGACCCAACACCTAGCACTCATGTTTGACGGCGTGATAACTACCAGGTACCTAATGCTGTTGCTCCCACGCTTTCGAGCCTCAGCGTCAGTTACCGACCAAGAGCCGCTTTCGCCACCGGTGTTCCTCCATATCTACGCATTCACCGCTACACATGGAAACTTCCACTCTCTCCCCTTCTGCACTCAAGTTAAACAGTTCCAAAGGGTCGTGCGAGCCACAGCCTTTAACTTCAGACTTATTAACGCCTATGCTCGGTTGACGCCCAATAAATCCGGACAACGCTCGGGACCTACGTATTACCGCGGCTGCTCCGGAATAGTTAGCCGTCCCTTTCTGGTAATACCGTCACAGTGTGAATATTTTCCACTCTCACACTCGTTCTTCTCTTACAACAGAGCTTGTCAGCCGGGCCGAAACCTTCTCATTCACGCGGCGTTGCCGCTTGTCAGACTTCCGTCAGTGCCGAAGATTCCCTACTGCAGCCTCCCGTAGAAGTAACACGAGCTGACGACAATAGGTCGCATACCTGTCACCGGTGTGCCGAAGCAAATCCTATCTCTAGGACGGCATAAGGGTTTACATACTGGTAAGGTTCTTCGCGTTGCTTCGGAATTAAACCACATGCTCCACCGCTGTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTGCGGTGGTCGTACTCTCCCCAGGCGGAGTGCTTAATCCGTTGGCTCCGGCACTATCCGTAGAAGGACCCAACACCTAGCATTCATACTGTTTACGGGACTACCAGGGTATCTAATCCTGTTAACCCCTAGTTCGCGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACTAGTTCCTCCATATATCTACGCGGTTTCACGGCCCACTGGAATGCCCTCCCCTTCTGCACTCAAGTTAAACAGTTTCCAAAGCGATTATGGTTAAGCCAAGCCTTTAATTGCCGACTTATCTAACCGCCTGCTCGCTTTACGCCGATAAATCCGGACAACGCTCGGGACCTACGTATTAAGAGCGTGCTCCATGAATCCGTCTGCTTTCTCCGCTCTCGTCGCCGTAAACTTCCACTCTCACATTCGTTCTCTGTACAACAGCTTACGATACCGAAAACCTTCTTGTCACTCACGGCGTTGCTCGTCAGACTTCCGTCCATTGCCGAAGATTCCCACTGCCGCCTCCTAGATGCAGTAACACGAGCTGACGACACTGGCTGCACCACCTGTCACCGTGTGCCGAAGCAAAATCCTATCTCTAGGACGGGCACCGGGATGTCAAGACCTGGGGTTCTTCGCGTTGCTTCGAATTAAACCACAGTGCTCTCACCGCTTGTGCCGGCCGCGTCTAATTCCTTTGGATTTCAACCTTGAGTCCTGTACTCCCCAGGCGCATTAGCTTAAGTGCGTTAGCTGCGGGACCCCATCTTGGAAGGACCCAACACCTAGCACTCCGGGTTTACGGCGTGGACTACAGGGTATCTAATCTCGTGTTTGCTCCCCACGTTCGAGGCCTCAGCGTCAGTCAGAACAAGAGAGGCCGGTTCCGCCACTGTTTCCTCCATATATCCACGCAGTTTCACCGCTACACATGAGAAGTTCCACCTCTCCCCTTCTGCACTCAAGTTAAACATTTCCAAAGCGTACTATGGTTAAGCCACAGCCTTTAACTGCCAACTTATCTAACCGCCTGCGCTCGCTTTACGCCCAATAGAATCCGGACAACGCTCGCTACGTACGCCAGCTAATCTGGAACCTAGTTAGCGTCCCTTTCTGGTAAGTACCGTCACAGTGTGAATATTTTCCACTCTCACTCGTTCTTCTCTTACAACAGCTTTACGATCTAGAAAACCTTCTTCACTCACGCGGGGGTTGCTCGTCAGACTTCCGTCCATGCCCGCAGATGCCCTACTGCAGCCTCCCGATCGCAACACGAGTAATCACAACCATGCATGGTTAGCGATCTTGCCCGAAGCAAATCCTATCTCTAGGACTGGCACCGGGATGTCAAGACCTGTAAGGTTCTTCGCGTTGCTTCGAATGTAAACCACAATCGCTCCACCGAGTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCGGCCCAGGGTGCTTAATGCGTTAGCTGCGGCACTGAGTCCTGGAAGGACCCAAGGCCCTACTCATGGTTTACGGCGTGGACTACCAGGGTATGGATCCTGTTTGCTCCCACGCCTTTCGAGCCTCAGCGTCAGTTGAGACCAGAGAGCCGTTGGCCCGGTGTTCCTCCATATCTACGCAGTTCACCGCTACATGGAATTCCACTCTCCCTTCTGCACTCAAGTTAAACAGTTTCCAAACGCTTATACTATGTTGAGCCACAGCCTTTAACTTCAGACTGTATGGACCGCCTGCGCTCGGTTGTACGCCCAATAAATCCGGACAACGCTCGGGACCTACGTATTACCGCGGCTGCTGAAGAGTGGTCGCCGTCCCTTTCTGGTAAGATACCGTCACGGTACAAACTTTCCACTCTCACACTCGTTCTTCTCTTACAACAGAGCTTTTACGATCGAAAAACCTTCTTTCACTCACAACGCGGCGTTGCTCGGTCAGACGTCGCCGTCCAGTGCCGAAGATTCCCTGCAAGCCTCCCGTGAAGTAAACTAGAGCTGACGACAACCATGCACCACCTGTCACCGTGGCCGAAGCAAATCCTATCTCTAGGACGGGCACCACCCGGGGTTGTTAAGACCTGGTAAGGTCTTCGCGTTGCTTCCAAAGGATCAAGCTCCCACCGCTGCTCTAGGCCT 3 | -------------------------------------------------------------------------------- /inc-seq.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """The INC-Seq pipeline 4 | """ 5 | import os 6 | import sys 7 | import argparse 8 | import subprocess 9 | 10 | from datetime import datetime 11 | from Bio import SeqIO 12 | from utils import findUnit, buildConsensus 13 | 14 | def get_tmp(program): 15 | program_time = datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f') 16 | tmp_folder = "/dev/shm/" if os.path.exists("/dev/shm") else "/tmp/" 17 | tmp_folder += program + program_time + "/" 18 | return tmp_folder 19 | 20 | def callBuildConsensus(aligner, record, aln, copy_num_thre, len_diff_thre, tmp_folder, seg_cov, iterative): 21 | if aligner == "blastn": 22 | consensus = buildConsensus.consensus_blastn(record, aln, copy_num_thre, 23 | len_diff_thre, tmp_folder, 24 | seg_cov, iterative) 25 | elif aligner == "graphmap": 26 | consensus = buildConsensus.consensus_graphmap(record, aln, copy_num_thre, 27 | len_diff_thre, tmp_folder, 28 | seg_cov, iterative) 29 | elif aligner == "poa": 30 | consensus = buildConsensus.consensus_poa(record, aln, copy_num_thre, 31 | len_diff_thre, tmp_folder) 32 | elif aligner == "marginAlign": 33 | consensus = buildConsensus.consensus_marginAlign(record, aln, copy_num_thre, 34 | len_diff_thre, tmp_folder, 35 | seg_cov, iterative) 36 | return consensus 37 | 38 | def main(arguments): 39 | #### parsing arguments 40 | parser = argparse.ArgumentParser(description=__doc__) 41 | parser.add_argument('-i', "--input", 42 | required="True", 43 | help="Input file in fasta format", 44 | dest="inFasta") 45 | parser.add_argument('-o', '--outfile', 46 | help="Output file", 47 | dest = "outFile", 48 | default=sys.stdout, type=argparse.FileType('w')) 49 | parser.add_argument("-a", "--aligner", 50 | default='blastn', 51 | dest="aligner", 52 | help="The aligner used (blastn, graphmap, poa) [Default: blastn]") 53 | parser.add_argument("-m", "--minReadLength", 54 | default=2000, 55 | dest="minRL", 56 | type=int, 57 | help="The reads shorter than this will be discarded [Default:2000]") 58 | ##find unit specific 59 | parser.add_argument("--anchor_seg_step", 60 | default=500, 61 | dest="anchor_seg_step", 62 | type=int, 63 | help="Step of sliding window used as anchors [Default: 500] (eg. 
-s 500 : start at 0, 500, 1000, ...)") 64 | parser.add_argument("--anchor_length", 65 | default=500, 66 | dest="anchor_len", 67 | type=int, 68 | help="The length of the anchor, should be smaller than the unit length [Default: 500]") 69 | parser.add_argument("--anchor_cov", 70 | default=0.8, 71 | dest="anchor_cov", 72 | type=float, 73 | help="Anchor coverage required [Default: 0.8]") 74 | parser.add_argument("--anchor_seq", 75 | dest="anchor_seq", 76 | type=str, 77 | help="A single file containing the sequences used as the anchor [Default: Use subsequences as anchors]") 78 | ##consensus building specific 79 | parser.add_argument("--iterative", 80 | action = "store_true", 81 | dest="iterative", 82 | help="Iteratively run pbdagcon on consensus [Default: False]") 83 | parser.add_argument("--segments_only", 84 | action = "store_true", 85 | dest="segments_only", 86 | help="Extract segments only without constucting consensus [Default: False]") 87 | parser.add_argument("--seg_cov", 88 | default=0.8, 89 | dest="seg_cov", 90 | type=float, 91 | help="Segment coverage required [Default: 0.8]") 92 | parser.add_argument("--copy_num_thre", 93 | dest="copy_num_thre", 94 | default = 6, 95 | type = int, 96 | help="Minimal copy number required [Default: 6]") 97 | parser.add_argument("--length_difference_threshold", 98 | dest="len_diff_thre", 99 | default = 0.05, 100 | type = float, 101 | help="Segment length deviation from the median to be considered as concordant [Default: 0.05]") 102 | 103 | args = parser.parse_args(arguments) 104 | 105 | #### parse input reads 106 | seqs = SeqIO.parse(args.inFasta, "fasta") 107 | 108 | #### create temp folder 109 | tmp_folder = get_tmp('incseq_' + args.inFasta.split("/")[-1] + '_') 110 | os.makedirs(tmp_folder) 111 | 112 | counter = 0 113 | 114 | if args.segments_only: 115 | outH = open("inc_seq.segments.fa", "w") 116 | for record in seqs: 117 | seqlen = len(record.seq) 118 | sys.stderr.write("---------- Processing read %i ----------\n" % (counter + 1)) 119 | counter += 1 120 | if seqlen < args.minRL: 121 | #### length filter 122 | sys.stderr.write("Failed to pass length filter!\n") 123 | else: 124 | #### find units 125 | if args.aligner == "blastn" or args.aligner == "graphmap" or args.aligner =="poa" or args.aligner == "marginAlign": ## FIXME graphmap implementation 126 | if args.anchor_seq: 127 | ## anchor sequence provided, run with INC-Seq2 mode 128 | aln = findUnit.find_unit_blastn(record, args.anchor_seq, tmp_folder, seqlen, 129 | args.anchor_seg_step, 130 | args.anchor_len, 131 | args.anchor_cov) 132 | else: 133 | ## use subsequences as anchors (INC-Seq mode) 134 | aln = findUnit.find_unit_blastn(record, None, tmp_folder, seqlen, 135 | args.anchor_seg_step, 136 | args.anchor_len, 137 | args.anchor_cov) 138 | 139 | #### build consensus 140 | if args.segments_only: 141 | tmp = buildConsensus.segmentize(record, aln, args.copy_num_thre, args.len_diff_thre, 142 | outH) 143 | sys.stderr.write("Consensus construction skipped, check \"inc_seq.segments.fa\" for extracted segments!\n") 144 | continue #skip consensus building 145 | consensus = callBuildConsensus(args.aligner, record, aln, args.copy_num_thre, 146 | args.len_diff_thre, tmp_folder, 147 | args.seg_cov, args.iterative) 148 | 149 | if consensus: 150 | sys.stderr.write("Consensus called\t%s\tNumber of segments\t%d\n" %(record.id, consensus[1])) 151 | args.outFile.write(consensus[0]) 152 | else: 153 | sys.stderr.write("Consensus construction failed!\n") 154 | if args.segments_only: 155 | outH.close() 156 | 157 | 
os.rmdir(tmp_folder) 158 | 159 | if __name__ == '__main__': 160 | sys.exit(main(sys.argv[1:])) 161 | -------------------------------------------------------------------------------- /pipeline.bpipe: -------------------------------------------------------------------------------- 1 | 2 | SCRIPT = "~/projects_backup/INCSeq" 3 | //ALIGNER = "graphmap" 4 | ALIGNER = "blastn" 5 | READ_NUM = 1000 6 | LINE_NUM = READ_NUM * 2 7 | 8 | set_global_var = { 9 | doc title: "Set global branch variables" 10 | branch.PREFIX = branch.name 11 | } 12 | 13 | split_fasta = { 14 | doc title: "Split the fasta files" 15 | desc: """ 16 | Split the fasta file into multiple chunks 17 | """ 18 | output.dir = "tmp_split" 19 | produce("*.fasta"){ 20 | exec "split -l ${LINE_NUM} -d -a 8 $input tmp_split/read.split_" 21 | exec """ 22 | for file in tmp_split/read.split_*; do mv "$file" "${file}.fasta"; done 23 | """ 24 | } 25 | } 26 | 27 | 28 | consensus = { 29 | doc title: "Consensus construction" 30 | output.dir = "tmp_consensus" 31 | transform (".fasta") to (".inc-seq.fa", ".inc-seq.log"){ 32 | exec "${SCRIPT}/inc-seq.py -i $input.fasta -o $output1 -a ${ALIGNER} --copy_num_thre 4 --length_difference_threshold 0.5 2> $output2", "main" 33 | } 34 | } 35 | 36 | merge_fasta = { 37 | doc title: "Merge fasta" 38 | desc:""" 39 | Merge all the consensus fasta files 40 | """ 41 | produce ("${PREFIX}.inc-seq.fasta", 42 | "${PREFIX}.inc-seq.log"){ 43 | multi "cat tmp_consensus/*.inc-seq.fa > $output1", 44 | "cat tmp_consensus/*.log > $output2" 45 | } 46 | } 47 | 48 | 49 | run { "%.fa" * [set_global_var + split_fasta + "read.split_%" * [ consensus ] + merge_fasta ]} 50 | -------------------------------------------------------------------------------- /utils/PBDAGCON.LICENSE: -------------------------------------------------------------------------------- 1 | #################################################################################$$ 2 | # Copyright (c) 2011-2016, Pacific Biosciences of California, Inc. 3 | # 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted (subject to the limitations in the 8 | # disclaimer below) provided that the following conditions are met: 9 | # 10 | # * Redistributions of source code must retain the above copyright 11 | # notice, this list of conditions and the following disclaimer. 12 | # 13 | # * Redistributions in binary form must reproduce the above 14 | # copyright notice, this list of conditions and the following 15 | # disclaimer in the documentation and/or other materials provided 16 | # with the distribution. 17 | # 18 | # * Neither the name of Pacific Biosciences nor the names of its 19 | # contributors may be used to endorse or promote products derived 20 | # from this software without specific prior written permission. 21 | # 22 | # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE 23 | # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC 24 | # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 25 | # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES 26 | # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 27 | # DISCLAIMED. 
IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS 28 | # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 29 | # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 30 | # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF 31 | # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 32 | # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 33 | # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT 34 | # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF 35 | # SUCH DAMAGE. 36 | #################################################################################$$ 37 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSB5/INC-Seq/10306359c26b78d62f0b59603bf9c6ce0d2f91d5/utils/__init__.py -------------------------------------------------------------------------------- /utils/aligners.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import sys 5 | import subprocess 6 | 7 | 8 | ################################aligners######################################## 9 | def blastn(query, ref, wordSize, blastOutFMT, num_alignments, anchor_cov, toBlasr): 10 | script_dir = os.path.dirname(os.path.realpath(__file__)) 11 | ## blastn 12 | cmd = ["blastn", "-query", query, "-task", "blastn"]# , "-evalue","0.000001"] 13 | if wordSize: 14 | cmd.extend(["-word_size", str(wordSize)]) 15 | cmd.extend(["-subject", ref, "-num_alignments", str(num_alignments)]) 16 | cmd.extend(["-outfmt", blastOutFMT]) 17 | blastout = subprocess.Popen(cmd, stdout=subprocess.PIPE) 18 | #cmd = ["sort","-k","1,1","-k","2,2n"] 19 | #sortedblastout = subprocess.Popen(cmd, stdin = blastout.stdout, stdout= subprocess.PIPE) 20 | cmd = [script_dir+"/filter_best_match.py","-i","-", "-c", str(anchor_cov)] 21 | filteredblastout = subprocess.Popen(cmd, stdin = blastout.stdout, stdout = subprocess.PIPE) 22 | if not toBlasr: 23 | stdout, stderr = filteredblastout.communicate() 24 | filteredblastout.stdout.close() 25 | 26 | else: 27 | cmd = [script_dir+"/blastn2blasr.py", "-i", "-"] 28 | blast2blasr = subprocess.Popen(cmd, stdin = filteredblastout.stdout, stdout = subprocess.PIPE) 29 | stdout, stderr = blast2blasr.communicate() 30 | blast2blasr.stdout.close() 31 | 32 | return stdout 33 | 34 | def graphmap(query, ref): 35 | script_dir = os.path.dirname(os.path.realpath(__file__)) 36 | ## run graphmap 37 | cmd = [script_dir+"/graphmap", "-t", "1", "-d", query, "-r", ref, "-v", "0", "-z", "0.000001"] 38 | graphmapout = subprocess.Popen(cmd, stdout= subprocess.PIPE) 39 | cmd = [script_dir+"/sam2blasr.py", "-i", "-", "-r", ref] 40 | sam2blasr = subprocess.Popen(cmd, stdin = graphmapout.stdout, stdout = subprocess.PIPE) 41 | stdout, stderr = sam2blasr.communicate() 42 | sam2blasr.stdout.close() 43 | 44 | return stdout 45 | 46 | 47 | def marginAlign(query, ref, tmpName): 48 | script_dir = os.path.dirname(os.path.realpath(__file__)) 49 | marginAlign_dir = "/mnt/projects/lich/dream_challenge/rollingcircle/finalized/ana_scripts/marginAlign" 50 | ## run 51 | cmd = [marginAlign_dir+"/marginAlign", query, ref, tmpName+".sam", "--jobTree", tmpName+".jobTree", "--em" ] 52 | with open(tmpName+".margin.log", 'w') as log: 53 | tmp = subprocess.check_output(" ".join(cmd), shell = True, stderr=log) 54 | cmd = 
[script_dir+"/sam2blasr.py", "-i", tmpName+".sam", "-r", ref] 55 | stdout = subprocess.check_output(" ".join(cmd), shell = True) 56 | tmp = subprocess.check_output("rm -rf %s %s" %(tmpName+".jobTree", tmpName+".sam"), shell = True) 57 | return stdout 58 | 59 | 60 | 61 | def poa(fasta, tmpName, seqHeader): 62 | script_dir = os.path.dirname(os.path.realpath(__file__)) 63 | cmd = [script_dir+"/poa", "-do_global", "-do_progressive", 64 | "-read_fasta", fasta, 65 | "-pir", "pseudo", 66 | script_dir+"/blosum80.mat", "-hb"] 67 | with open(tmpName+".poa.log", 'w') as log: 68 | poaout = subprocess.check_output(" ".join(cmd), stderr=log, shell = True) 69 | 70 | ## post processing 71 | consensus = poaout.split(">")[-1] 72 | consensus = ''.join(consensus.split("\n")[1:]) 73 | consensus = consensus.replace(".","") 74 | consensus = "\n".join([seqHeader, consensus]) + "\n" 75 | 76 | return consensus 77 | -------------------------------------------------------------------------------- /utils/blastn2bed.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Convert blastn files to bed format 4 | 5 | """ 6 | 7 | import os 8 | import sys 9 | import argparse 10 | 11 | 12 | def main(arguments): 13 | parser = argparse.ArgumentParser(description=__doc__) 14 | parser.add_argument('-i', '--infile', 15 | required = "True", 16 | dest="infile", 17 | help="Blastn output generated from customized format '6 sseqid sstart send ...'") 18 | parser.add_argument('-o', '--outfile', 19 | help="bed file name", 20 | dest='outFile', 21 | default=sys.stdout, 22 | type=argparse.FileType('w')) 23 | 24 | args = parser.parse_args(arguments) 25 | 26 | with open(args.infile) as f: 27 | for line in f: 28 | fields = line.strip().split()[0:3] 29 | name = fields[0] 30 | pos = [int (i) for i in fields[1:]] 31 | start = str(min(pos) - 1) 32 | end = str(max(pos)) 33 | args.outFile.write('\t'.join([name, start, end])+'\n') 34 | 35 | 36 | if __name__ == '__main__': 37 | sys.exit(main(sys.argv[1:])) 38 | -------------------------------------------------------------------------------- /utils/blastn2blasr.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Convert blastn output to blasr m5 format. 
4 | 5 | """ 6 | 7 | import os 8 | import sys 9 | import argparse 10 | 11 | 12 | def main(arguments): 13 | parser = argparse.ArgumentParser(description=__doc__) 14 | parser.add_argument('-i', '--infile', 15 | required = "True", 16 | dest="infile", 17 | help="Blastn output generated from filter_best_match.py") 18 | parser.add_argument('-o', '--outfile', 19 | help="blasr m5 file", 20 | dest='outFile', 21 | default=sys.stdout, 22 | type=argparse.FileType('w')) 23 | 24 | args = parser.parse_args(arguments) 25 | 26 | if args.infile == '-': 27 | h = sys.stdin 28 | else: 29 | h = open(args.infile, 'rU') 30 | 31 | blastOutFMT = 'sseqid sstart send slen qstart qend qlen evalue score length nident mismatch gaps sseq qseq qseqid'.split() 32 | blasrFMT = 'qName qLength qStart qEnd qStrand tName tLength tStart tEnd tStrand score numMatch numMismatch numIns numDel mapQV qAlignedSeq matchPattern tAlignedSeq' 33 | 34 | for line in h: 35 | fields = line.strip().split() 36 | record = dict(zip(blastOutFMT, fields)) 37 | output = [record['sseqid'], record['slen']] ## FIXME: 0 indexing, half open 38 | if int(record['sstart']) < int(record['send']): 39 | output += [ str(int(record['sstart'])-1), record['send'], "+ "] 40 | else: 41 | output += [ str(int(record['send'])-1), record['sstart'], "- "] 42 | output += [record['qseqid'], record['qlen'], str(int(record['qstart'])-1), record['qend'], '+'] 43 | output += ['-3000'] ## a fake score 44 | output += [record['nident'], record['mismatch']] 45 | output += [str(record['qseq'].count('-'))] 46 | output += [str(record['sseq'].count('-'))] 47 | output += ['254'] ## fake mapQV 48 | output += [record['sseq']] 49 | aln = '' 50 | for i,j in zip(record['qseq'],record['sseq']): 51 | aln += '|' if i==j else '*' 52 | output += [aln] 53 | output += [record['qseq']] 54 | 55 | args.outFile.write(' '.join(output)+'\n') 56 | h.close() 57 | 58 | if __name__ == '__main__': 59 | sys.exit(main(sys.argv[1:])) 60 | -------------------------------------------------------------------------------- /utils/blastn_wrapper.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | Q=$1 3 | DB=$2 4 | OUT=$3 5 | NTHREDS=$4 6 | NALN=$5 7 | 8 | blastn -query $Q -task blastn -evalue 0.1 -db $DB -outfmt "6 sseqid sstart send slen qstart qend qlen evalue length nident mismatch gaps sseq qseq qseqid" -num_alignments $NALN -num_threads $NTHREDS > $OUT 9 | 10 | -------------------------------------------------------------------------------- /utils/blosum80.mat: -------------------------------------------------------------------------------- 1 | # Blosum80 2 | # Matrix made by matblas from blosum80.iij 3 | # * column uses minimum score 4 | # BLOSUM Clustered Scoring Matrix in 1/3 Bit Units 5 | # Blocks Database = /data/blocks_5.0/blocks.dat 6 | # Cluster Percentage: >= 80 7 | # Entropy = 0.9868, Expected = -0.7442 8 | GAP-PENALTIES=12 6 6 9 | A R N D C Q E G H I L K M F P S T W Y V B Z X ? 
a g t c u ] n 10 | A 7 -3 -3 -3 -1 -2 -2 0 -3 -3 -3 -1 -2 -4 -1 2 0 -5 -4 -1 -3 -2 -1 -9 -9 -9 -9 -9 -9 -9 -9 11 | R -3 9 -1 -3 -6 1 -1 -4 0 -5 -4 3 -3 -5 -3 -2 -2 -5 -4 -4 -2 0 -2 -9 -9 -9 -9 -9 -9 -9 -9 12 | N -3 -1 9 2 -5 0 -1 -1 1 -6 -6 0 -4 -6 -4 1 0 -7 -4 -5 5 -1 -2 -9 -9 -9 -9 -9 -9 -9 -9 13 | D -3 -3 2 10 -7 -1 2 -3 -2 -7 -7 -2 -6 -6 -3 -1 -2 -8 -6 -6 6 1 -3 -9 -9 -9 -9 -9 -9 -9 -9 14 | C -1 -6 -5 -7 13 -5 -7 -6 -7 -2 -3 -6 -3 -4 -6 -2 -2 -5 -5 -2 -6 -7 -4 -9 -9 -9 -9 -9 -9 -9 -9 15 | Q -2 1 0 -1 -5 9 3 -4 1 -5 -4 2 -1 -5 -3 -1 -1 -4 -3 -4 -1 5 -2 -9 -9 -9 -9 -9 -9 -9 -9 16 | E -2 -1 -1 2 -7 3 8 -4 0 -6 -6 1 -4 -6 -2 -1 -2 -6 -5 -4 1 6 -2 -9 -9 -9 -9 -9 -9 -9 -9 17 | G 0 -4 -1 -3 -6 -4 -4 9 -4 -7 -7 -3 -5 -6 -5 -1 -3 -6 -6 -6 -2 -4 -3 -9 -9 -9 -9 -9 -9 -9 -9 18 | H -3 0 1 -2 -7 1 0 -4 12 -6 -5 -1 -4 -2 -4 -2 -3 -4 3 -5 -1 0 -2 -9 -9 -9 -9 -9 -9 -9 -9 19 | I -3 -5 -6 -7 -2 -5 -6 -7 -6 7 2 -5 2 -1 -5 -4 -2 -5 -3 4 -6 -6 -2 -9 -9 -9 -9 -9 -9 -9 -9 20 | L -3 -4 -6 -7 -3 -4 -6 -7 -5 2 6 -4 3 0 -5 -4 -3 -4 -2 1 -7 -5 -2 -9 -9 -9 -9 -9 -9 -9 -9 21 | K -1 3 0 -2 -6 2 1 -3 -1 -5 -4 8 -3 -5 -2 -1 -1 -6 -4 -4 -1 1 -2 -9 -9 -9 -9 -9 -9 -9 -9 22 | M -2 -3 -4 -6 -3 -1 -4 -5 -4 2 3 -3 9 0 -4 -3 -1 -3 -3 1 -5 -3 -2 -9 -9 -9 -9 -9 -9 -9 -9 23 | F -4 -5 -6 -6 -4 -5 -6 -6 -2 -1 0 -5 0 10 -6 -4 -4 0 4 -2 -6 -6 -3 -9 -9 -9 -9 -9 -9 -9 -9 24 | P -1 -3 -4 -3 -6 -3 -2 -5 -4 -5 -5 -2 -4 -6 12 -2 -3 -7 -6 -4 -4 -2 -3 -9 -9 -9 -9 -9 -9 -9 -9 25 | S 2 -2 1 -1 -2 -1 -1 -1 -2 -4 -4 -1 -3 -4 -2 7 2 -6 -3 -3 0 -1 -1 -9 -9 -9 -9 -9 -9 -9 -9 26 | T 0 -2 0 -2 -2 -1 -2 -3 -3 -2 -3 -1 -1 -4 -3 2 8 -5 -3 0 -1 -2 -1 -9 -9 -9 -9 -9 -9 -9 -9 27 | W -5 -5 -7 -8 -5 -4 -6 -6 -4 -5 -4 -6 -3 0 -7 -6 -5 16 3 -5 -8 -5 -5 -9 -9 -9 -9 -9 -9 -9 -9 28 | Y -4 -4 -4 -6 -5 -3 -5 -6 3 -3 -2 -4 -3 4 -6 -3 -3 3 11 -3 -5 -4 -3 -9 -9 -9 -9 -9 -9 -9 -9 29 | V -1 -4 -5 -6 -2 -4 -4 -6 -5 4 1 -4 1 -2 -4 -3 0 -5 -3 7 -6 -4 -2 -9 -9 -9 -9 -9 -9 -9 -9 30 | B -3 -2 5 6 -6 -1 1 -2 -1 -6 -7 -1 -5 -6 -4 0 -1 -8 -5 -6 6 0 -3 -9 -9 -9 -9 -9 -9 -9 -9 31 | Z -2 0 -1 1 -7 5 6 -4 0 -6 -5 1 -3 -6 -2 -1 -2 -5 -4 -4 0 6 -1 -9 -9 -9 -9 -9 -9 -9 -9 32 | X -1 -2 -2 -3 -4 -2 -2 -3 -2 -2 -2 -2 -2 -3 -3 -1 -1 -5 -3 -2 -3 -1 -2 -9 -9 -9 -9 -9 -9 -9 -9 33 | ? 
-9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 34 | a -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 4 -2 -2 -2 -2 -9 0 35 | g -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -2 4 -2 -2 -2 -9 0 36 | t -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -2 -2 4 -2 4 -9 0 37 | c -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -2 -2 -2 4 -2 -9 0 38 | u -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -2 -2 4 -2 4 -9 0 39 | ] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 40 | n -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 0 -9 0 41 | -------------------------------------------------------------------------------- /utils/buildConsensus.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import sys 5 | import subprocess 6 | import hashlib 7 | import time 8 | from Bio import SeqIO 9 | from Bio.Seq import Seq 10 | from Bio.SeqRecord import SeqRecord 11 | from aligners import * 12 | 13 | # ## for pbdagcon 14 | # ## facilitate our cluster setup 15 | # new_gcc = "/opt/gcc-4.9.3/lib64" 16 | # if not 'LD_LIBRARY_PATH' in os.environ: 17 | # os.environ['LD_LIBRARY_PATH'] = new_gcc + ":" 18 | # elif not new_gcc in os.environ.get('LD_LIBRARY_PATH'): 19 | # os.environ['LD_LIBRARY_PATH'] = new_gcc + ":" + os.environ['LD_LIBRARY_PATH'] 20 | 21 | ################################find primer location (not in use) ############################ 22 | def locate_primer(primer_fwd, primer_rev, consensus, tmp_folder, seqlen): 23 | tmpname = tmp_folder + hashlib.md5("primer").hexdigest() + ".tmp" 24 | tmpRef = tmpname + ".ref.fasta" 25 | tmpQ = tmpname + ".q.fasta" 26 | blastOutFMT = '6 sseqid sstart send slen qstart qend qlen evalue score length nident mismatch gaps qseqid' 27 | with open(tmpQ, 'w') as q_handle: 28 | for primer, name in zip((primer_fwd, primer_rev), ("fwd","rev")): 29 | qrecord = SeqRecord(Seq(primer), 30 | name, 31 | description= "") 32 | SeqIO.write(qrecord, q_handle, "fasta") 33 | 34 | with open(tmpRef, 'w') as q_handle: 35 | qrecord = SeqRecord(Seq(consensus), 36 | "consensus", 37 | description= "") 38 | SeqIO.write(qrecord, q_handle, "fasta") 39 | 40 | stdout = blastn(tmpQ, tmpRef, 4, blastOutFMT, 41 | 1, 0.3, False) 42 | alns = stdout.strip().split("\n") 43 | aln = '' 44 | lowest_e = 1 45 | for line in alns: 46 | fields = line.split("\t") 47 | e_value = float(fields[7]) 48 | if e_value < lowest_e: 49 | lowest_e = e_value 50 | aln = line 51 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 52 | if aln == '': 53 | ## cannot find primer, discard 54 | return None 55 | else: 56 | fields = aln.split("\t") 57 | start, end, length, offset = [ int(x) for x in fields[1:5] ] 58 | 59 | if start < end: 60 | ## mapped in fwd orientation 61 | s_start = max(start - offset, 0) 62 | else: 63 | ## mapped in rev orientation 64 | s_start = min(start + offset, length) 65 | ## restore the correct position 66 | return (consensus[s_start:] + consensus[0:s_start]) 67 | 68 | ################################consensus building############################## 69 | #--------------functions used for consensus building----------- 70 | def median(mylist): 71 | sorts = sorted(mylist) 72 | length = len(sorts) 73 | if length == 0: 74 | return 0 75 | if not length % 2: 76 | return (sorts[length / 2] 
+ sorts[length / 2 - 1]) / 2.0 77 | return sorts[length / 2] 78 | 79 | def get_errors(alignments): 80 | errors = 0 81 | count = 0 82 | if alignments == '\n' or alignments == '': 83 | return sys.maxint 84 | for l in alignments.split('\n'): 85 | if l != '': 86 | count += 1 87 | fields = l.split() 88 | errors += sum([ int(x) for x in fields[12:15] ]) 89 | return errors*1.0/count 90 | 91 | def post_processing(alignments): 92 | ## remove the self-alignments 93 | out = [] 94 | for l in alignments.split('\n'): 95 | if l != '': 96 | fields = l.split() 97 | if fields[0] != fields[5]: 98 | out.append(l) 99 | return '\n'.join(out) 100 | 101 | def segment_filter_orientation(aln): 102 | ## first filter by orientaion 103 | ## also return the lengths of the segments 104 | lengths = [] 105 | coordinates = [] 106 | for (i, j) in zip(aln[0:], aln[1:]): 107 | # check concordance 108 | if (int(i[0]) - int(i[1])) * (int(j[0]) - int(j[1])) > 0: 109 | # concordant prepare output 110 | # gather coordinates on the reads (be more relaxed at two ends) 111 | if int(i[0]) - int(i[1]) < 0: # in forward directions 112 | start = max(int(i[0]) - int(i[3]) + 1, 1) # avoid overshoot to negative coordinates 113 | ##start = int(i[0]) 114 | end = int(j[0]) 115 | else: # in reverse complement 116 | start = int(i[0]) 117 | end = int(j[0]) + int(j[3]) - 1 118 | ##end = int(j[0]) 119 | lengths.append(end - start + 1) 120 | coordinates += [(start,end)] 121 | else: 122 | # discordant 123 | coordinates.append('#') 124 | lengths.append("#") 125 | return (coordinates, lengths) 126 | 127 | def segment_filter_lengths(coordinates, lengths, len_median, len_diff_thre): 128 | coordinates_filtered = [] 129 | for l, c in zip(lengths, coordinates): 130 | if l == '#': 131 | # discordant 132 | coordinates_filtered.append(c) 133 | else: 134 | # concordant 135 | if abs(l-len_median) > len_median*len_diff_thre: 136 | # 0.066 for pacbio #expected length discrepency -- 0.22 (error) * (0.6-0.3) (insertion-deletion) 137 | # wrong length (indication of a chimera) 138 | coordinates_filtered.append("*") 139 | else: 140 | # correct length 141 | coordinates_filtered.append(c) 142 | coordinates_filtered.append("#") ## add a delimiter to the end 143 | return coordinates_filtered 144 | 145 | def segment_filter_longest_strech(coordinates_filtered): 146 | candidate = [] 147 | candidate_cur = [] 148 | for cor in coordinates_filtered: 149 | if cor == "*" or cor == "#": ## segment boundary: 150 | if len(candidate_cur) > len(candidate): 151 | ## found a strech with more segments 152 | candidate = candidate_cur 153 | candidate_cur = [] 154 | else: 155 | candidate_cur.append(cor) 156 | return candidate 157 | 158 | def segment_filters(alnFile, copy_num_thre, len_diff_thre): 159 | # perform filtering in three steps 160 | ## split into segments: 161 | ## the anchors flanking the segment must be concordant 162 | ## 1. in the same direction 163 | ## 2. the length must not be so different 164 | ## 3. the longest strech of concordant segments will be considered 165 | if not alnFile: 166 | return None 167 | 168 | aln = [] 169 | for line in alnFile.strip().split('\n'): 170 | fields = line.split() 171 | aln += [ fields[1:] ] 172 | 173 | ## split into segments: 174 | ## the anchors flanking the segment must be concordant 175 | ## 1. in the same direction 176 | ## 2. the length must not be so different 177 | ## 3. 
the longest strech of concordant segments will be considered 178 | 179 | if len(aln) >= copy_num_thre: 180 | seg_coordinates = [] 181 | ## filter for direction 182 | coordinates, lengths = segment_filter_orientation(aln) 183 | len_median = median([x for x in lengths if x != "#"]) 184 | ## filter for length 185 | coordinates_filtered = segment_filter_lengths(coordinates, lengths, len_median, len_diff_thre) 186 | ## find the longest strech of segments in concordance 187 | candidate = segment_filter_longest_strech(coordinates_filtered) 188 | 189 | sys.stderr.write("Number of segments of the candidate strech: %d\n" %(len(candidate))) 190 | 191 | ## need some copies for correction 192 | if len(candidate) >= copy_num_thre: 193 | seg_coordinates = candidate 194 | sys.stderr.write("Candidate read found!\n") 195 | return seg_coordinates 196 | else: 197 | sys.stderr.write("Not enough alignmets!\n") 198 | return None 199 | else: 200 | sys.stderr.write("Not enough alignmets!\n") 201 | return None 202 | 203 | def pbdagcon(m5, t): 204 | script_dir = os.path.dirname(os.path.realpath(__file__)) 205 | cmd = ("%s/pbdagcon -t %d -c 1 -m 1 %s" % (script_dir, t, m5)).split() 206 | ## hard coded threshold to prevent trimming too many bases 207 | if(t > 100): 208 | return None 209 | proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) 210 | ## if in 5 sec, pbdagcon does not finish, trim 1 base and recursively run it 211 | poll_seconds = 0.25 212 | deadline = time.time() + 5 213 | while time.time() < deadline and proc.poll() == None: 214 | time.sleep(poll_seconds) 215 | 216 | if proc.poll() == None: 217 | proc.terminate() 218 | sys.stderr.write("Warning: PBDAGCON timeout! Trimming %d base(s).\n" %(t+1)) 219 | 220 | stdout, stderr = proc.communicate() 221 | if proc.returncode != 0: 222 | stdout = pbdagcon(m5, t+1) 223 | return stdout 224 | 225 | def segmentize(record, alnFile, copy_num_thre, len_diff_thre, outH): 226 | ## only extract segments without building consensus 227 | seg_coordinates = segment_filters(alnFile, copy_num_thre, len_diff_thre) 228 | if not seg_coordinates: 229 | return None 230 | counter = 0 231 | # write refs (all the subreads) 232 | for s, e in seg_coordinates: 233 | counter += 1 234 | subRead = SeqRecord(record.seq[s-1:e], record.id+'_'+str(counter), description="") 235 | SeqIO.write(subRead, outH, "fasta") 236 | return 1 237 | 238 | #--------------------------------------------------------------------- 239 | def consensus_blastn(record, alnFile, copy_num_thre, len_diff_thre, tmp_folder, seg_cov, iterative): 240 | seg_coordinates = segment_filters(alnFile, copy_num_thre, len_diff_thre) 241 | if not seg_coordinates: 242 | return None 243 | 244 | #### split into segments and call consensus 245 | ## split this read into a multiple fasta file 246 | tmpname = tmp_folder + hashlib.md5(record.id).hexdigest() + ".tmp" 247 | tmpRef = tmpname + ".ref.fasta" 248 | tmpQ = tmpname + ".q.fasta" 249 | 250 | seg_num = len(seg_coordinates) 251 | 252 | ## three fields added to facilate convertion to blasr m5 format 253 | blastOutFMT = '6 sseqid sstart send slen qstart qend qlen evalue score length nident mismatch gaps sseq qseq qseqid' 254 | ## try using each subread as the backbone 255 | alignments = {"alignments":'\n',"num":0, "errors":sys.maxint} 256 | subReads = [] 257 | 258 | counter = 0 259 | # write refs (all the subreads) 260 | ref_handle = open(tmpRef, 'w') 261 | for s, e in seg_coordinates: 262 | counter += 1 263 | subRead = SeqRecord(record.seq[s-1:e], 
record.id+'_'+str(counter), description="") 264 | subReads.append(subRead) 265 | SeqIO.write(subRead, ref_handle, "fasta") 266 | ref_handle.close() 267 | # blast alignment 268 | for subRead in subReads: 269 | q_handle = open(tmpQ, 'w') 270 | SeqIO.write(subRead, q_handle, "fasta") 271 | q_handle.close() 272 | stdout = blastn(tmpQ, tmpRef, None, blastOutFMT, seg_num, seg_cov, True) 273 | num = stdout.count('\n') - 1 274 | errors = get_errors(stdout) 275 | if num > alignments["num"]: 276 | # prefer more alignments 277 | alignments["num"] = num 278 | alignments["errors"] = errors 279 | alignments["alignments"] = stdout 280 | elif num == alignments["num"]: 281 | # for the same number of alignments, prefer the one with lower error rates 282 | if errors < alignments["errors"]: 283 | alignments["num"] = num 284 | alignments["errors"] = errors 285 | alignments["alignments"] = stdout 286 | 287 | copy_num = alignments["alignments"].count("\n") 288 | if copy_num >= copy_num_thre: 289 | with open(tmpname + '.m5', 'w') as outH: 290 | outH.write(post_processing(alignments["alignments"])) 291 | consensus = pbdagcon(tmpname+'.m5', 0) 292 | if consensus == None: 293 | sys.stderr.write("PBDAGCON failed (trimmed more than 100 bases)!\n") 294 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 295 | return None 296 | ## run iteratively 297 | #---------------------------------------- 298 | # write consensus seq 0 299 | if iterative: 300 | delta = 1 301 | consensus_p = consensus 302 | iteration = 0 303 | sys.stderr.write("Iteratively improving consensus\n") 304 | tmpRef_iter = tmpname + '.con.iter.fa' 305 | tmpRef_iter_next = tmpname + '.con.iter.n.fa' 306 | tmpRef_iter_m5 = tmpname + '.con.iter.m5' 307 | 308 | while (delta>0.001 and consensus and iteration<=10): 309 | sys.stderr.write("######################Iteration: %d########################\n" % (iteration+1)) 310 | iteration += 1 311 | with open(tmpRef_iter, 'w') as outH: 312 | outH.write(consensus) 313 | stdout = blastn(tmpRef_iter, tmpRef, None, blastOutFMT, seg_num, seg_cov, True) 314 | copy_num = stdout.count('\n') 315 | ## write new m5 316 | with open(tmpRef_iter_m5, 'w') as outH: 317 | outH.write(stdout) 318 | consensus_p = consensus 319 | consensus = pbdagcon(tmpRef_iter_m5, 0) 320 | if consensus == None: 321 | sys.stderr.write("PBDAGCON failed (trimmed more than 100 bases)!\n") 322 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 323 | return None 324 | with open(tmpRef_iter_next, 'w') as outH: 325 | outH.write(consensus) 326 | # update delta 327 | ## check the % identity between two iterations 328 | tmp = blastn(tmpRef_iter, tmpRef_iter_next, None, blastOutFMT, 1, seg_cov, False) 329 | tmp_len, tmp_iden = tmp.strip().split()[9:11] 330 | delta = 1-float(tmp_iden)/int(tmp_len) 331 | sys.stderr.write("Delta: %f\n" %(delta)) 332 | #---------------------------------------- 333 | consensus = consensus_p 334 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 335 | return (consensus, copy_num) 336 | else: 337 | sys.stderr.write("Not enough aligned copy to correct!\n") 338 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 339 | return None 340 | 341 | 342 | 343 | def consensus_graphmap(record, alnFile, copy_num_thre, len_diff_thre, tmp_folder, seg_cov, iterative): 344 | seg_coordinates = segment_filters(alnFile, copy_num_thre, len_diff_thre) 345 | if not seg_coordinates: 346 | return None 347 | 348 | #### split into segments and call consensus 349 | ## split this read into a multiple 
fasta file 350 | tmpname = tmp_folder + hashlib.md5(record.id).hexdigest() + ".tmp" 351 | tmpRef = tmpname + ".ref.fasta" 352 | tmpQ = tmpname + ".q.fasta" 353 | 354 | ## try using each subread as the backbone, select the backbone with minimal errors 355 | alignments = {"alignments":'\n',"errors":sys.maxint} 356 | subReads = [] 357 | 358 | counter = 0 359 | # write queries (all the subreads) 360 | q_handle = open(tmpQ, 'w') 361 | for s, e in seg_coordinates: 362 | counter += 1 363 | subRead = SeqRecord(record.seq[s-1:e], record.id+'_'+str(counter), description="") 364 | subReads.append(subRead) 365 | SeqIO.write(subRead, q_handle, "fasta") 366 | q_handle.close() 367 | # graphmap 368 | for subRead in subReads: 369 | ref_handle = open(tmpRef, 'w') 370 | SeqIO.write(subRead, ref_handle, "fasta") 371 | ref_handle.close() 372 | stdout = graphmap(tmpQ, tmpRef) 373 | errors = get_errors(stdout) 374 | if errors < alignments["errors"]: 375 | alignments["errors"] = errors 376 | alignments["alignments"] = stdout 377 | ## remove index 378 | tmp = subprocess.check_output("rm %s*" % (tmpRef), shell = True) 379 | 380 | copy_num = alignments["alignments"].count("\n") 381 | if copy_num >= copy_num_thre: 382 | with open(tmpname + '.m5', 'w') as outH: 383 | outH.write(post_processing(alignments["alignments"])) 384 | consensus = pbdagcon(tmpname + '.m5', 0) 385 | if consensus == None: 386 | sys.stderr.write("PBDAGCON failed (trimmed more than 100 bases)!\n") 387 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 388 | return None 389 | 390 | ## run iteratively 391 | #---------------------------------------- 392 | # write consensus seq 0 393 | if iterative: 394 | delta = 1 395 | consensus_p = consensus 396 | copy_num_p = copy_num 397 | iteration = 0 398 | sys.stderr.write("Iteratively improving consensus\n") 399 | tmpRef_iter = tmpname + '.con.iter.fa' 400 | tmpRef_iter_next = tmpname + '.con.iter.n.fa' 401 | tmpRef_iter_m5 = tmpname + '.con.iter.m5' 402 | 403 | while (delta>0.001 and consensus and iteration<10): 404 | sys.stderr.write("######################Iteration: %d########################\n" % (iteration+1)) 405 | iteration += 1 406 | with open(tmpRef_iter, 'w') as outH: 407 | outH.write(consensus) 408 | stdout = graphmap(tmpQ, tmpRef_iter) 409 | ## remove index 410 | tmp = subprocess.check_output("rm %s.*" % (tmpRef_iter), shell = True) 411 | copy_num_p = copy_num 412 | copy_num = stdout.count('\n') 413 | if copy_num < copy_num_p: 414 | sys.stderr.write("Less number of copies found, skip!\n") 415 | break 416 | ## write new m5 417 | with open(tmpRef_iter_m5, 'w') as outH: 418 | outH.write(stdout) 419 | consensus_p = consensus 420 | consensus = pbdagcon(tmpRef_iter_m5, 0) 421 | 422 | ## some cases pbdagcon return empty results 423 | if consensus == None: 424 | sys.stderr.write("PBDAGCON failed (trimmed more than 100 bases)!\n") 425 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 426 | return None 427 | with open(tmpRef_iter_next, 'w') as outH: 428 | outH.write(consensus) 429 | # update delta 430 | ## check the % identity between two iterations 431 | blastOutFMT = '6 sseqid sstart send slen qstart qend qlen evalue score length nident mismatch gaps sseq qseq qseqid' 432 | tmp = blastn(tmpRef_iter, tmpRef_iter_next, None, blastOutFMT, 1, seg_cov, False) 433 | tmp_len, tmp_iden = tmp.strip().split()[9:11] 434 | delta = 1-float(tmp_iden)/int(tmp_len) 435 | sys.stderr.write("Delta: %f\n" %(delta)) 436 | #---------------------------------------- 437 | consensus = consensus_p 
438 | copy_num = copy_num_p 439 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 440 | return (consensus, copy_num) 441 | else: 442 | sys.stderr.write("Not enough aligned copy to correct!\n") 443 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 444 | return None 445 | 446 | ################################################################################ 447 | def consensus_poa(record, alnFile, copy_num_thre, len_diff_thre, tmp_folder): 448 | ## use poaV2 to align the reads and construct consensus using heaviest bundle algorithm 449 | seg_coordinates = segment_filters(alnFile, copy_num_thre, len_diff_thre) 450 | if not seg_coordinates: 451 | return None 452 | 453 | #### split into segments and call consensus 454 | ## split this read into a multiple fasta file 455 | tmpname = tmp_folder + hashlib.md5(record.id).hexdigest() + ".tmp" 456 | tmpFASTA = tmpname + ".fasta" 457 | 458 | counter = 0 459 | # write all the subreads 460 | with open(tmpFASTA, 'w') as h: 461 | for s, e in seg_coordinates: 462 | counter += 1 463 | subRead = SeqRecord(record.seq[s-1:e], record.id+'_'+str(counter), description="") 464 | SeqIO.write(subRead, h, "fasta") 465 | 466 | # run poa 467 | consensus = poa(tmpFASTA, tmpname, ">"+record.id) 468 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 469 | 470 | return (consensus, len(seg_coordinates)) 471 | 472 | ################################################################################## 473 | def consensus_marginAlign(record, alnFile, copy_num_thre, len_diff_thre, tmp_folder, seg_cov, iterative): 474 | seg_coordinates = segment_filters(alnFile, copy_num_thre, len_diff_thre) 475 | if not seg_coordinates: 476 | return None 477 | 478 | #### split into segments and call consensus 479 | ## split this read into a multiple fasta file 480 | tmpname = tmp_folder + hashlib.md5(record.id).hexdigest() + ".tmp" 481 | tmpRef = tmpname + ".ref.fasta" 482 | tmpQ = tmpname + ".q.fastq" 483 | 484 | ## try using each subread as the backbone, select the backbone with minimal errors 485 | alignments = {"alignments":'\n',"errors":sys.maxint} 486 | subReads = [] 487 | 488 | counter = 0 489 | # write queries (all the subreads) 490 | q_handle = open(tmpQ, 'w') 491 | for s, e in seg_coordinates: 492 | counter += 1 493 | subRead = SeqRecord(record.seq[s-1:e], record.id+'_'+str(counter), description="") 494 | subReads.append(subRead) 495 | subRead.letter_annotations["phred_quality"] = [40] * len(subRead) 496 | SeqIO.write(subRead, q_handle, "fastq") 497 | q_handle.close() 498 | # use graphmap to determine the best backbone 499 | # graphmap 500 | best_backbone = None 501 | for subRead in subReads: 502 | ref_handle = open(tmpRef, 'w') 503 | SeqIO.write(subRead, ref_handle, "fasta") 504 | ref_handle.close() 505 | stdout = graphmap(tmpQ, tmpRef) 506 | errors = get_errors(stdout) 507 | if errors < alignments["errors"]: 508 | alignments["errors"] = errors 509 | alignments["alignments"] = stdout 510 | best_backbone = subRead 511 | ## remove index 512 | tmp = subprocess.check_output("rm %s*" % (tmpRef), shell = True) 513 | 514 | sys.stderr.write("Using %s as the backbone\n" % (best_backbone.id)) 515 | # marginAlign 516 | ref_handle = open(tmpRef, 'w') 517 | SeqIO.write(best_backbone, ref_handle, "fasta") 518 | ref_handle.close() 519 | stdout = marginAlign(tmpQ, tmpRef, tmpname+"margin") 520 | ##errors = get_errors(stdout) 521 | ##alignments["errors"] = errors 522 | alignments["alignments"] = stdout 523 | 524 | copy_num = 
alignments["alignments"].count("\n") 525 | if copy_num >= copy_num_thre: 526 | with open(tmpname + '.m5', 'w') as outH: 527 | outH.write(post_processing(alignments["alignments"])) 528 | consensus = pbdagcon(tmpname + '.m5', 0) 529 | 530 | ## run iteratively 531 | #---------------------------------------- 532 | # write consensus seq 0 533 | if iterative: 534 | delta = 1 535 | consensus_p = consensus 536 | iteration = 0 537 | sys.stderr.write("Iteratively improving consensus\n") 538 | tmpRef_iter = tmpname + '.con.iter.fa' 539 | tmpRef_iter_next = tmpname + '.con.iter.n.fa' 540 | tmpRef_iter_m5 = tmpname + '.con.iter.m5' 541 | 542 | while (delta>0.001 and consensus and iteration<10): 543 | sys.stderr.write("######################Iteration: %d########################\n" % (iteration+1)) 544 | iteration += 1 545 | with open(tmpRef_iter, 'w') as outH: 546 | outH.write(consensus) 547 | stdout = marginAlign(tmpQ, tmpRef_iter, tmpname+"margin") 548 | copy_num = stdout.count('\n') 549 | ## write new m5 550 | with open(tmpRef_iter_m5, 'w') as outH: 551 | outH.write(stdout) 552 | consensus_p = consensus 553 | consensus = pbdagcon(tmpRef_iter_m5, 0) 554 | 555 | ## some cases pbdagcon return empty results 556 | if not consensus: 557 | break 558 | 559 | with open(tmpRef_iter_next, 'w') as outH: 560 | outH.write(consensus) 561 | # update delta 562 | ## check the % identity between two iterations 563 | blastOutFMT = '6 sseqid sstart send slen qstart qend qlen evalue score length nident mismatch gaps sseq qseq qseqid' 564 | tmp = blastn(tmpRef_iter, tmpRef_iter_next, None, blastOutFMT, 1, seg_cov, False) 565 | tmp_len, tmp_iden = tmp.strip().split()[9:11] 566 | delta = 1-float(tmp_iden)/int(tmp_len) 567 | sys.stderr.write("Delta: %f\n" %(delta)) 568 | #---------------------------------------- 569 | consensus = consensus_p 570 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 571 | return (consensus, copy_num) 572 | else: 573 | sys.stderr.write("Not enough aligned copy to correct!\n") 574 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 575 | return None 576 | -------------------------------------------------------------------------------- /utils/filter_best_match.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Filter the blast output: 4 | This script takes in a blastn results from find_unit.py, 5 | scans through the alignment, if there are overlapped ones, 6 | keeps the one with lower e value. If e value is the same, 7 | break ties using identity. 8 | 9 | Critical column arrangement of the input: 10 | -outfmt "6 sseqid sstart send slen qstart qend qlen ..." 
11 | 12 | """ 13 | 14 | import os 15 | import sys 16 | import argparse 17 | 18 | # MIN_OVER_LAP = 1 ## minimal overlap length to be considered as the same location (in this application, the results should not be overlapping) 19 | 20 | def getOverlap(a1, a2, b1, b2): 21 | (a_min, a_max) = sorted([a1,a2]) 22 | (b_min, b_max) = sorted([b1,b2]) 23 | return max(0, min(a_max, b_max) - max(a_min, b_min)) 24 | 25 | def get_cor(aln): 26 | cors = aln.strip().split()[0:3] 27 | return (cors[0], min(int(cors[1]), int(cors[2]))) 28 | 29 | 30 | def main(arguments): 31 | parser = argparse.ArgumentParser(description=__doc__) 32 | 33 | parser.add_argument('-i', '--infile', 34 | required = "True", 35 | dest="infile", 36 | help="Input file sorted based on positions of reads") 37 | parser.add_argument('-o', '--outfile', help="Output file with collapsed alignments [Default: stdout]", dest='outfile', 38 | default=sys.stdout, type=argparse.FileType('w')) 39 | parser.add_argument("-c", "--coverage", 40 | default=0.9, 41 | type = float, 42 | dest="coverage", 43 | help="The query coverage threshold [default: 0.9]") 44 | parser.add_argument("-m", "--min_overlap", 45 | default=1, 46 | type = int, 47 | dest="min_overlap", 48 | help="Minimal overlap length to be considered for collapse [default: 1]") 49 | 50 | 51 | args = parser.parse_args(arguments) 52 | 53 | print_flag = 0 ## 0: do not print ;1: safe to print the previous record; 54 | update_flag = 0 ## 0: do not update; 1: update 55 | if args.infile == '-': 56 | inFile = sys.stdin 57 | else: 58 | inFile = open(args.infile, 'rU') 59 | 60 | ## previous_fields = 'sseqid sstart send slen qstart qend qlen evalue bitscore length pident mismatch gaps gapopen'.split(' ') 61 | alns = sorted(inFile.readlines(), key=get_cor) 62 | 63 | if not alns: 64 | ## empty results 65 | return None 66 | 67 | previous_fields = alns[0].strip().split() 68 | if previous_fields != []: 69 | for record in alns[1:]: 70 | fields = record.strip().split() 71 | 72 | if (fields[0] == previous_fields[0]) and getOverlap(int(fields[1]),int(fields[2]),int(previous_fields[1]),int(previous_fields[2])) >= args.min_overlap: 73 | ## The two records overlap, print the one with lower e value 74 | if float(fields[7]) < float(previous_fields[7]): 75 | print_flag = 0 76 | update_flag = 1 77 | elif float(fields[7]) > float(previous_fields[7]): 78 | print_flag = 0 79 | update_flag = 0 80 | else: 81 | ## The two e-values equal 82 | ## compare the identity 83 | if int(fields[9]) > int(previous_fields[9]): 84 | print_flag = 0 85 | update_flag = 1 86 | elif int(fields[9]) <= int(previous_fields[9]): 87 | print_flag = 0 88 | update_flag = 0 89 | else: 90 | ## The two records do not overlap (different reads or non-overlapping regions in the same reads) 91 | print_flag = 1 92 | update_flag = 1 93 | if (print_flag == 1): 94 | if abs(int(previous_fields[4])-int(previous_fields[5]))*1.0/int(previous_fields[6]) >= args.coverage: 95 | args.outfile.write('\t'.join(previous_fields) + "\n") 96 | if (update_flag == 1): 97 | previous_fields = fields 98 | ## print the last record 99 | if abs(int(previous_fields[4])-int(previous_fields[5]))*1.0/int(previous_fields[6]) >= args.coverage: 100 | args.outfile.write('\t'.join(previous_fields) + "\n") 101 | if __name__ == '__main__': 102 | sys.exit(main(sys.argv[1:])) 103 | -------------------------------------------------------------------------------- /utils/findUnit.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import 
os 4 | import sys 5 | import hashlib 6 | 7 | import subprocess 8 | from aligners import * 9 | 10 | from Bio import SeqIO 11 | from Bio.Seq import Seq 12 | from Bio.SeqRecord import SeqRecord 13 | 14 | 15 | ################################################################################ 16 | 17 | ###############################find unit######################################## 18 | def best_aln(aln): 19 | ## find the best aln according to e-value 20 | best_alignment = '' 21 | best_e = 1000 22 | for l in aln.split('\n'): 23 | if l != '': 24 | evalue = float(l.split()[7]) 25 | if evalue <= best_e: 26 | best_alignment = l 27 | best_e = evalue 28 | return best_alignment 29 | 30 | def find_unit_blastn(record, ref_anchor, tmp_folder, seqlen, query_seg_step, query_len, anchor_cov): 31 | tmpname = tmp_folder + hashlib.md5(record.id).hexdigest() + ".tmp" 32 | tmpRef = tmpname + ".ref.fasta" 33 | tmpQ = tmpname + ".q.fasta" 34 | blastOutFMT = '6 sseqid sstart send slen qstart qend qlen evalue score length nident mismatch gaps' 35 | 36 | alignments = {'alignments':'\n', 'number':0} 37 | 38 | ## write the ref seq (single seq) 39 | with open(tmpRef, "w") as ref_handle: 40 | SeqIO.write(record, ref_handle, "fasta") 41 | 42 | if ref_anchor: 43 | ## ref anchor is provided 44 | ## firstly map the ref anchor to the read (best mapping) 45 | ## then extract some more bps from the ref anchor and use it as the new anchor 46 | stdout = blastn(ref_anchor, tmpRef, None, blastOutFMT, 47 | 1, anchor_cov, False) 48 | best_alignment=best_aln(stdout) 49 | if best_alignment != '': 50 | s_start = int((best_aln(stdout)).split()[1]) 51 | with open(tmpQ, 'w') as q_handle: 52 | qrecord = SeqRecord(record.seq[s_start:s_start+query_len], 53 | record.id+ "RefAnchor", 54 | description= "") 55 | SeqIO.write(qrecord, q_handle, "fasta") 56 | stdout = blastn(tmpQ, tmpRef, None, blastOutFMT, 57 | seqlen/query_len + 1, anchor_cov, False) 58 | alignments['number'] = max(stdout.count('\n') - 1,0) 59 | alignments['alignments'] = stdout 60 | else: 61 | ## try different anchors 62 | starts = xrange(0, seqlen/2, query_seg_step) 63 | 64 | for start in starts: 65 | ## write the query seq 66 | with open(tmpQ, 'w') as q_handle: 67 | qrecord = SeqRecord(record.seq[start:(start+query_len)], 68 | record.id+ str(start) + 'to' + str(query_len) + "bps", 69 | description= "") 70 | SeqIO.write(qrecord, q_handle, "fasta") 71 | stdout = blastn(tmpQ, tmpRef, None, blastOutFMT, 72 | seqlen/query_len + 1, anchor_cov, False) 73 | num_alignments = stdout.count('\n') - 1 74 | if num_alignments > alignments['number']: 75 | alignments['number'] = num_alignments 76 | alignments['alignments'] = stdout 77 | 78 | # finished one read, clean tmp files 79 | tmp = subprocess.check_output("rm %s*" % (tmpname), shell = True) 80 | sys.stderr.write("Max number of segments found: %i \n" % (alignments['number'])) 81 | if alignments['alignments'] != '\n' and alignments['alignments'] != '': 82 | return alignments['alignments'] 83 | return None 84 | 85 | ################################################################################ 86 | 87 | -------------------------------------------------------------------------------- /utils/graphmap: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSB5/INC-Seq/10306359c26b78d62f0b59603bf9c6ce0d2f91d5/utils/graphmap -------------------------------------------------------------------------------- /utils/pbdagcon: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSB5/INC-Seq/10306359c26b78d62f0b59603bf9c6ce0d2f91d5/utils/pbdagcon -------------------------------------------------------------------------------- /utils/poa: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CSB5/INC-Seq/10306359c26b78d62f0b59603bf9c6ce0d2f91d5/utils/poa -------------------------------------------------------------------------------- /utils/sam2blasr.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Convert the sam file from graphmap to blasr m5 4 | 5 | """ 6 | 7 | import os 8 | import sys 9 | import argparse 10 | from re import split as rs 11 | from Bio import SeqIO 12 | from collections import deque 13 | 14 | def split_cigar(cigar): 15 | ## split the cigar into a list of tuples 16 | cigar_splitted = rs('(S|I|D|M|X|=|H)', cigar) 17 | return zip(cigar_splitted[0::2], cigar_splitted[1::2]) 18 | 19 | def build_aln(ref, seq, cigar, tstart): 20 | ## construct the alignment 21 | qstart = 1 22 | qend = len(seq) 23 | ## soft clipping 24 | if cigar[0][1] == 'S': 25 | qstart = int(cigar[0][0]) + 1 ## this is one based 26 | tstart = tstart - int(cigar[0][0]) 27 | if cigar[-1][1] == 'S': 28 | qend = qend - int(cigar[-1][0]) 29 | ## hard clipping does not need to be explicitly handled 30 | 31 | tmp_q = deque(seq[qstart-1:]) 32 | tmp_t = deque(ref[tstart + qstart-1:]) 33 | qseq = [] 34 | tseq = [] 35 | mathc = [] 36 | for c in cigar: 37 | length = int(c[0]) 38 | if c[1] in ['M', 'X', '=']: 39 | qseq.extend([tmp_q.popleft() for _i in xrange(length)]) 40 | tseq.extend([tmp_t.popleft() for _i in xrange(length)]) 41 | 42 | elif c[1] == 'I': 43 | qseq.extend([tmp_q.popleft() for _i in xrange(length)]) 44 | tseq.extend(['-'] * length) 45 | elif c[1] == 'D': 46 | tseq.extend([tmp_t.popleft() for _i in xrange(length)]) 47 | qseq.extend(['-'] * length) 48 | match = [] 49 | for i,j in zip(qseq,tseq): 50 | match.append('|' if i==j else '*') 51 | 52 | qseq = ''.join(qseq) 53 | tseq = ''.join(tseq) 54 | match = ''.join(match) 55 | ## blasr output seems to be 0-based, half-open 56 | return [str(qstart-1), str(qend), qseq, match, tseq] 57 | 58 | def main(arguments): 59 | parser = argparse.ArgumentParser(description=__doc__) 60 | parser.add_argument("-i", "--in_sam", 61 | required="True", 62 | dest="sam", 63 | help="The input sam file.") 64 | parser.add_argument("-r", "--reference", 65 | required="True", 66 | dest="ref", 67 | help="Reference sequence(s).") 68 | parser.add_argument("-e", "--evalue", 69 | default=None, 70 | type=float, 71 | dest="e_cutoff", 72 | help="E-value cutoff.") 73 | parser.add_argument("--debug", 74 | action = "store_true", 75 | dest="debug", 76 | help="Only print the alignments") 77 | parser.add_argument('-o', '--outfile', 78 | help="Output file", 79 | dest='outFile', 80 | default=sys.stdout, 81 | type=argparse.FileType('w')) 82 | parser.add_argument("--pacbio", 83 | action = "store_true", 84 | dest="pacbio_flag", 85 | help="Output in PacBio's format") 86 | 87 | args = parser.parse_args(arguments) 88 | # read the reference sequence 89 | ref = {} 90 | refH = open(args.ref, "rU") 91 | for record in SeqIO.parse(refH, "fasta"): 92 | ref[record.id] = list(record.seq) 93 | refH.close() 94 | 95 | if args.sam == '-': 96 | sam = sys.stdin 97 | else: 98 | sam = open(args.sam, 'rU') 99 | 100 | counter = 0 101 | for l in sam: 102 | ## 
skip the headers 103 | if l[0] != '@': 104 | fields = l.strip().split("\t") 105 | # if counter % 100 == 0: 106 | # sys.stderr.write('=') 107 | # only for graphmap output sam files 108 | if args.e_cutoff != None: 109 | e_pass = False 110 | ZE = fields[-4].split(':') 111 | if ZE[0]!='ZE': 112 | sys.exit("Wrong sam specification (ZE)!") 113 | evalue = float(ZE[-1]) 114 | if evalue < args.e_cutoff: 115 | e_pass = True 116 | else: 117 | e_pass = True 118 | 119 | if fields[2] != '*' and e_pass: 120 | counter += 1 121 | cigar = fields[5] 122 | read_seq = fields[9] 123 | tStart = int(fields[3]) - 1 124 | tName = fields[2] 125 | mapQV = fields[4] 126 | samflags = fields[1] 127 | qStart, qEnd, qseq, match, tseq = build_aln(ref[tName], list(read_seq), split_cigar(cigar), tStart) 128 | if args.debug: 129 | print qseq 130 | print match 131 | print tseq 132 | else: 133 | qName = fields[0] 134 | qLength = str(len(read_seq)) 135 | tLength = str(len(tseq)) 136 | tEnd = str(tStart + len(tseq) - tseq.count('-')) 137 | score = '-3000' 138 | numMatch = str(match.count('|')) 139 | numIns = str(qseq.count('-')) 140 | numDel = str(tseq.count('-')) 141 | numMismatch = str(match.count('*') - int(numIns) - int(numDel)) 142 | if args.pacbio_flag: 143 | AS = fields[13].split(":")[-1] 144 | qStrand = "+" if int(samflags)/16%2 == 0 else "-" 145 | percent_id = float(numMatch)/len(match)*100 146 | output = '\t'.join([qName, tName, qStrand, '+', AS, str(percent_id), str(tStart), tEnd, tLength, qStart, qEnd, qLength, "1111"] 147 | ) 148 | else: ## notice m5 format has two spaces after qStrand 149 | output = ' '.join([qName, qLength, qStart, qEnd, '+ ', tName, tLength, str(tStart), tEnd, '+', score, numMatch, numMismatch, numIns, numDel, mapQV, qseq, match, tseq]) 150 | args.outFile.write(output + '\n') 151 | ## qName qLength qStart qEnd qStrand tName tLength tStart tEnd tStrand score numMatch numMismatch numIns numDel mapQV qAlignedSeq matchPattern tAlignedSeq ## 152 | ##sys.stderr.write('\nNumber of aligned reads: ' + str(counter) + '\n') 153 | 154 | 155 | if __name__ == '__main__': 156 | sys.exit(main(sys.argv[1:])) 157 | -------------------------------------------------------------------------------- /utils/sam2blasr2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """Convert the sam file from graphmap to blasr m5 4 | 5 | """ 6 | 7 | import os 8 | import sys 9 | import argparse 10 | import pysam 11 | from Bio import SeqIO 12 | 13 | def build_aln(r, refsq): 14 | rseq = [] 15 | qseq = [] 16 | qstart = r.query_alignment_start 17 | qend = r.query_alignment_end 18 | for (qapos, rpos) in r.get_aligned_pairs(): 19 | ## qapos is the aligned index, i.e. this ignores clipping. 
add that 20 | if qapos is None: 21 | qpos = None 22 | else: 23 | qpos = qapos + qstart 24 | ## qpos and rpos now safe to use, but might be None (indel) 25 | rbase = qbase = "-" 26 | if rpos is not None: 27 | rbase = refsq[rpos].upper() 28 | if qpos is not None: 29 | qbase = r.seq[qpos].upper() 30 | rseq.append(rbase) 31 | qseq.append(qbase) 32 | match = [] 33 | for i,j in zip(qseq,rseq): 34 | match.append('|' if i==j else '*') 35 | return (str(qstart+1), str(qend), ''.join(qseq), ''.join(match) , ''.join(rseq)) 36 | 37 | 38 | def main(arguments): 39 | parser = argparse.ArgumentParser(description=__doc__) 40 | parser.add_argument("-i", "--in_sam", 41 | required="True", 42 | dest="sam", 43 | help="The input sam file.") 44 | parser.add_argument("-r", "--reference", 45 | required="True", 46 | dest="ref", 47 | help="Reference sequence(s).") 48 | parser.add_argument("--debug", 49 | action = "store_true", 50 | dest="debug", 51 | help="Only print the alignments") 52 | parser.add_argument('-o', '--outfile', 53 | help="Output file", 54 | dest='outFile', 55 | default=sys.stdout, 56 | type=argparse.FileType('w')) 57 | 58 | args = parser.parse_args(arguments) 59 | # read the reference sequence 60 | refH = open(args.ref, "rU") 61 | ref = SeqIO.to_dict(SeqIO.parse(refH, "fasta")) 62 | refH.close() 63 | 64 | if args.sam == '-': 65 | sam = sys.stdin 66 | else: 67 | sam = args.sam 68 | 69 | samfile = pysam.AlignmentFile(sam, "r") 70 | 71 | counter = 0 72 | for r in samfile.fetch(): 73 | if counter % 100 == 0: 74 | sys.stderr.write('=') 75 | if not r.is_unmapped: 76 | counter += 1 77 | tName = samfile.getrname(r.reference_id) 78 | refseq = str(ref[tName].seq) 79 | qStart, qEnd, qseq, match, tseq = build_aln(r, refseq) 80 | if args.debug: 81 | print qseq 82 | print match 83 | print tseq 84 | else: 85 | qName = r.query_name 86 | qLength = str(r.query_length) 87 | tLength = str(len(refseq)) 88 | tStart = str(r.reference_start + 1) 89 | tEnd = str(r.reference_end) 90 | score = '-3000' 91 | numMatch = str(match.count('|')) 92 | numIns = str(qseq.count('-')) 93 | numDel = str(tseq.count('-')) 94 | numMismatch = str(match.count('*') - int(numIns) - int(numDel)) 95 | mapQV = '254' 96 | output = ' '.join([qName, qLength, qStart, qEnd, '+', tName, tLength, tStart, tEnd, '+', score, numMatch, numMismatch, numIns, numDel, mapQV, qseq, match, tseq]) 97 | 98 | args.outFile.write(output + '\n') 99 | ## qName qLength qStart qEnd qStrand tName tLength tStart tEnd tStrand score numMatch numMismatch numIns numDel mapQV qAlignedSeq matchPattern tAlignedSeq ## 100 | sys.stderr.write('\nNumber of aligned reads: ' + str(counter) + '\n') 101 | 102 | 103 | if __name__ == '__main__': 104 | sys.exit(main(sys.argv[1:])) 105 | --------------------------------------------------------------------------------
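Both sam2blasr scripts convert graphmap SAM output into the blasr m5 format that pbdagcon consumes. A minimal standalone sketch, assuming `segments.sam` holds graphmap alignments of segment reads against a single backbone sequence in `backbone.fa` (both file names are hypothetical):

```sh
# Convert the graphmap SAM to blasr m5 (sam2blasr2.py requires pysam and Biopython)
./utils/sam2blasr2.py -i segments.sam -r backbone.fa -o segments.m5
# Run the bundled pbdagcon with the same flags the pipeline's wrapper hard-codes;
# the consensus sequence is written to stdout
./utils/pbdagcon -t 0 -c 1 -m 1 segments.m5 > consensus.fa
```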