├── Dockerfile
├── LICENSE.txt
├── README.md
├── debreak_detect.py
├── debreak_merge.py
├── debreak_merge_clustering.py
├── denovo_baseerror.py
├── denovo_correct.py
├── denovo_plot.py
├── denovo_static.py
├── inspector-correct.py
├── inspector.py
└── testdata
├── contig_test.fa
└── read_test.fastq.gz
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM ubuntu:18.04
2 |
3 |
4 | RUN apt-get update \
5 | && apt-get install -y --no-install-recommends \
6 | build-essential \
7 | bzip2 \
8 | curl \
9 | git \
10 | less \
11 | sudo \
12 | vim \
13 | wget \
14 | zlib1g-dev \
15 | libbz2-dev \
16 | liblzma-dev \
17 | && rm -rf /var/lib/apt/lists/*
18 |
19 | RUN sudo apt -y update
20 | RUN sudo apt -y upgrade
21 | RUN sudo apt -y install python2.7 python-pip
22 | RUN pip install pysam
23 | RUN python -m pip install -U matplotlib
24 | RUN pip install statsmodels==0.10.1
25 |
26 |
27 | RUN curl -L https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 | tar -jxvf -
28 | WORKDIR samtools-1.9
29 | RUN ./configure --without-curses
30 | RUN make && make install
31 | WORKDIR ..
32 |
33 |
34 | RUN curl -L https://github.com/lh3/minimap2/releases/download/v2.15/minimap2-2.15_x64-linux.tar.bz2 | tar -jxvf -
35 | ENV PATH="minimap2-2.15_x64-linux/:${PATH}"
36 |
37 | RUN pip install setuptools
38 | RUN git clone https://github.com/fenderglass/Flye
39 | WORKDIR Flye
40 | RUN git checkout tags/2.8.3 -b inspector-flye
41 | RUN python setup.py install
42 | WORKDIR ..
43 | RUN git clone https://github.com/Maggi-Chen/Inspector.git
44 | ENV PATH="Inspector/:${PATH}"
45 |
46 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License
2 |
3 | Copyright (c) 2020- University of Alabama at Birmingham
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining
6 | a copy of this software and associated documentation files (the
7 | "Software"), to deal in the Software without restriction, including
8 | without limitation the rights to use, copy, modify, merge, publish,
9 | distribute, sublicense, and/or sell copies of the Software, and to
10 | permit persons to whom the Software is furnished to do so, subject to
11 | the following conditions:
12 |
13 | The above copyright notice and this permission notice shall be
14 | included in all copies or substantial portions of the Software.
15 |
16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
20 | BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
21 | ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
22 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
23 | SOFTWARE.
24 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Inspector
2 |
3 | A reference-free assembly evaluator.
4 |
5 | Author: Maggi Chen
6 |
7 | Email: maggic@uab.edu
8 |
9 | Draft date: Apr. 20, 2021
10 |
11 | ## Quick Start
12 | ```sh
13 | git clone https://github.com/ChongLab/Inspector.git
14 | cd Inspector/
15 | ./inspector.py -h
16 |
17 | # Evaluate assembly with raw reads
18 | inspector.py -c contig.fa -r rawreads.1.fastq rawreads.2.fastq -o inspector_out/ --datatype clr
19 | # Evaluate assembly with hifi reads
20 | inspector.py -c contig.fa -r ccsreads.1.fastq ccsreads.2.fastq -o inspector_out/ --datatype hifi
21 |
22 | # With reference-based evaluation
23 | inspector.py -c contig.fa -r rawreads.1.fastq --ref reference.fa -o inspector_out/ --datatype clr
24 |
25 | # Reference-based only evaluation
26 | inspector.py -c contig.fa -r emptyfile --ref reference.fa -o inspector_out/
27 |
28 | # Error correction
29 | inspector-correct.py -i inspector_out/ --datatype pacbio-hifi -o inspector_out/
30 |
31 | ```
32 |
33 |
34 |
35 | ## Description
36 |
37 | Inspector is a tool for assembly evaluation with long read data. The input includes a contig file, long reads (PacBio CLR, PacBio HiFi, Oxford Nanopore, or mixed platform), and a reference genome (optional). The output includes A summary report, read-to-contig alignment file, a list of structrual errors and small-scale errors. This program was tested on a x86_64 Linux system with a 128GB physical memory.
38 |
39 | ## Depencency
40 |
41 | Dependencies for Inspector:
42 |
43 | * python
44 | * pysam
45 | * statsmodels (tested with version 0.10.1)
46 |
47 | * minimap2 (tested with version 2.10 and 2.15)
48 | * samtools (tested with version 1.9)
49 |
50 |
51 | Dependencies for Inspector error correction module:
52 | * flye (tested with version 2.8.3)
53 |
54 |
55 | ## Installation
56 |
57 | To create an environment with conda or mamba (recommended):
58 | ```
59 | mamba create --name ins inspector
60 | mamba activate ins
61 |
62 | ```
63 | Git install after installing all the dependencies.
64 | ```
65 | git clone https://github.com/ChongLab/Inspector.git
66 | export PATH=$PWD/Inspector/:$PATH
67 | ```
68 |
69 |
70 |
71 | A subset of human genome assembly is available as testing dataset to validate successful installation. The contig_test.fa includes two contigs (1.4Mbp and 10Kbp). The read_test.fastq.gz includes ~60X PacBio HiFi reads belonging to these two contigs. There are 3 structural errors and 281 small-scale errors present in the testing dataset.
72 | ```
73 | cd Inspector/
74 | ./inspector.py -c testdata/contig_test.fa -r testdata/read_test.fastq.gz -o test_out/ --datatype hifi
75 | ./inspector-correct.py -i test_out/ --datatype pacbio-hifi
76 | ```
77 | (The Inspector evaluation on testing dataset should finish within several minutes with 4 CPUs and 400MB memory.
78 | The Inspector error correction should finish within 10-15 minutes with 4 CPUs and 500MB memory.)
79 |
80 |
81 | ## General usage
82 |
83 |
84 | ```
85 |
86 | inspector.py [-h] -c contig.fa -r raw_reads.fa -o output_dict/
87 | required arguments:
88 | --contig,-c FASTA/FASTQ file containing contig sequences to be evaluated
89 | --read,-r A list of FASTA/FASTQ files containing long read sequences
90 |
91 | optional arguments:
92 | -h, --help Show this help message and exit
93 | --version Show program's version number and exit
94 | --datatype,-d Input read type. (clr, hifi, nanopore, mixed) [clr]
95 | --ref OPTIONAL reference genome in .fa format
96 | --thread,-t Number of threads. [8]
97 | --min_contig_length Minimal length for a contig to be evaluated [10000]
98 | --min_contig_length_assemblyerror Minimal contig length for assembly error detection. [1000000]
99 | --pvalue Maximal p-value for small-scale error identification [0.01 for HiFi, 0.05 for others]
100 | --skip_read_mapping Skip the step of mapping reads to contig
101 | --skip_structural_error Skip the step of identifying large structural errors
102 | --skip_structural_error_detect Skip the step of detecting large structural errors
103 | --skip_base_error Skip the step of identifying small-scale errors
104 | --skip_base_error_detect Skip the step of detecting small-scale errors from pileup
105 |
106 |
107 |
108 | inspector-correct.py [-h] -i inspector_out/ --datatype pacbio-raw
109 | required arguments:
110 | --inspector,-i Inspector evaluation directory with original file names
111 | --datatype Type of read used for Inspector evaluation. Required for structural error correction
112 | --outpath,-o Output directory
113 | --flyetimeout Maximal runtime for local assembly with Flye
114 | --thread,-t Number of threads
115 | --skip_structural Do not correct structural errors. Local assembly will not be performed
116 | --skip_baseerror Do not correct small-scale errors
117 |
118 |
119 | ```
120 |
121 | ## Use cases
122 | Inspector evaluates the contigs and identifies assembly errors with sequencing reads. You can use reads from single platform:
123 | ```
124 | inspector.py -c contig.fa -r rawreads.1.fastq rawreads.2.fastq -o inspector_out/ --datatype clr
125 | ```
126 | Or use a mixed data type:
127 | ```
128 | inspector.py -c contig.fa -r rawreads.fastq nanopore.fastq -o inspector_out/ --datatype mixed
129 | ```
130 | Reference-based evaluation is also supported:
131 | (Note that reported assembly error from reference-based mode will contain genetic variants)
132 | ```
133 | inspector.py -c contig.fa -r rawreads.fastq --ref reference.fa -o inspector_out/ --datatype clr
134 | ```
135 | If only the continuity analysis is needed, simply provide an empty file for --read:
136 | ```
137 | inspector.py -c contig.fa -r emptyfile -o inspector_out/ --skip_base_error --skip_structural_error
138 | ```
139 | For the '--skip' options, do not use unless you are repeating the evaluation with same contig and read files in the same output directory. These may help save time when testing different options for error detection.
140 |
141 |
142 |
143 | Inspector provides an error-correction module to improve assembly accuracy. High-accuracy reads are recommended, especially for small-scale error correction:
144 | (Note that only reads from single platform are supported for error correction.)
145 | ```
146 | inspector-correct.py -i inspector_out/ --datatype pacbio-hifi -o inspector_out/
147 | ```
148 |
149 |
150 | ## Output file descriptions
151 | Inspector writes its evaluation reports into a new directory to avoid conflicts in file names. Inspector error correction uses the evaluationary results to generate corrected assembly. The output directory of Inspector includes:
152 |
153 | ### summary_statistics
154 | An evaluation report of the input assembly. This file includes the contig continuity statistics reports, the read mapping summary, number of structural and small-scale errors, the QV score, and the contig alignment summary from reference-based mode when available. An assembly with expected total length, high read-to-contig mapping rate, low number of structural and small-scale errors, and high QV score indicates a high assembly quality. When the reference genome is provided, a higher genome coverage and NA50 also indicates more complete assembly.
155 |
156 | ### structural_error.bed
157 | This file includes all structural errors identified in the assembly.
158 | The first, second and third column indicate the contig, start and end position of structural error. For Expansions and Inversions, the size of error equals the distance between start and end position. For Collapses, the collapsed sequences are missing in the contigs, therefore the EndPosition is StartPosition+1. The length of collapsed sequence should be inferred from the Size column. For HaplotypeSwitches, Inspector normally considers the haplotype containing Expansion-like pattern as haplotype 1, and considers the haplotype with Collapse-like pattern as haplotype 2. The position of error in haplotype1 and haplotype2 are separated by ";" in the second and third columns.
159 | The fourth column indicates the number of error-supporting reads. A high number of error-supporting reads indicates a confident error call.
160 | The fifth and sixth column indicates the type and size of the erorr. For HaplotypeSwitch, the error sizes in the two haplotypes are usually different. Inspector normally lists the size in haplotype1 and haplotype2, corresponding to the position columns.
161 | Column seven to twelve include other information about the structural errors. These are kept for developmental purpose.
162 |
163 | ### small_scale_error.bed
164 | This file includes all small-scale errors identified in the assembly.
165 | The first, second and third column indicate the contig, start and end position of small-scale errors. Similar to the structural errors, the distance between StartPosition and EndPosition equals error size for small expansions and equals 1 for small collapses.
166 | The fourth and fifth column indicate the base in the contig and in the reads.
167 | The sixth and seventh column indicate the number of error-supporting reads and the local sequencing depth. A high supporting-read-to-depth ratio means a confident error call.
168 | The eighth column indicates the type of error.
169 | The nineth column indicates the p-value from binominal test.
170 |
171 | ### contig_corrected.fa
172 | The output corrected assembly of Inspector error-correction module.
173 | Only contigs contained in the valid_contig.fa file (longer than --min_contig_length) are corrected. The small-scale errors listed in small_scale_error.bed should all be fixed. The structural errors in structural_error.bed are fixed if the local de novo assembly generates a full-length contig that can be confidently aligned to the original error region. Otherwise, the original sequence will be remained.
174 |
175 |
176 |
177 | ## Citing Inspector
178 | If you use Inspector, please cite
179 | > Chen, Y., Zhang, Y., Wang, A.Y. et al. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol 22, 312 (2021). https://doi.org/10.1186/s13059-021-02527-4
180 |
--------------------------------------------------------------------------------
/debreak_detect.py:
--------------------------------------------------------------------------------
1 | import pysam
2 | import time
3 | import os
4 |
5 | def cigardeletion_ref(flag,chrom,position,cigar,min_size,max_size):
6 | pos=int(position)
7 | numbers='1234567890'
8 | num=''
9 | reflen=0
10 | readlen=0
11 | leftclip=0
12 | rightclip=0
13 | deletions=[]
14 | insertions=[]
15 | for c in cigar:
16 | if c in numbers:
17 | num+=c; continue
18 | if c in 'MNP=X':
19 | readlen+=int(num); reflen+=int(num); num=''; continue
20 | if c=='I':
21 | if int(num)>=min_size and int(num)<=max_size:
22 | insertions+=[[chrom,pos+reflen,int(num),'I-cigar',readlen+leftclip]]
23 | readlen+=int(num)
24 | num=''; continue
25 | if c == 'D':
26 | if int(num)>=min_size and int(num)<=max_size:
27 | deletions+=[[chrom,pos+reflen,int(num),'D-cigar',readlen+leftclip]]
28 | reflen+=int(num); num=''; continue
29 |
30 | if c in 'SH':
31 | if readlen==0:
32 | leftclip=int(num)
33 | else:
34 | rightclip=int(num)
35 | num=''; continue
36 | testif=1
37 | window=500
38 | while testif==1:
39 | testif=0
40 | if len(deletions)==1:
41 | break
42 | i=len(deletions)-1
43 | while i>0:
44 | gaplength=deletions[i][1]-deletions[i-1][1]-deletions[i-1][2]
45 | if gaplength <= window:
46 | deletions=deletions[:i-1]+[[chrom,deletions[i-1][1],deletions[i-1][2]+deletions[i][2],'D-cigar',deletions[i-1][4]]]+deletions[i+1:]
47 | testif=1;break
48 | else:
49 | i-=1
50 | testif=1
51 | while testif==1:
52 | testif=0
53 | if len(insertions)==1:
54 | break
55 | i=len(insertions)-1
56 | while i>0:
57 | gaplength=insertions[i][1]-insertions[i-1][1]
58 | l1=insertions[i][2]
59 | l2=insertions[i-1][2]
60 | window=200 if max(l1,l2)<100 else 400
61 | window=400 if window==400 and max(l1,l2) <500 else 600
62 | if gaplength >window :
63 | i-=1
64 | else:
65 | insertions=insertions[:i-1]+[[chrom,insertions[i-1][1],l1+l2,'I-cigar',insertions[i-1][4]]]+insertions[i+1:]
66 | testif=1;break
67 |
68 | svcallset=deletions+insertions
69 | return [svcallset,reflen,[leftclip,readlen,rightclip]]
70 |
71 |
72 | def cigardeletion(flag,chrom,position,cigar,min_size,max_size): #input a read line, return list of deletions
73 | flag=int(flag)
74 | if flag<=16:
75 | detect_cigar_sv=True
76 | else:
77 | detect_cigar_sv=True
78 | pos=int(position)
79 | numbers='1234567890'
80 | num=''
81 | reflen=0
82 | readlen=0
83 | leftclip=0
84 | rightclip=0
85 | deletions=[]
86 | insertions=[]
87 | for c in cigar:
88 | if c in numbers:
89 | num+=c
90 | continue
91 | if c in 'MNP=X':
92 | readlen+=int(num); reflen+=int(num); num=''; continue
93 | if c=='I':
94 | if detect_cigar_sv and int(num)>=min_size and int(num)<=max_size:
95 | insertions+=[[chrom,pos+reflen,int(num),'I-cigar',readlen]]
96 | readlen+=int(num)
97 | num=''; continue
98 | if c == 'D':
99 | if detect_cigar_sv and int(num)>=min_size and int(num)<=max_size:
100 | deletions+=[[chrom,pos+reflen,int(num),'D-cigar']]
101 | reflen+=int(num); num=''; continue
102 | if c in 'SH':
103 | if readlen==0:
104 | leftclip=int(num)
105 | else:
106 | rightclip=int(num)
107 | num=''; continue
108 | #merge deletions
109 | if detect_cigar_sv:
110 | testif=1
111 | window=500
112 | while testif==1:
113 | testif=0
114 | if len(deletions)==1:
115 | break
116 | i=len(deletions)-1
117 | while i>0:
118 | gaplength=deletions[i][1]-deletions[i-1][1]-deletions[i-1][2]
119 | if gaplength <= window:
120 | deletions=deletions[:i-1]+[[chrom,deletions[i-1][1],deletions[i-1][2]+deletions[i][2],'D-cigar']]+deletions[i+1:]
121 | testif=1
122 | break
123 | else:
124 | i-=1
125 | #merge insertions
126 | testif=1
127 | while testif==1:
128 | testif=0
129 | if len(insertions)==1:
130 | break
131 | i=len(insertions)-1
132 | while i>0:
133 | l1=insertions[i][2]
134 | l2=insertions[i-1][2]
135 | gaplength=insertions[i][1]-insertions[i-1][1]
136 | window=200 if max(l1,l2)<100 else 400
137 | window=400 if window==400 and max(l1,l2) <500 else 600
138 | if gaplength >window :
139 | i-=1
140 | else:
141 | insertions=insertions[:i-1]+[[chrom,insertions[i-1][1],l1+l2,'I-cigar',insertions[i-1][4]]]+insertions[i+1:]
142 | testif=1
143 | break
144 |
145 | svcallset=deletions+insertions
146 | return [svcallset,reflen,[leftclip,readlen,rightclip]]
147 |
148 |
149 | def segmentdeletion_ref(segments,min_size,max_size,if_contig):
150 | segments=[c for c in segments if c[5][1]>=min(0.05*(c[5][0]+c[5][1]+c[5][2]),20000)]
151 | if len(segments)<=1:
152 | return []
153 | svcallset=[]
154 | for iii in range(len(segments)-1):
155 | primary=segments[iii]
156 | chrom=primary[2]
157 | priflag=(int(primary[1])%32)>15
158 | samedirchr=[];samechr=[];diffchr=[]
159 | restsegments=segments[iii+1:]
160 | for c in restsegments:
161 | ch=c[2]; f=int(c[1])%32>15
162 | if c[5][1]<300:
163 | continue
164 | if ch!=chrom:
165 | diffchr+=[c]
166 | elif f!=priflag:
167 | samechr+=[c]
168 | else:
169 | samedirchr+=[c]
170 | for c in samedirchr:
171 | if c[3]>primary[3] and c[4]-primary[4]>-200:
172 | leftread=primary; rightread=c
173 | elif c[3]-200:
174 | leftread=c; rightread=primary
175 | else:
176 | continue
177 | leftinfo=leftread[5]
178 | rightinfo=rightread[5]
179 | window=300
180 | if if_contig:
181 | window=min(2000,leftinfo[1]//2,rightinfo[1]//2)
182 | if abs(rightread[3]-leftread[4])<=window:
183 | overlap=rightread[3]-leftread[4]
184 | ins_size=rightinfo[0]-leftinfo[1]-leftinfo[0]-overlap
185 | if min_size<=ins_size<=max_size:
186 | if not priflag:
187 | svcallset+=[chrom+'\t'+str(min(rightread[3],leftread[4]))+'\t'+str(ins_size)+'\t'+'I-segment'+'\t'+primary[0]+'\t'+str(int(c[1])+int(primary[1]))+'\t'+str((int(c[6])+int(primary[6]))//2)+'\t'+str(leftinfo[0]+leftinfo[1])]
188 | else:
189 | svcallset+=[chrom+'\t'+str(min(rightread[3],leftread[4]))+'\t'+str(ins_size)+'\t'+'I-segment'+'\t'+primary[0]+'\t'+str(int(c[1])+int(primary[1]))+'\t'+str((int(c[6])+int(primary[6]))//2)+'\t'+str(leftinfo[2])]
190 |
191 | overlapmap=leftinfo[0]+leftinfo[1]-rightinfo[0]
192 | window_max=2000 #Test for rescue FN
193 | overlap_window=-200
194 | if if_contig:
195 | overlap_window=-5000
196 | window_max=5000
197 | if overlap_windowprimary[3] and c[4]-primary[4]>-200:
207 | leftread=primary; rightread=c
208 | elif c[3]-200:
209 | leftread=c; rightread=primary
210 | else:
211 | continue
212 | leftinfo=leftread[5]
213 | rightinfo=rightread[5]
214 | window_max=500
215 | overlap_window=-200
216 | if if_contig:
217 | overlap_window=-2000
218 | overlapmap=rightinfo[0]+rightinfo[1]-leftinfo[2]
219 | if overlap_window=max(100,overlapmap):
220 | inv_size=rightread[4]-leftread[4]-overlapmap
221 | if min_size<=inv_size<=max_size:
222 | if int(leftread[1]) % 32 <16:
223 | svcallset+=[chrom+'\t'+str(leftread[4])+'\t'+str(inv_size)+'\t'+'INV-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))//2)+'\t'+str(leftinfo[0]+leftinfo[1])]
224 | else:
225 | svcallset+=[chrom+'\t'+str(leftread[4])+'\t'+str(inv_size)+'\t'+'INV-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))//2)+'\t'+str(leftinfo[2])]
226 | continue
227 | overlapmap=rightinfo[1]+rightinfo[2]-leftinfo[0]
228 | if overlap_window=max(100,overlapmap):
229 | inv_size=rightread[3]-leftread[3]-overlapmap
230 | if min_size<=inv_size<=max_size:
231 | if int(leftread[1]) % 32 <16:
232 | svcallset+=[chrom+'\t'+str(leftread[3])+'\t'+str(inv_size)+'\t'+'INV-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))//2)+'\t'+str(leftinfo[0])]
233 | else:
234 | svcallset+=[chrom+'\t'+str(leftread[3])+'\t'+str(inv_size)+'\t'+'INV-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))//2)+'\t'+str(leftinfo[1]+leftinfo[2])]
235 | continue
236 | return svcallset
237 |
238 |
239 |
240 |
241 |
242 |
243 | def segmentdeletion(segments,min_size,max_size,if_contig): #input a list of segments,return list of deletions
244 | if len([c for c in segments if int(c[1])<=16])==0:
245 | return []
246 | segments=[c for c in segments if c[5][1]>=min(0.05*(c[5][0]+c[5][1]+c[5][2]),20000)]
247 | if len(segments)<=1:
248 | return []
249 | svcallset=[]
250 | for iii in range(len(segments)-1):
251 | primary=segments[iii]
252 | chrom=primary[2]
253 | priflag=(int(primary[1])%32)>15
254 | samedirchr=[]
255 | samechr=[]
256 | diffchr=[]
257 | restsegments=segments[iii+1:]
258 | for c in restsegments:
259 | ch=c[2]
260 | f=int(c[1])%32>15
261 | if c[5][1]<300:
262 | continue
263 | if ch!=chrom:
264 | diffchr+=[c]
265 | elif f!=priflag:
266 | samechr+=[c]
267 | else:
268 | samedirchr+=[c]
269 | for c in samedirchr:
270 | if c[3]>primary[3] and c[4]-primary[4]>-200:
271 | leftread=primary
272 | rightread=c
273 | elif c[3]-200:
274 | leftread=c
275 | rightread=primary
276 | else:
277 | continue
278 | leftinfo=leftread[5]
279 | rightinfo=rightread[5]
280 | #insertion:
281 | window=300
282 | if if_contig:
283 | window=min(2000,leftinfo[1]//2,rightinfo[1]//2)
284 | if abs(rightread[3]-leftread[4])<=window:
285 | overlap=rightread[3]-leftread[4]
286 | ins_size=rightinfo[0]-leftinfo[1]-leftinfo[0]-overlap
287 | if min_size<=ins_size<=max_size:
288 | svcallset+=[chrom+'\t'+str(min(rightread[3],leftread[4]))+'\t'+str(ins_size)+'\t'+'I-segment'+'\t'+primary[0]+'\t'+str(int(c[1])+int(primary[1]))+'\t'+str((int(c[6])+int(primary[6]))//2)]
289 |
290 | #deletion:
291 | overlapmap=leftinfo[0]+leftinfo[1]-rightinfo[0]
292 | #window_max=1500
293 | window_max=2000 #Test for rescue FN
294 | overlap_window=-200
295 | if if_contig:
296 | overlap_window=-5000
297 | window_max=5000
298 | if overlap_window=max(50,overlapmap):
308 | dup_size=leftread[4]-rightread[3]-max(overlapmap,0)
309 | if min_size<=dup_size<=max_size:
310 | svcallset+=[chrom+'\t'+str(rightread[3])+'\t'+str(dup_size)+'\t'+'I-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))/2)]
311 | overlapmap=rightinfo[0]+rightinfo[1]-leftinfo[0]
312 | if -200=max(1000,overlapmap):
313 | dup_size=rightread[4]-leftread[3]-overlapmap
314 | if min_size<=dup_size<=max_size:
315 | svcallset+=[chrom+'\t'+str(leftread[3])+'\t'+str(dup_size)+'\t'+'I-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))/2)]
316 | '''
317 | #inversion:
318 | for c in samechr:
319 | if c[3]>primary[3] and c[4]-primary[4]>-200:
320 | leftread=primary
321 | rightread=c
322 | elif c[3]-200:
323 | leftread=c
324 | rightread=primary
325 | else:
326 | continue
327 | leftinfo=leftread[5]
328 | rightinfo=rightread[5]
329 | window_max=500
330 | overlap_window=-200
331 | if if_contig:
332 | overlap_window=-2000
333 | overlapmap=rightinfo[0]+rightinfo[1]-leftinfo[2]
334 | if overlap_window=max(100,overlapmap):
335 | inv_size=rightread[4]-leftread[4]-overlapmap
336 | if min_size<=inv_size<=max_size:
337 | svcallset+=[chrom+'\t'+str(leftread[4])+'\t'+str(inv_size)+'\t'+'INV-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))//2)]
338 | continue
339 | overlapmap=rightinfo[1]+rightinfo[2]-leftinfo[0]
340 | if overlap_window=max(100,overlapmap):
341 | inv_size=rightread[3]-leftread[3]-overlapmap
342 | if min_size<=inv_size<=max_size:
343 | svcallset+=[chrom+'\t'+str(leftread[3])+'\t'+str(inv_size)+'\t'+'INV-segment'+'\t'+primary[0]+'\t'+str(c[1])+'\t'+str((int(c[6])+int(primary[6]))//2)]
344 | continue
345 | return svcallset
346 |
347 |
348 |
349 |
350 | def detect_sortbam(workpath,min_size,max_size,chrom):
351 | f=pysam.AlignmentFile(workpath+'read_to_contig.bam', "rb")
352 | segmentreads={}
353 | tempfile=open(workpath+'debreak_workspace/read_to_contig_'+chrom+'.debreak.temp','w')
354 | totalmaplength=0
355 | number_read=0
356 | split_num=0
357 |
358 | for align in f.fetch(chrom,):
359 | if align.is_secondary:
360 | continue
361 | readname=align.query_name
362 | flag=align.flag
363 | position=align.reference_start+1
364 | refend=align.reference_end+1
365 | cigar_info=[0,0,0]
366 | if align.cigar[0][0] in [4,5]:
367 | cigar_info[0]=align.cigar[0][1]
368 | if align.cigar[-1][0] in [4,5]:
369 | cigar_info[2]=align.cigar[-1][1]
370 | cigar_info[1]=align.query_alignment_length
371 | mappingquality=align.mapping_quality
372 | readinfo=[readname,flag,chrom,position,refend,cigar_info,mappingquality]
373 | if align.is_supplementary:
374 | cigar=align.cigarstring
375 | cigarinfo=cigardeletion(flag,chrom,position,cigar,5,max_size)
376 | cigarsv=[mm for mm in cigarinfo[0] if int(mm[2])>=min_size]
377 | for d in cigarsv:
378 | tempfile.write(d[0]+'\t'+str(d[1])+'\t'+str(d[2])+'\t'+d[3]+'\t'+readname+'\t'+str(flag)+'\t'+str(mappingquality)+'\n')
379 |
380 | pri_chrom=align.get_tag("SA").split(',')[0]
381 | if pri_chrom!=chrom:
382 | continue
383 | else:
384 | if readname not in segmentreads:
385 | segmentreads[readname]=[readinfo]
386 | else:
387 | segmentreads[readname]+=[readinfo]
388 |
389 | else:
390 | totalmaplength+=align.query_length
391 | number_read+=1
392 | cigar=align.cigarstring
393 | cigarinfo=cigardeletion(flag,chrom,position,cigar,5,max_size)
394 | cigarsv=[mm for mm in cigarinfo[0] if int(mm[2])>=min_size]
395 | for d in cigarsv:
396 | tempfile.write(d[0]+'\t'+str(d[1])+'\t'+str(d[2])+'\t'+d[3]+'\t'+readname+'\t'+str(flag)+'\t'+str(mappingquality)+'\n')
397 |
398 | if align.has_tag("SA"):
399 | if align.mapping_quality > 50:
400 | split_num+=1
401 | if chrom in [c.split(',')[0] for c in align.get_tag("SA").split(';')[:-1]]:
402 | if readname not in segmentreads:
403 | segmentreads[readname]=[readinfo]
404 | else:
405 | segmentreads[readname]+=[readinfo]
406 |
407 | for readgroup in segmentreads:
408 | if len(segmentreads[readgroup])<2 or len(segmentreads[readgroup])>20:
409 | continue
410 | segmentsv=segmentdeletion(segmentreads[readgroup],min_size,max_size,False)
411 | for d in segmentsv:
412 | tempfile.write(d+'\n')
413 | tempfile.close()
414 | f.close()
415 | if totalmaplength!=0:
416 | f=open(workpath+'map_depth/maplength_large_'+chrom,'w')
417 | f.write(str(totalmaplength)+'\n')
418 | f.close()
419 | f=open(workpath+'map_depth/readnum_large_'+chrom,'w')
420 | f.write(str(number_read)+'\n')
421 | f.close()
422 | f=open(workpath+'map_depth/splitread_large_'+chrom,'w')
423 | f.write(str(split_num)+'\n')
424 | f.close()
425 | return 0
426 |
427 |
428 | def detect_sortbam_nosv(writepath,chrom,contig_type):
429 | print('Collect info from '+chrom)
430 | samfile=pysam.AlignmentFile(writepath+'read_to_contig.bam',"rb")
431 | allreads=samfile.fetch(chrom,)
432 | totalmaplength=0
433 | number_read=0
434 | split_num=0
435 | for align in allreads:
436 | if align.is_secondary or align.is_supplementary:
437 | continue
438 | totalmaplength+=align.query_length
439 | number_read+=1
440 | if align.has_tag("SA"):
441 | if align.mapping_quality > 50:
442 | split_num+=1
443 |
444 | if totalmaplength!=0:
445 | f=open(writepath+'map_depth/maplength_'+contig_type+'_'+chrom,'w')
446 | f.write(str(totalmaplength)+'\n')
447 | f.close()
448 | f=open(writepath+'map_depth/readnum_'+contig_type+'_'+chrom,'w')
449 | f.write(str(number_read)+'\n')
450 | f.close()
451 | f=open(writepath+'map_depth/splitread_'+contig_type+'_'+chrom,'w')
452 | f.write(str(split_num)+'\n')
453 | f.close()
454 | return 0
455 |
456 |
457 |
458 | def detect_sam_ref(filename,readpath,writepath,min_size,max_size):
459 | f=open(readpath+filename,'r')
460 | c=f.readline()
461 | g=open(writepath+filename[:-4]+'.debreak.temp','w')
462 | lastname=''
463 | segments=['']
464 | unmapped=0
465 | mapped=0
466 | multimap=[]
467 | totalmappedlength=0
468 | while c!='':
469 | #remove headerlines, secondary alignments, alignment on scallfolds
470 | if c[0]=='@' or c.split('\t')[1] not in ['0','16','4','256','272','2048','2064']:
471 | c=f.readline(); continue
472 | if c.split('\t')[1]=='4':
473 | unmapped+=1; c=f.readline(); continue
474 | if c.split('\t')[1] in ['256','272'] :
475 | readname=c.split('\t')[0]
476 | if readname not in multimap:
477 | multimap+=[readname]
478 | c=f.readline(); continue
479 | #detect the deletion from cigar
480 | readname=c.split('\t')[0]
481 | flag=c.split('\t')[1]
482 | chrom=c.split('\t')[2]
483 | position=int(c.split('\t')[3])
484 | mappingquality=c.split('\t')[4]
485 | cigar=c.split('\t')[5]
486 | cigarinfo=cigardeletion_ref('0',chrom,position,cigar,min_size,max_size)
487 |
488 | if flag in ['0','16']:
489 | mapped+=1
490 | totalmappedlength+=cigarinfo[2][1]
491 |
492 |
493 | if cigarinfo[2][1]<10000 or cigarinfo[2][1]<0.01*(sum(cigarinfo[2])):
494 | #if cigarinfo[2][1]<100000 or cigarinfo[2][1]<0.01*(sum(cigarinfo[2])):
495 | #if cigarinfo[2][1]<500000 and cigarinfo[2][1]<0.05*(sum(cigarinfo[2])):
496 |
497 | c=f.readline()
498 | continue
499 |
500 | refend=position+cigarinfo[1]
501 | cimplecigar=str(cigarinfo[2][0])+'\t'+str(cigarinfo[2][1])+'\t'+str(cigarinfo[2][2])
502 | # if primary: write deletions from cigar string
503 | cigarsv=cigarinfo[0]
504 | if int(flag)%32<16:
505 | for d in cigarsv:
506 | g.write(d[0]+'\t'+str(d[1])+'\t'+str(d[2])+'\t'+d[3]+'\t'+readname+'\t'+flag+'\t'+mappingquality+'\t'+str(d[4])+'\n')
507 | else:
508 | totalreadlength=sum(cigarinfo[2])
509 | for d in cigarsv:
510 | if 'I-cigar' in d:
511 | g.write(d[0]+'\t'+str(d[1])+'\t'+str(d[2])+'\t'+d[3]+'\t'+readname+'\t'+flag+'\t'+mappingquality+'\t'+str(totalreadlength-d[4]-d[2])+'\n')
512 | else:
513 | g.write(d[0]+'\t'+str(d[1])+'\t'+str(d[2])+'\t'+d[3]+'\t'+readname+'\t'+flag+'\t'+mappingquality+'\t'+str(totalreadlength-d[4])+'\n')
514 |
515 | readinfo=[readname,flag,chrom,position,refend,cigarinfo[2],mappingquality]
516 | if readname!=lastname:
517 | if 1=0.5*length:
35 | can_count=int(can.split('\t')[3])
36 | position=int((position*count+int(can.split('\t')[1])*can_count)/(count+can_count))
37 | length=int((length*count+int(can.split('\t')[2])*can_count)/(count+can_count))
38 | quality=(quality*count+float(can.split('\t')[4])*can_count)/(count+can_count)
39 | sd=(sd*count+float(can.split('\t')[5])*can_count)/(count+can_count)
40 | readnames+=can.split('\t')[6]+';'
41 | count+=can_count
42 | readnames=readnames[:-1]
43 | ins+=[candi[0].split('\t')[0]+'\t'+str(position)+'\t'+str(length)+'\t'+str(count)+'\t'+str(quality)+'\t'+str(sd)+'\t'+readnames+'\tUnique']
44 | candi=[event]
45 | last=event
46 | return ins
47 |
48 | def sort_mostspupport(a):
49 | return [int(a.split('\t')[3]),int(a.split('\t')[2])]
50 |
51 | def m_samechr_deletion(samechrom):
52 | dels=[]
53 | samechrom.sort(key=mergerpossort)
54 | samechrom+=['last_end\t999999999999\t999999999999\t1\t60\t0']
55 | candi=[]
56 | last=samechrom[0]
57 | for event in samechrom:
58 | maxlen=max(int(event.split('\t')[2]),int(last.split('\t')[2]))
59 | window=max(300,maxlen+10)
60 | window=min(800,window)
61 | if int(event.split('\t')[1]) < int(last.split('\t')[1])+int(last.split('\t')[2])+200 and int(event.split('\t')[1])-int(last.split('\t')[1]) < window:
62 | candi+=[event]
63 | last=event
64 | continue
65 | if len(candi)==1:
66 | dels+=[candi[0]]
67 | else:
68 | position=0; length=0; count=0; quality=0; sd=0; readnames=''
69 | candi.sort(key=sort_mostspupport,reverse=True)
70 | for can in candi:
71 | if int(can.split('\t')[2])>=0.5*length:
72 | can_count=int(can.split('\t')[3])
73 | position=int((position*count+int(can.split('\t')[1])*can_count)/(count+can_count))
74 | length=int((length*count+int(can.split('\t')[2])*can_count)/(count+can_count))
75 | quality=(quality*count+float(can.split('\t')[4])*can_count)/(count+can_count)
76 | sd=(sd*count+float(can.split('\t')[5])*can_count)/(count+can_count)
77 | readnames+=can.split('\t')[6]+';'
78 | count+=can_count
79 | readnames=readnames[:-1]
80 | dels+=[candi[0].split('\t')[0]+'\t'+str(position)+'\t'+str(length)+'\t'+str(count)+'\t'+str(quality)+'\t'+str(sd)+'\t'+readnames+'\tUnique']
81 | candi=[event]
82 | last=event
83 | return dels
84 |
85 | def mergertra(a):
86 | return [a.split('\t')[2],int(a.split('\t')[3])]
87 |
88 | def m_samechr_translocation(samechrom):
89 | samechrom.sort(key=mergertra)
90 | iftrue=0
91 | while iftrue==0:
92 | iftrue=1
93 | for i in range(len(samechrom)-1):
94 | if samechrom[i].split('\t')[2]==samechrom[i+1].split('\t')[2] and abs(int(samechrom[i].split('\t')[3])-int(samechrom[i+1].split('\t')[3]))<=800:
95 | iftrue=0
96 | if samechrom[i].split('\t')[0]==samechrom[i+1].split('\t')[0] and abs(int(samechrom[i].split('\t')[1])-int(samechrom[i+1].split('\t')[1]))<=1000:
97 | count1=int(samechrom[i].split('\t')[4]); count2=int(samechrom[i+1].split('\t')[4])
98 | pos1=(int(samechrom[i].split('\t')[1])*count1+int(samechrom[i+1].split('\t')[1])*count2)//(count1+count2)
99 | pos2=(int(samechrom[i].split('\t')[3])*count1+int(samechrom[i+1].split('\t')[3])*count2)//(count1+count2)
100 | quality=(float(samechrom[i].split('\t')[5])*count1+float(samechrom[i+1].split('\t')[5])*count2)/(count1+count2)
101 | sd1=(float(samechrom[i].split('\t')[6])*count1+float(samechrom[i+1].split('\t')[6])*count2)/(count1+count2)
102 | sd2=(float(samechrom[i].split('\t')[7])*count1+float(samechrom[i+1].split('\t')[7])*count2)/(count1+count2)
103 | readname=samechrom[i].split('\t')[8]+';'+samechrom[i+1].split('\t')[8]
104 | mergedtra=samechrom[i].split('\t')[0]+'\t'+str(pos1)+'\t'+samechrom[i].split('\t')[2]+'\t'+str(pos2)+'\t'+str(count1+count2)+'\t'+str(quality)+'\t'+str(sd1)+'\t'+str(sd2)+'\t'+readname+'\tTranslocation'
105 | samechrom=samechrom[:i]+[mergedtra]+samechrom[i+2:]
106 | else:
107 | count1=int(samechrom[i].split('\t')[4]); count2=int(samechrom[i+1].split('\t')[4])
108 | if count1>=count2:
109 | samechrom.remove(samechrom[i+1])
110 | else:
111 | samechrom.remove(samechrom[i])
112 | break
113 | return samechrom
114 |
115 | def standerd_varition(length):
116 | avelen=sum(length)/float(len(length))
117 | s=0.0
118 | for c in length:
119 | s+=(c-avelen)**2/float(len(length))
120 | s=s**0.5
121 | return s
122 |
123 |
124 | def mergeinfosecpos(a):
125 | return int(a.split('\t')[3])
126 |
127 | def mergeinfo_translocation(candi,min_support):
128 | chrom=candi[0].split('\t')[0]
129 | secchr=[c.split('\t')[2] for c in candi]
130 | secchr=max(set(secchr),key=secchr.count)
131 | candi=[c for c in candi if c.split('\t')[2]==secchr]
132 |
133 | candi.sort(key=mergeinfosecpos)
134 | if len(candi)%2==0:
135 | median=(int(candi[len(candi)//2-1].split('\t')[3])+int(candi[len(candi)//2].split('\t')[3]))//2
136 | else:
137 | median=int(candi[len(candi)//2].split('\t')[3])
138 | candi=[ c for c in candi if abs(int(c.split('\t')[3])-median)<=800]
139 |
140 | if len(candi)>=min_support:
141 | pos1=[int(c.split('\t')[1]) for c in candi]
142 | pos2=[int(c.split('\t')[3]) for c in candi]
143 | qual=[float(c.split('\t')[4]) for c in candi]
144 | sd1=standerd_varition(pos1)
145 | sd2=standerd_varition(pos2)
146 | readnames=''
147 | for c in candi:
148 | readnames+=c.split('\t')[5]+';'
149 | readnames=readnames[:-1]
150 | return [chrom+'\t'+str(sum(pos1)//len(pos1))+'\t'+secchr+'\t'+str(sum(pos2)//len(pos2))+'\t'+str(len(candi))+'\t'+str(sum(qual)/len(qual))+'\t'+str(sd1)+'\t'+str(sd2)+'\t'+readnames+'\tTranslocation']
151 | else:
152 | return []
153 |
154 |
155 | def assign_candi_insertion(candi,mean1,mean2):
156 | group1=[]
157 | group2=[]
158 | for c in candi:
159 | if abs(int(c.split('\t')[2])-mean1)<=abs(mean2-int(c.split('\t')[2])):
160 | group1+=[c]
161 | else:
162 | group2+=[c]
163 | mean1_new=int(sum([int(c.split('\t')[2]) for c in group1])/len(group1))
164 | mean2_new=int(sum([int(c.split('\t')[2]) for c in group2])/len(group2))
165 | return [group1,group2,mean1_new,mean2_new]
166 |
167 |
168 |
169 | def mergeinfolengthsort(a):
170 | return int(a.split('\t')[2])
171 |
172 | def mergeinfo_insertion(candi,min_support):
173 | candi.sort(key=mergeinfolengthsort)
174 |
175 | if len(candi)>=1.5*min_support:
176 | upper=int(candi[len(candi)*3//4].split('\t')[2])
177 | lower=int(candi[len(candi)//4].split('\t')[2])
178 | if upper>3*lower:
179 | svgroups=assign_candi_insertion(candi,upper,lower)
180 | svgroups=assign_candi_insertion(candi,svgroups[2],svgroups[3])
181 | svgroups=assign_candi_insertion(candi,svgroups[2],svgroups[3])
182 | mergedsv=[]
183 | if len(svgroups[0])>=min_support:
184 | mergedsv+=mergeinfo_insertion_oneevent(svgroups[0],min_support)
185 | if len(svgroups[1])>=min_support:
186 | mergedsv+=mergeinfo_insertion_oneevent(svgroups[1],min_support)
187 | if len(mergedsv)==2:
188 | #mergedsv=[c+'\tCompoundSV' for c in mergedsv]
189 | # Test for rescue FP
190 | if int(mergedsv[0].split('\t')[3])>=int(mergedsv[1].split('\t')[3]):
191 | mergedsv=[mergedsv[0]+'\tCompoundSV']
192 | else:
193 | mergedsv=[mergedsv[1]+'\tCompoundSV']
194 | if len(mergedsv)==1:
195 | mergedsv=[mergedsv[0]+'\tUnique']
196 | return mergedsv
197 | mergedsv=mergeinfo_insertion_oneevent(candi,min_support)
198 | if len(mergedsv)==1:
199 | return [mergedsv[0]+'\tUnique']
200 | else:
201 | return []
202 |
203 | def mergeinfo_insertion_oneevent(candi,min_support):
204 | candi.sort(key=mergeinfolengthsort)
205 | while len(candi)>min_support-2:
206 | if int(candi[-1].split('\t')[2]) > 1.5* int(candi[len(candi)//2].split('\t')[2]):
207 | candi.remove(candi[-1])
208 | continue
209 | if int(candi[len(candi)//2].split('\t')[2]) > 1.5*int(candi[0].split('\t')[2]):
210 | candi.remove(candi[0])
211 | continue
212 | break
213 | if len(candi)>=min_support:
214 | chrom=candi[0].split('\t')[0]
215 | position=[int(c.split('\t')[1]) for c in candi]
216 | length=[int(c.split('\t')[2]) for c in candi]
217 | quality=[float(c.split('\t')[3]) for c in candi]
218 | position=sum(position)//len(position)
219 | quality=sum(quality)/float(len(quality))
220 | stand=standerd_varition(length)
221 | length=sum(length)//len(length)
222 | readnames=''
223 | for c in candi:
224 | readnames+=c.split('\t')[4]+';'
225 | readnames=readnames[:-1]
226 | return[chrom+'\t'+str(position)+'\t'+str(length)+'\t'+str(len(candi))+'\t'+str(quality)+'\t'+str(stand)+'\t'+readnames]
227 | else:
228 | return []
229 |
230 |
231 | def counttimesort_tra(a):
232 | return [int(a.split('\t')[1]),int(a.split('\t')[3])]
233 |
234 | def counttime_translocation(samechrom,min_support):
235 | samechrom.sort(key=counttimesort_tra)
236 | samechrtra=[]
237 | start=int(samechrom[0].split('\t')[1])
238 | candi=[]
239 | window=800
240 | for event in samechrom:
241 | if int(event.split('\t')[1])<=start+window:
242 | candi+=[event]
243 | continue
244 | if len(candi)>=min_support:
245 | samechrtra+=mergeinfo_translocation(candi,min_support)
246 | candi=[event]
247 | start=int(event.split('\t')[1])
248 | if len(candi)>=min_support:
249 | samechrtra+=mergeinfo_translocation(candi,min_support)
250 | candi=[]
251 | return samechrtra
252 |
253 |
254 | def counttimesort(a):
255 | return [int(a.split('\t')[1]),int(a.split('\t')[2])]
256 |
257 |
258 | def counttime_insertion(samechrom,min_support):
259 | if samechrom==[]:
260 | return []
261 | samechrom.sort(key=counttimesort)
262 | samechrins=[]
263 | start=int(samechrom[0].split('\t')[1])
264 | candi=[]
265 | inslength=[]
266 | window=100
267 | for event in samechrom:
268 | if int(event.split('\t')[1])<=start+window:
269 | candi+=[event]
270 | inslength+=[int(event.split('\t')[2])]
271 | continue
272 | if window==100:
273 | length=sum(inslength)//len(inslength)
274 | if length<=100:
275 | window=200
276 | if 100500:
279 | window=800
280 | if int(event.split('\t')[1])<=start+window:
281 | candi+=[event]
282 | inslength+=[int(event.split('\t')[2])]
283 | continue
284 | if len(candi)>=min_support:
285 | samechrins+=mergeinfo_insertion(candi,min_support)
286 | candi=[event]
287 | inslength=[int(event.split('\t')[2])]
288 | start=int(event.split('\t')[1])
289 | window=100
290 | if len(candi)>=min_support:
291 | samechrins+=mergeinfo_insertion(candi,min_support)
292 | candi=[]
293 | return samechrins
294 |
295 |
296 |
297 | def counttime_deletion(samechrom,min_support):
298 | if samechrom==[]:
299 | return []
300 | samechrom.sort(key=counttimesort)
301 | samechrdel=[]
302 | start=int(samechrom[0].split('\t')[1])
303 | candi=[]
304 | dellength=[]
305 | window=100
306 | for event in samechrom:
307 | if int(event.split('\t')[1])<=start+window:
308 | candi+=[event]
309 | dellength+=[int(event.split('\t')[2])]
310 | continue
311 | if window==100:
312 | length=sum(dellength)//len(dellength)
313 | if length<=100:
314 | window=200
315 | if 100500:
318 | window=800
319 | if int(event.split('\t')[1])<=start+window:
320 | candi+=[event]
321 | dellength+=[int(event.split('\t')[2])]
322 | continue
323 | if len(candi)>=min_support:
324 | samechrdel+=mergeinfo_insertion(candi,min_support)
325 | candi=[event]
326 | dellength=[int(event.split('\t')[2])]
327 | start=int(event.split('\t')[1])
328 | window=100
329 | if len(candi)>=min_support:
330 | samechrdel+=mergeinfo_insertion(candi,min_support)
331 | candi=[]
332 | return samechrdel
333 |
334 | def merge_deletion(min_support,min_quality,readpath,samechrom_deletion,chrom,svtype,upper_bound):
335 | delt1=time.time()
336 | samechrom_deletion=[c for c in samechrom_deletion if float(c.split('\t')[3])>=min_quality]
337 | if samechrom_deletion==[]:
338 | return True
339 | tt1=time.time()
340 | deletions=counttime_deletion(samechrom_deletion,min_support)
341 | deletions=[c for c in deletions if int(c.split('\t')[3])>=min_support]
342 | f=open(readpath+svtype+'-info-'+chrom,'w')
343 | for d in deletions:
344 | f.write(d+'\n')
345 | f.close()
346 | real=[c for c in deletions if c.split('\t')[-1]=='Unique']
347 | comp=[c for c in deletions if c.split('\t')[-1]=='CompoundSV']
348 | cleaneddels=m_samechr_deletion(real)
349 | cleaneddels+=comp
350 | cleaneddels.sort(key=counttimesort)
351 | if upper_bound:
352 | cleaneddels=[c for c in cleaneddels if min_support<=int(c.split('\t')[3])<=min_support*30]
353 | else:
354 | cleaneddels=[c for c in cleaneddels if min_support<=int(c.split('\t')[3])]
355 | f=open(readpath+svtype+'-merged-'+chrom,'w')
356 | if svtype=='del':
357 | sv_type='Deletion'
358 | if svtype=='dup':
359 | sv_type='Duplication'
360 | if svtype=='inv':
361 | sv_type='Inversion'
362 | merged_result=[]
363 | for d in cleaneddels:
364 | f.write(d+'\t'+sv_type+'\n')
365 | merged_result+=[d+'\t'+sv_type]
366 | f.close()
367 | delt2=time.time()
368 | return merged_result
369 |
370 | def merge_insertion(min_support,min_quality,readpath,samechrom_insertion,chrom,svtype,upper_bound):
371 | samechrom_insertion=[c for c in samechrom_insertion if float(c.split('\t')[3])>=min_quality]
372 | if samechrom_insertion==[]:
373 | return True
374 |
375 | inst1=time.time()
376 | insertions=counttime_insertion(samechrom_insertion,min_support)
377 | f=open(readpath+svtype+'-info-'+chrom,'w')
378 | for d in insertions:
379 | f.write(d+'\n')
380 | f.close()
381 | real=[c for c in insertions if c.split('\t')[-1]=='Unique']
382 | compound=[c for c in insertions if c.split('\t')[-1]=='CompoundSV']
383 | cleanedins=m_samechr_insertion(real)
384 | cleanedins+=compound
385 | cleanedins.sort(key=counttimesort)
386 |
387 | if upper_bound:
388 | cleanedins=[c for c in cleanedins if min_support<=int(c.split('\t')[3])<=30*min_support]
389 | else:
390 | cleanedins=[c for c in cleanedins if min_support<=int(c.split('\t')[3])]
391 | f=open(readpath+svtype+'-merged-'+chrom,'w')
392 | if svtype=='ins':
393 | sv_type='Insertion'
394 | if svtype=='inv':
395 | sv_type='Inversion'
396 | merged_result=[]
397 | for d in cleanedins:
398 | f.write(d+'\t'+sv_type+'\n')
399 | merged_result+=[d+'\t'+sv_type]
400 | f.close()
401 | inst2=time.time()
402 |
403 | return merged_result
404 |
405 | def finalsorttra(a):
406 | return [a.split('\t')[0],int(a.split('\t')[1])]
407 |
408 | def merge_translocation(min_support,min_qual,readpath,samechrom_translocation,chrom,upper_bound):
409 | samechrom_translocation=[c for c in samechrom_translocation if float(c.split('\t')[4])>=min_qual]
410 | if samechrom_translocation==[]:
411 | return True
412 | trat1=time.time()
413 | translocations=counttime_translocation(samechrom_translocation,min_support)
414 | translocations=m_samechr_translocation(translocations)
415 | if upper_bound:
416 | translocations=[c for c in translocations if int(c.split('\t')[4])<=30*min_support]
417 |
418 | translocations.sort(key=finalsorttra)
419 | merged_result=[]
420 | f=open(readpath+'tra-merged-'+chrom,'w')
421 | for d in translocations:
422 | f.write(d+'\n')
423 | merged_result+=[d]
424 | f.close()
425 | trat2=time.time()
426 | return merged_result
427 |
428 |
429 |
430 |
431 | def genotype(depth,outpath):
432 | highcov=depth*2
433 | allae=open(outpath+'assembly_errors.bed','r').read().split('\n')[:-1]
434 |
435 | samfile=pysam.AlignmentFile(outpath+'read_to_contig.bam',"rb")
436 | f=open(outpath+'assembly_errors.bed-gt_test','w')
437 | coll=0
438 | expan=0
439 | inv=0
440 |
441 | for c in allae:
442 | chrom=c.split('\t')[0]
443 | start=int(c.split('\t')[1])
444 | stop=int(c.split('\t')[2])
445 | #print (c)
446 | if start<0:
447 | continue
448 | if 'Expansion' in c or 'Inversion' in c:
449 | leftcov=samfile.count(chrom,max(start-100,0),start,read_callback='all')
450 | rightcov=samfile.count(chrom,stop,stop+100,read_callback='all')
451 | #if leftcov>highcov and rightcov>highcov:
452 | # continue
453 | if 'Collapse' in c:
454 | leftcov=samfile.count(chrom,max(0,start-100),start,read_callback='all')
455 | rightcov=samfile.count(chrom,stop,stop+100,read_callback='all')
456 | # if leftcov>highcov and rightcov>highcov:
457 | # continue
458 | if 'Expan' in c:
459 | expan+=1
460 | if 'Coll' in c:
461 | coll+=1
462 | if 'Inv' in c:
463 | inv+=1
464 | gtinfo='./.'
465 | if int(c.split('\t')[3])>=0.6*min(leftcov,rightcov):
466 | gtinfo='1/1'
467 | else:
468 | gtinfo='1/0'
469 |
470 | #f.write(c+'\t'+gtinfo+'\n')
471 | f.write(c+'\t'+gtinfo+'\t'+str(leftcov)+'\t'+str(rightcov)+'\t'+str(min(rightcov,leftcov))+'\n')
472 | f.close()
473 | '''
474 | f=open(outpath+'summary_statistics','a')
475 | f.write('After Genotyping:\n')
476 | f.write('Number of assembly collapse\t'+str(coll)+'\n')
477 | f.write('Number of assembly expansion\t'+str(expan)+'\n')
478 | f.write('Number of assembly inversion\t'+str(inv)+'\n\n\n')
479 | '''
480 |
481 |
482 |
483 | '''
484 | f.write('Number of assembly collapse in large contigs\t'+str(coll_large)+'\n')
485 | f.write('Number of assembly expansion in large contigs\t'+str(expan_large)+'\n')
486 | f.write('Number of assembly inversion in large contigs\t'+str(inv_large)+'\n\n\n')
487 | '''
488 | f.close()
489 | return True
490 |
491 |
492 |
493 |
--------------------------------------------------------------------------------
/debreak_merge_clustering.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import sys
4 | import pysam
5 |
6 | def mergeinfolengthsort(a):
7 | return int(a.split('\t')[2])
8 |
9 | def mergeinfo_insertion(candi,min_support):
10 | candi.sort(key=mergeinfolengthsort)
11 |
12 | if len(candi)>=1.5*min_support:
13 | upper=int(candi[len(candi)*3//4].split('\t')[2])
14 | lower=int(candi[len(candi)//4].split('\t')[2])
15 | if upper>1.75*lower and upper-lower>50:
16 | svgroups=assign_candi_insertion(candi,upper,lower)
17 | svgroups=assign_candi_insertion(candi,svgroups[2],svgroups[3])
18 | svgroups=assign_candi_insertion(candi,svgroups[2],svgroups[3])
19 | mergedsv=[]
20 | if len(svgroups[0])>=min_support:
21 | mergedsv+=mergeinfo_insertion_oneevent(svgroups[0],min_support)
22 | if len(svgroups[1])>=min_support:
23 | mergedsv+=mergeinfo_insertion_oneevent(svgroups[1],min_support)
24 | if len(mergedsv)==2:
25 | mergedsv=[c+'\tCompoundSV' for c in mergedsv]
26 | if len(mergedsv)==1:
27 | mergedsv=[mergedsv[0]+'\tUnique']
28 | return mergedsv
29 | mergedsv=mergeinfo_insertion_oneevent(candi,min_support)
30 | if len(mergedsv)==1:
31 | return [mergedsv[0]+'\tUnique']
32 | else:
33 | return []
34 |
35 | def assign_candi_insertion(candi,mean1,mean2):
36 | group1=[]
37 | group2=[]
38 | for c in candi:
39 | if abs(int(c.split('\t')[2])-mean1)<=abs(mean2-int(c.split('\t')[2])):
40 | group1+=[c]
41 | else:
42 | group2+=[c]
43 | mean1_new=sum([int(c.split('\t')[2]) for c in group1])//len(group1)
44 | mean2_new=sum([int(c.split('\t')[2]) for c in group2])//len(group2)
45 | return [group1,group2,mean1_new,mean2_new]
46 |
47 |
48 | def mergeinfo_insertion_oneevent(candi,min_support):
49 | candi.sort(key=mergeinfolengthsort)
50 | min_support=max(2,min_support)
51 | while len(candi)>max(2,min_support-2):
52 | if int(candi[-1].split('\t')[2]) > 2* int(candi[len(candi)//2].split('\t')[2]) and int(candi[-1].split('\t')[2]) -int(candi[len(candi)//2].split('\t')[2]) >30:
53 | candi.remove(candi[-1])
54 | continue
55 | if int(candi[len(candi)//2].split('\t')[2]) > 2*int(candi[0].split('\t')[2]) and int(candi[len(candi)//2].split('\t')[2]) -int(candi[0].split('\t')[2]) >30:
56 | candi.remove(candi[0])
57 | continue
58 | break
59 | if len(candi)>=max(2,min_support):
60 | chrom=candi[0].split('\t')[0]
61 | position=[int(c.split('\t')[1]) for c in candi]
62 | length=[int(c.split('\t')[2]) for c in candi]
63 | quality=[float(c.split('\t')[6]) for c in candi]
64 | position=sum(position)//len(position)
65 | quality=sum(quality)/float(len(quality))
66 | length=sum(length)//len(length)
67 | readnames=''
68 | for c in candi:
69 | readnames+=c.split('\t')[4]+';'
70 | readnames=readnames[:-1]
71 | numread=len(readnames.split(';'))
72 | return[chrom+'\t'+str(position)+'\t'+str(length)+'\t'+str(len(candi))+'\t'+str(numread)+'\t'+str(quality)+'\t'+'\t'+readnames]
73 | else:
74 | return []
75 |
76 | def counttimesort(a):
77 | return [int(a.split('\t')[1]),int(a.split('\t')[2])]
78 |
79 | def cluster(outpath,chrom,contiglength,mins,maxdepth):
80 | allsv=open(outpath+'debreak_workspace/read_to_contig_'+chrom+'.debreak.temp','r').read().split('\n')[:-1]
81 |
82 | # Large DEL
83 | largesv=[c for c in allsv if 'D-' in c and int(c.split('\t')[2])>2000]
84 | window=1600
85 | largesv.sort(key=counttimesort)
86 | largedel=[]
87 | start=0
88 | candi=[]
89 | for event in largesv:
90 | if int(event.split('\t')[1])<=start+window:
91 | candi+=[event]
92 | continue
93 | if len(candi)>=mins:
94 | largedel+=mergeinfo_insertion(candi,mins)
95 | candi=[event]
96 | start=int(event.split('\t')[1])
97 | if len(candi)>=mins:
98 | largedel+=mergeinfo_insertion(candi,mins)
99 | candi=[]
100 |
101 | #smaller DEL
102 | allsv=[c for c in allsv if 'D-' in c and int(c.split('\t')[2])<=3000]
103 | genomeposition=[0]*contiglength
104 |
105 | for c in allsv:
106 | start=int(c.split('\t')[1])
107 | end=int(c.split('\t')[1])+int(c.split('\t')[2])
108 | original=genomeposition[start-1:end-1]
109 | new=[mm+1 for mm in original]
110 | genomeposition[start-1:end-1]=new
111 | svregion=[]
112 | inblock=False
113 | threshold=3
114 |
115 | for i in range(len(genomeposition)):
116 | if inblock:
117 | if genomeposition[i]>=max(maxdep/10.0,threshold):
118 | localdep+=[genomeposition[i]]
119 | if genomeposition[i]>maxdep:
120 | maxdep=genomeposition[i]
121 | else:
122 | inblock=False
123 | end=i
124 | if maxdep<=maxdepth:
125 | peakpos=localdep.index(maxdep)
126 | peakleftsize=0
127 | for i in range(peakpos):
128 | if localdep[peakpos-i-1]>=maxdep/10.0:
129 | peakleftsize+=1
130 | else:
131 | break
132 | svregion+=[(start+peakpos-peakleftsize,end,maxdep)]
133 | start=0
134 | end=0
135 | maxdep=0
136 |
137 | else:
138 | if genomeposition[i] > threshold:
139 | inblock=True
140 | localdep=[genomeposition[i]]
141 | start=i
142 | maxdep=genomeposition[i]
143 |
144 | svregion=[c for c in svregion if c[2] < maxdepth]
145 | allsvinfo={}
146 | for c in svregion:
147 | allsvinfo[c]=[]
148 |
149 | for c in allsv:
150 | start=int(c.split('\t')[1])
151 | end=start+int(c.split('\t')[2])
152 | for d in svregion:
153 | if min(end,d[1])-max(d[0],start)>0:
154 | allsvinfo[d]+=[c]
155 |
156 | sv=[]
157 | for c in svregion:
158 | svinfo=allsvinfo[c]
159 | sv+=mergeinfo_insertion(svinfo,mins)
160 |
161 | newsv=[]
162 | for c in largedel:
163 | testif=0
164 | for d in sv:
165 | if min(int(c.split('\t')[1])+int(c.split('\t')[2]), int(d.split('\t')[1])+int(d.split('\t')[2])) - max(int(c.split('\t')[1]),int(d.split('\t')[1]))>0 and 0.8*int(d.split('\t')[2])<=int(c.split('\t')[2])<=int(d.split('\t')[2])/0.8:
166 | testif=1; break
167 | if testif==0:
168 | newsv+=[c]
169 | newsv+=sv
170 | newsv.sort(key=counttimesort)
171 |
172 |
173 | if newsv==[]:
174 | return 0
175 |
176 | f=open(outpath+'ae_merge_workspace/del_merged_'+chrom,'w')
177 | for c in newsv:
178 | f.write(c+'\n')
179 | f.close()
180 |
181 | return 0
182 |
183 |
184 |
185 | def cluster_ins(outpath,chrom,contiglength,mins,maxdepth,svtype):
186 | allsv=open(outpath+'debreak_workspace/read_to_contig_'+chrom+'.debreak.temp','r').read().split('\n')[:-1]
187 |
188 | # Large INS
189 | if svtype=='ins':
190 | largesv=[c for c in allsv if 'I-' in c and int(c.split('\t')[2])>2000]
191 | else:
192 | largesv=[c for c in allsv if 'INV-' in c and int(c.split('\t')[2])>2000]
193 |
194 | window=1600
195 | largesv.sort(key=counttimesort)
196 | largedel=[]
197 | start=0
198 | candi=[]
199 | for event in largesv:
200 | if int(event.split('\t')[1])<=start+window:
201 | candi+=[event]
202 | continue
203 | if len(candi)>=mins:
204 | largedel+=mergeinfo_insertion(candi,mins)
205 | candi=[event]
206 | start=int(event.split('\t')[1])
207 | if len(candi)>=mins:
208 | largedel+=mergeinfo_insertion(candi,mins)
209 | candi=[]
210 |
211 | # Small INS
212 | if svtype=='ins':
213 | allsv=[c for c in allsv if 'I-' in c and int(c.split('\t')[2])<=3000]
214 | else:
215 | allsv=[c for c in allsv if 'INV-' in c and int(c.split('\t')[2])<=3000]
216 |
217 | genomeposition=[0]*contiglength
218 |
219 | for c in allsv:
220 | start=int(c.split('\t')[1])-100
221 | end=int(c.split('\t')[1])+100
222 | original=genomeposition[start-1:end-1]
223 | new=[mm+1 for mm in original]
224 | genomeposition[start-1:end-1]=new
225 |
226 | svregion=[]
227 | inblock=False
228 | threshold=3
229 |
230 | for i in range(len(genomeposition)):
231 | if inblock:
232 | if genomeposition[i]>=max(maxdep/10.0,threshold):
233 | localdep+=[genomeposition[i]]
234 | if genomeposition[i]>maxdep:
235 | maxdep=genomeposition[i]
236 | else:
237 | inblock=False
238 | end=i
239 | if maxdep<=maxdepth:
240 | peakpos=localdep.index(maxdep)
241 | peakleftsize=0
242 | for i in range(peakpos):
243 | if localdep[peakpos-i-1]>=maxdep/10.0:
244 | peakleftsize+=1
245 | else:
246 | break
247 | svregion+=[(start+peakpos-peakleftsize,end,maxdep)]
248 | start=0;end=0;maxdep=0
249 |
250 | else:
251 | if genomeposition[i] > threshold:
252 | inblock=True
253 | localdep=[genomeposition[i]]
254 | start=i
255 | maxdep=genomeposition[i]
256 |
257 | svregion=[c for c in svregion if c[2] < maxdepth]
258 | allsvinfo={}
259 | for c in svregion:
260 | allsvinfo[c]=[]
261 |
262 | for c in allsv:
263 | start=int(c.split('\t')[1])-50
264 | end=start+100
265 | for d in svregion:
266 | if min(end,d[1])-max(d[0],start)>0:
267 | allsvinfo[d]+=[c]
268 | sv=[]
269 | for c in svregion:
270 | svinfo=allsvinfo[c]
271 | mergedins=mergeinfo_insertion(svinfo,mins)
272 | for m in mergedins:
273 | sv+=[m+'\t'+chrom+'\t'+str(c[0])+'\t'+str(c[1])+'\t'+str(c[2])]
274 |
275 | newsv=[]
276 | for c in largedel:
277 | testif=0
278 | for d in sv:
279 | if min(int(c.split('\t')[1])+int(c.split('\t')[2]), int(d.split('\t')[1])+int(d.split('\t')[2])) - max(int(c.split('\t')[1]),int(d.split('\t')[1]))>0 and 0.8*int(d.split('\t')[2])<=int(c.split('\t')[2])<=int(d.split('\t')[2])/0.8:
280 | testif=1; break
281 | if testif==0:
282 | newsv+=[c]
283 | newsv+=sv
284 | newsv.sort(key=counttimesort)
285 |
286 | if newsv==[]:
287 | return 0
288 |
289 | if svtype=='ins':
290 | f=open(outpath+'ae_merge_workspace/ins_merged_'+chrom,'w')
291 | else:
292 | f=open(outpath+'ae_merge_workspace/inv_merged_'+chrom,'w')
293 |
294 | for c in newsv:
295 | f.write(c+'\n')
296 | f.close()
297 | return 0
298 |
299 |
300 | def genotype(depth,outpath):
301 | highcov=depth*2
302 | allae=open(outpath+'assembly_errors.bed','r').read().split('\n')[:-1]
303 | samfile=pysam.AlignmentFile(outpath+'read_to_contig.bam',"rb")
304 | f=open(outpath+'assembly_errors.bed-gt','w')
305 | coll=0;expan=0;inv=0
306 |
307 | for c in allae:
308 | chrom=c.split('\t')[0]
309 | start=int(c.split('\t')[1])
310 | stop=int(c.split('\t')[2])
311 | if start<0:
312 | continue
313 | if 'Expansion' in c or 'Inversion' in c:
314 | leftcov=samfile.count(chrom,max(start-100,0),start,read_callback='all')
315 | rightcov=samfile.count(chrom,stop,stop+100,read_callback='all')
316 | if 'Collapse' in c:
317 | leftcov=samfile.count(chrom,max(0,start-100),start,read_callback='all')
318 | rightcov=samfile.count(chrom,stop,stop+100,read_callback='all')
319 | gtinfo='./.'
320 | if int(c.split('\t')[3])>=0.6*min(leftcov,rightcov):
321 | gtinfo='1/1'
322 | else:
323 | gtinfo='1/0'
324 | f.write(c+'\t'+gtinfo+'\t'+str(leftcov)+'\t'+str(rightcov)+'\t'+str(min(rightcov,leftcov))+'\n')
325 | f.close()
326 |
327 | def filterae(depth,outpath,min_size,datatype):
328 | allsv=open(outpath+'assembly_errors.bed-gt','r').read().split('\n')[:-1]
329 | if datatype=='hifi':
330 | rat=0.8
331 | else:
332 | rat=0.7
333 |
334 | highcov=depth*2
335 | lowcov=depth/2
336 | exp=[c for c in allsv if 'Exp' in c]
337 | col=[c for c in allsv if 'Col' in c]
338 | inv=[c for c in allsv if 'Inv' in c]
339 | new=[]
340 | exponly=[]
341 | for i in range(len(exp)):
342 | c=exp[i].split('\t')
343 | testif=0
344 | for d in col:
345 | if c[0]==d.split('\t')[0] and int(c[1])-250<=int(d.split('\t')[1])<=250+int(c[2]) and int(c[5].split('=')[1])<20*int(d.split('\t')[5].split('=')[1]):
346 | testif=1
347 | expread=c[6].split(';');goodexp=len(list(dict.fromkeys((expread))))
348 | colread=d.split('\t')[6].split(';'); goodcol=len(list(dict.fromkeys((colread))))
349 | totaln=len(list(dict.fromkeys((expread+colread))))
350 | if 0.33<=int(c[3])/float(d.split('\t')[3])<=3:
351 | if totalncolsize+min_size:
355 | new+=[c[0]+'\t'+c[1]+'\t'+c[2]+'\t'+str(goodexp)+'\t'+c[4]+'\tSize='+str(expsize-colsize)+'\t'+c[7]+'\t'+c[8]+'\t'+c[9]+'\t'+c[10]+'\t'+';'.join(expread)]
356 | if expsizeint(c[3])/float(d.split('\t')[3]):
363 | dd=d.split('\t')
364 | new+=[c[0]+'\t'+dd[1]+'\t'+dd[2]+'\t'+str(goodcol)+'\tCollapse\tSize='+str(int(c[2])-int(c[1]))+';'+d.split('\t')[5].split('=')[1]+'\t-/-\t'+dd[8]+'\t'+dd[9]+'\t'+dd[10]+'\t'+';'.join(colread)]
365 | if int(c[3])/float(d.split('\t')[3])>3:
366 | new+=[c[0]+'\t'+c[1]+'\t'+c[2]+'\t'+str(goodexp)+'\tExpansion\tSize='+str(int(c[2])-int(c[1]))+';'+d.split('\t')[5].split('=')[1]+'\t-/-\t'+c[8]+'\t'+c[9]+'\t'+c[10]+'\t'+';'.join(expread)]
367 | col.remove(d);break
368 | if testif==0:
369 | exponly+=[exp[i]]
370 | allsv=new
371 | for c in exponly+col:
372 | c=c.split('\t')
373 | expread=c[6].split(';');goodexp=len(list(dict.fromkeys((expread))))
374 | allsv+=[c[0]+'\t'+c[1]+'\t'+c[2]+'\t'+str(goodexp)+'\t'+c[4]+'\t'+c[5]+'\t'+c[7]+'\t'+c[8]+'\t'+c[9]+'\t'+c[10]+'\t'+';'.join(expread)]
375 |
376 | for c in inv:
377 | c=c.split('\t')
378 | expread=c[5].split(';');goodexp=len(list(dict.fromkeys((expread))))
379 | allsv+=[c[0]+'\t'+c[1]+'\t'+c[2]+'\t'+str(goodexp)+'\t'+c[4]+'\tSize='+str(int(c[2])-int(c[1]))+'\t'+c[6]+'\t'+c[7]+'\t'+c[8]+'\t'+c[9]+'\t'+';'.join(expread)]
380 | new=[]
381 | for c in allsv:
382 | if max([int(mm) for mm in c.split('\t')[5].split('=')[1].split(';')])=10 and int(c.split('\t')[3])>=rat*int(c.split('\t')[9]) and lowcov<=int(c.split('\t')[9])num:
11 | val=c
12 | return val
13 |
14 | def getsnv(path,chrom,mincount,maxcov,mindepth):
15 | logf=open(path+'Inspector.log','a')
16 | logf.write('Start small-scale error detection for '+chrom+'\n')
17 | logf.close()
18 | g=open(path+'base_error_workspace/baseerror_'+chrom+'.bed','w')
19 | os.system('samtools mpileup -Q 0 '+path+'read_to_contig.bam -r '+chrom+' -o '+path+'base_error_workspace/base_'+chrom+'.pileup -f '+path+'valid_contig.fa')
20 | f=open(path+'base_error_workspace/base_'+chrom+'.pileup','r')
21 | a=f.readline()
22 | numbaseerror=0
23 | validctgbase=0
24 | if mindepth==False and type(mindepth)==bool:
25 | mindepth=maxcov/10.0
26 |
27 | while a!='':
28 | if a.split('\t')[2]!='N' and mindepth<=int(a.split('\t')[3]) <=maxcov:
29 | validctgbase+=1
30 | if int(a.split('\t')[3]) maxcov:
31 | a=f.readline(); continue
32 | info=a.split('\t')[4]
33 | info=info.replace(',','.')
34 | info=re.sub('\^.','',info)
35 | info=info.replace('a','A')
36 | info=info.replace('t','T')
37 | info=info.replace('c','C')
38 | info=info.replace('g','G')
39 | depth=int(a.split('\t')[3])-info.count('*')
40 | min_supp=max(mincount,depth*0.2)
41 | ins=info.count('+')
42 | dels=info.count('-')
43 | ifindel=False
44 | if ins>=min_supp:
45 | ifindel=True;
46 | insinfp=info.split('+')[1:]
47 | insseq=[]
48 | for m in insinfp:
49 | num='';inum=0
50 | for dd in m:
51 | if dd in '1234567890':
52 | num+=dd; inum+=1
53 | else:
54 | break
55 | if int(num)<=mincount//2:
56 | insseq+=[m[inum:][:int(num)]]
57 | else:
58 | ins-=1
59 | if ins>=min_supp :
60 | mostf1=find2(insseq)
61 | numbaseerror+=1
62 | g.write(a.split('\t')[0]+'\t'+str(int(a.split('\t')[1])-1)+'\t'+a.split('\t')[1]+'\t-\t'+mostf1+'\t'+str(ins)+'\t'+str(depth)+'\tSmallCollapse\n')
63 | if dels>=min_supp:
64 | ifindel=True;
65 | insinfp=info.split('-')[1:]
66 | insseq=[]
67 | for m in insinfp:
68 | num='';inum=0
69 | for dd in m:
70 | if dd in '1234567890':
71 | num+=dd; inum+=1
72 | else:
73 | break
74 | if int(num)<=mincount//2:
75 | insseq+=[m[inum:][:int(num)]]
76 | else:
77 | dels-=1
78 | if dels>=min_supp:
79 | mostf1=find2(insseq)
80 | numbaseerror+=1
81 | g.write(a.split('\t')[0]+'\t'+str(int(a.split('\t')[1])-1)+'\t'+str(int(a.split('\t')[1])+len(mostf1)-1)+'\t'+mostf1+'\t-\t'+str(dels)+'\t'+str(depth)+'\tSmallExpansion\n')
82 |
83 | if info.count('.')+info.count('*')>0.8*int(a.split('\t')[3]) :
84 | a=f.readline(); continue
85 | acount=info.count('A')
86 | tcount=info.count('T')
87 | ccount=info.count('C')
88 | gcount=info.count('G')
89 |
90 | if '+' in a or '-' in info:
91 | insseq=''
92 | if '+' in info:
93 | insinfp=info.split('+')[1:]
94 | for m in insinfp:
95 | num=''
96 | inum=0
97 | for dd in m:
98 | if dd in '1234567890':
99 | num+=dd; inum+=1
100 | else:
101 | break
102 | insseq+=m[inum:][:int(num)]
103 | if '-' in info:
104 | insinfp=info.split('-')[1:]
105 | for m in insinfp:
106 | num=''
107 | inum=0
108 | for dd in m:
109 | if dd in '1234567890':
110 | num+=dd; inum+=1
111 | else:
112 | break
113 | insseq+=m[inum:][:int(num)]
114 |
115 | insacount=insseq.count('A')
116 | instcount=insseq.count('T')
117 | insccount=insseq.count('C')
118 | insgcount=insseq.count('G')
119 |
120 | acount-=insacount; tcount-=instcount; ccount-=insccount; gcount-=insgcount
121 |
122 | if max(acount,tcount,ccount,gcount) >=min_supp:
123 | if max(acount,tcount,ccount,gcount)==acount:
124 | altbase='A'
125 | if max(acount,tcount,ccount,gcount)==tcount:
126 | altbase='T'
127 | if max(acount,tcount,ccount,gcount)==ccount:
128 | altbase='C'
129 | if max(acount,tcount,ccount,gcount)==gcount:
130 | altbase='G'
131 | numbaseerror+=1
132 | g.write(a.split('\t')[0]+'\t'+str(int(a.split('\t')[1])-1)+'\t'+a.split('\t')[1]+'\t'+a.split('\t')[2]+'\t'+altbase+'\t'+str(max(acount,tcount,ccount,gcount))+'\t'+str(depth)+'\tBaseSubstitution\n')
133 |
134 | a=f.readline()
135 | f.close()
136 | os.system('rm '+path+'base_error_workspace/base_'+chrom+'.pileup')
137 | g.close()
138 | if numbaseerror==0:
139 | os.system('rm '+path+'base_error_workspace/baseerror_'+chrom+'.bed')
140 | f=open(path+'base_error_workspace/validbase','a')
141 | f.write(str(validctgbase)+'\n')
142 | f.close()
143 | return 0
144 |
145 |
146 | def count_baseerrror(path,ctgtotallen,datatype,ave_depth):
147 | os.system('cat '+path+'base_error_workspace/baseerror_*bed > '+path+'base_error_workspace/allbaseerror.bed')
148 | allsnv=open(path+'base_error_workspace/allbaseerror.bed','r').read().split('\n')[:-1]
149 | snv=0;indelins=0;indeldel=0
150 |
151 | baseerror=[]
152 | iii=0
153 | if datatype=='hifi':
154 | propvalue=0.5
155 | pcutoff=0.01
156 | readcutoff=0.75
157 | if ave_depth<25:
158 | pcutoff=0.02
159 | if ave_depth<15:
160 | pcutoff=0.1
161 | else:
162 | propvalue=0.4
163 | pcutoff=0.05
164 | readcutoff=0.5
165 | if ave_depth<25:
166 | pcutoff=0.1
167 |
168 |
169 | allpvalue=[]
170 |
171 | for c in allsnv:
172 | p=0
173 | nread=int(c.split('\t')[5])
174 | depth=int(c.split('\t')[6])
175 | if nread0:
30 | bad+=[snpset[i],snpset[i+1]]
31 | snpset=[c for c in snpset if c not in bad]
32 | cutposinfo=[]
33 | if snpset==[]:
34 | return (ctgseq,snpset)
35 | for i in range(len(snpset)):
36 | cutinfo=[0,0,'']
37 | if i>0:
38 | cutinfo[0]=get_snpcut_start(snpset[i-1])
39 | cutinfo[1]=get_snpcut_end(snpset[i])
40 | if 'SmallExpansion' == snpset[i].split('\t')[7]:
41 | cutinfo[2]=''
42 | elif snpset[i].split('\t')[7]== 'BaseSubstitution' or snpset[i].split('\t')[7]== 'SmallCollapse':
43 | cutinfo[2]=snpset[i].split('\t')[4]
44 | else:
45 | print ('Warning: Possible error in small-error correction.')
46 | cutposinfo+=[cutinfo]
47 | newseq=''
48 | for cutinfo in cutposinfo:
49 | newseq+=ctgseq[cutinfo[0]:cutinfo[1]]+cutinfo[2]
50 | newseq+=ctgseq[get_snpcut_start(snpset[-1]):]
51 | t2=time.time()
52 | print ('Base error correction for ',ctg,' finished. Time cost: ',t2-t1)
53 | return (newseq,snpset)
54 |
55 | def call_flye_timeout(datatype,outpath,aeinfo,outtime):
56 | testp = multiprocessing.dummy.Pool(1)
57 | testres = testp.apply_async(call_flye, args=(datatype,outpath,aeinfo))
58 | try:
59 | testout = testres.get(outtime) # Wait timeout seconds for func to complete.
60 | return testout
61 | except multiprocessing.TimeoutError:
62 | print ('Flye assembly time out for ',aeinfo)
63 | raise
64 |
65 | def call_flye(datatype,outpath,aeinfo):
66 | tt0=time.time()
67 | os.system('flye --'+datatype+' '+outpath+'assemble_workspace/read_ass_'+aeinfo+'.fa -o '+outpath+'assemble_workspace/flye_out_'+aeinfo+'/ -t 4 ')
68 | tt1=time.time()
69 | print ('FLYETIME for ',aeinfo,tt1-tt0)
70 | return 0
71 |
72 |
73 | def findpos(aeset,snpset,bamfile,outpath,datatype,thread,outtime):
74 | snpsetshift=[c for c in snpset if 'Small' in c]
75 | snpsetshift.sort(key=sort_snp)
76 | new=[]
77 | bam=pysam.AlignmentFile(bamfile,'rb')
78 | ctg=aeset[0].split('\t')[0]
79 | aeinfolist={}
80 | for c in aeset:
81 | if 'Inversion' in c:
82 | continue
83 | if 'HaplotypeSwitch' in c:
84 | if int(c.split('\t')[11].split(';')[0])>=int(c.split('\t')[11].split(';')[1]):
85 | readgroup=c.split('\t')[10].split(':')[0].split(';')
86 | aestart=int(c.split('\t')[1].split(';')[0])
87 | aeend=int(c.split('\t')[2].split(';')[0])
88 | aesize=c.split('\t')[5].split('=')[1].split(';')[0]
89 | aeinfo=ctg+'__'+str(aestart)+'__'+str(aeend)+'__'+str(aesize)+'__exp'
90 | aeinfolist[c]=aeinfo
91 | else:
92 | readgroup=c.split('\t')[10].split(':')[1].split(';')
93 | aestart=int(c.split('\t')[1].split(';')[1])
94 | aeend=int(c.split('\t')[2].split(';')[1])
95 | aesize=c.split('\t')[5].split('=')[1].split(';')[1]
96 | aeinfo=ctg+'__'+str(aestart)+'__'+str(aeend)+'__'+str(aesize)+'__col'
97 | aeinfolist[c]=aeinfo
98 | else:
99 | readgroup=c.split('\t')[10].split(';')
100 | aestart=int(c.split('\t')[1])
101 | aeend=int(c.split('\t')[2])
102 | aesize=c.split('\t')[5].split('=')[1].split(';')[0]
103 | aeinfo=ctg+'__'+str(aestart)+'__'+str(aeend)+'__'+str(aesize)+'__exp' if 'Exp' in c else ctg+'__'+str(aestart)+'__'+str(aeend)+'__'+str(aesize)+'__col'
104 | aeinfolist[c]=aeinfo
105 |
106 | f=open(outpath+'assemble_workspace/read_ass_'+aeinfo+'.fa','w')
107 | allread=bam.fetch(ctg,max(0,aestart-2000),aeend+2000)
108 | iii=0
109 | for read in allread:
110 | if read.query_name not in readgroup or read.flag>16:
111 | continue
112 | f.write('>'+read.query_name+'\n'+read.query_sequence+'\n')
113 | iii+=1
114 | f.close()
115 |
116 | flyerun=multiprocessing.Pool(thread)
117 | for c in aeinfolist:
118 | aeinfo=aeinfolist[c]
119 | flyerun.apply_async(call_flye_timeout,args=(datatype,outpath,aeinfo,outtime))
120 | flyerun.close()
121 | flyerun.join()
122 |
123 | for c in aeinfolist:
124 | aeinfo=aeinfolist[c]
125 | try:
126 | allctg=open(outpath+'assemble_workspace/flye_out_'+aeinfo+'/assembly.fasta','r').read().split('>')[1:]
127 | except:
128 | allctg=[]
129 | print ('Inspector Assembly Fail ' ,aeinfo)
130 | os.system('rm -rf '+outpath+'assemble_workspace/flye_out_'+aeinfo+'/')
131 | continue
132 | if len(allctg)==1:
133 | f=open(outpath+'assemble_workspace/new_contig_'+ctg+'.fa','a')
134 | newassseq=''.join(allctg[0].split('\n')[1:-1])
135 | f.write('>'+aeinfo+'__newctg\n'+newassseq+'\n')
136 | f.close()
137 | else:
138 | print ('Inspector Multi/No Alignment ' ,aeinfo)
139 | os.system('rm -rf '+outpath+'assemble_workspace/flye_out_'+aeinfo+'/')
140 |
141 | shiftpos=0
142 | for d in snpsetshift:
143 | if int(d.split('\t')[2])<=aestart:
144 | if 'SmallCollapse' in d:
145 | shiftpos+=len(d.split('\t')[4])
146 | else:
147 | shiftpos-=len(d.split('\t')[3])
148 | else:
149 | break
150 | c=c.split('\t')
151 | new+=[ctg+'\t'+str(aestart+shiftpos)+'\t'+str(aeend+shiftpos)+'\t'+c[4]+'\t'+aesize+'\t'+aeinfo]
152 | return new
153 |
154 |
155 | def substitute_seq(ctgseq,newseq,ctgstart,ctgend,newstart,newend,diffsize):
156 | newpart=newseq[newstart:newend]
157 | oldpart=ctgseq[ctgstart-1000-10:ctgend+1000+10]
158 | realdiff=len(newpart)-len(oldpart)
159 | if (realdiff/float(diffsize)>2 or realdiff/float(diffsize)<0.5) and (abs(diffsize-realdiff)>300 and realdiff*diffsize>0):
160 | return (ctgseq,False)
161 | leftside= check_same(newpart[:100],oldpart[10:110])
162 | rightside= check_same(newpart[-100:],oldpart[-110:-10])
163 | shift1=0
164 | if leftside<90:
165 | for shift1 in range(-5,5):
166 | leftside= check_same(newpart[:100],oldpart[10+shift1:110+shift1])
167 | if leftside>=90:
168 | break
169 | shift2=0
170 | if rightside <90:
171 | for shift2 in range(-5,5):
172 | rightside= check_same(newpart[-100:],oldpart[-110+shift2:-10+shift2])
173 | if rightside>=90:
174 | break
175 | if leftside>=90 and rightside>=90:
176 | ctgseq=ctgseq[:ctgstart-1000+shift1]+newpart+ctgseq[ctgend+1000+shift2:]
177 | return (ctgseq,True)
178 | else:
179 | ctgseq=ctgseq[:ctgstart-1000]+newpart+ctgseq[ctgend+1000:]
180 | return (ctgseq,True)
181 |
182 | def ae_correct_within(seq,read,start,end,size):
183 | mapping=read.get_aligned_pairs()
184 | readstart=0;readend=0
185 | for c in mapping:
186 | if c[1]==start-1000:
187 | readstart=c[0]
188 | if c[1]==end+1000:
189 | readend=c[0]
190 | break
191 | if readstart!=0 and readend!=0:
192 | (seq,ifcorr)=substitute_seq(seq,read.query_sequence,start,end,readstart,readend,size)
193 | return (seq,ifcorr)
194 | else:
195 | return (seq,False)
196 |
197 |
198 | def ae_correct_between(seq,align,start,end,size):
199 | readstart=0;readend=0
200 | for c in align:
201 | if c.reference_start < start-1000 and c.reference_end > start-100 :
202 | mapping=c.get_aligned_pairs()
203 | for m in mapping:
204 | if m[1]==start-1000:
205 | readstart=m[0]
206 | if c.reference_start < end+100 and c.reference_end > end+1000:
207 | mapping=c.get_aligned_pairs()
208 | for m in mapping:
209 | if m[1]==end+1000:
210 | readend=m[0]
211 | if readstart!=0 and readend!=0:
212 | (seq,ifcorr)=substitute_seq(seq,align[0].query_sequence,start,end,readstart,readend,size)
213 | return (seq,ifcorr)
214 | else:
215 | return (seq,False)
216 |
217 |
218 | def ae_correct_expcol(seq,align,aetype):
219 | aeinfo=align[0].query_name
220 | start=int(aeinfo.split('__')[1])
221 | end=int(aeinfo.split('__')[2])
222 | if aetype=='exp':
223 | size=0-int(aeinfo.split('__')[3])
224 | else:
225 | size=int(aeinfo.split('__')[3])
226 |
227 | for read in align:
228 | if read.reference_start < start-1000 and end+1000old_ctg_'+ctg+'\n'+ctgseq+'\n')
254 | f.close()
255 | try:
256 | allctg=open(outpath+'assemble_workspace/new_contig_'+ctg+'.fa','r').read().split('>')[1:]
257 | except:
258 | return (ctgseq,0)
259 | newcontig={}
260 | ctgname=[]
261 | for c in allctg:
262 | newcontig[c.split('\n')[0]]=c.split('\n')[1]
263 | ctgname+=[c.split('\n')[0]]
264 | ctgname.sort(key=sortctg,reverse=True)
265 | f=open(outpath+'assemble_workspace/new_contig_'+ctg+'.fa','w')
266 | for c in ctgname:
267 | f.write('>'+c+'\n'+newcontig[c]+'\n')
268 | f.close()
269 |
270 | os.system('minimap2 -a '+outpath+'assemble_workspace/old_contig_'+ctg+'.fa '+outpath+'assemble_workspace/new_contig_'+ctg+'.fa --MD --eqx -t 6 --secondary=no -Y > '+outpath+'assemble_workspace/ctgalignment_'+ctg+'.sam')
271 |
272 |
273 | alignfile=pysam.AlignmentFile(outpath+'assemble_workspace/ctgalignment_'+ctg+'.sam','r')
274 | aeset.sort(key=sort_snp,reverse=True)
275 |
276 | allalign=alignfile.fetch(until_eof=True)
277 | lastreadname=''
278 | samectg=[]
279 | numcorr=0
280 | for aligninfo in allalign:
281 | if aligninfo.flag==4:
282 | print (aligninfo.query_name,' contig not aligned.')
283 | continue
284 | if aligninfo.query_name == lastreadname:
285 | samectg+=[aligninfo]
286 | continue
287 | if samectg!=[]:
288 | if 'exp' in lastreadname:
289 | (ctgseq,ifcorr)=ae_correct_expcol(ctgseq,samectg,'exp')
290 | if ifcorr:
291 | numcorr+=1
292 | if 'col' in lastreadname:
293 | (ctgseq,ifcorr)=ae_correct_expcol(ctgseq,samectg,'col')
294 | if ifcorr:
295 | numcorr+=1
296 | lastreadname=aligninfo.query_name
297 | samectg=[aligninfo]
298 | if samectg!=[]:
299 | if 'exp' in lastreadname:
300 | (ctgseq,ifcorr)=ae_correct_expcol(ctgseq,samectg,'exp')
301 | if ifcorr:
302 | numcorr+=1
303 | if 'col' in lastreadname:
304 | (ctgseq,ifcorr)=ae_correct_expcol(ctgseq,samectg,'col')
305 | if ifcorr:
306 | numcorr+=1
307 | logf=open(outpath+'Inspector_correct.log','a')
308 | logf.write('total ae'+str(len(aeset))+', corrected error '+str(numcorr)+'\n')
309 | logf.close()
310 | return (ctgseq,numcorr)
311 |
312 | mapinfo={}
313 | for c in aeset:
314 | mapinfo[c.split('\t')[5]]=int(c.split('\t')[1])
315 | allread=alignfile.fetch(until_eof=True)
316 | for aligninfo in allread:
317 | if aligninfo.is_secondary:
318 | continue
319 | if type( mapinfo[aligninfo.query_name[:-8]])==int and aligninfo.reference_start+1000 = aestart-1000:
344 | readstart=readpos-(refpos-aestart+1000)
345 | if readend==-1 and refpos>=aeend+1000:
346 | readend=readpos-(refpos-aeend-1000)
347 | if readstart>0 and readend>0:
348 | break
349 | num='';continue
350 | if m in 'IS':
351 | readpos+=int(num);num='';continue
352 | if m=='H':
353 | num='';continue
354 | newseq=aligninfo.query_sequence[readstart:readend]
355 | oldseq=ctgseq[aestart-1000-10:aeend+1000+10]
356 | leftside= check_same(newseq[:100],oldseq[10:110])
357 | rightside= check_same(newseq[-100:],oldseq[-110:-10])
358 | shift1=0
359 | if leftside<90:
360 | for shift1 in range(-5,5):
361 | leftside= check_same(newseq[:100],oldseq[10+shift1:110+shift1])
362 | if leftside>=90:
363 | break
364 | shift2=0
365 | if rightside <90:
366 | for shift2 in range(-5,5):
367 | rightside= check_same(newseq[-100:],oldseq[-110+shift2:-10+shift2])
368 | if rightside>=90:
369 | break
370 | if leftside>=90 and rightside>=90:
371 | ctgseq=ctgseq[:aestart-1000+shift1]+newseq+ctgseq[aeend+1000+shift2:]
372 | correctedstructural+=1
373 | return (ctgseq,correctedstructural)
374 |
375 |
376 |
377 | def error_correction_large(ctg,oldseq,aeset,snpset,bamfile,outpath,datatype,thread,flyeouttime):
378 | t0=time.time()
379 | (newseq,snpset)=base_correction(oldseq,snpset,ctg)
380 | if aeset!=[]:
381 | aeset=findpos(aeset,snpset,bamfile,outpath,datatype,thread,flyeouttime)
382 | if aeset!=[]:
383 | (newseq,numcorr)=ae_correction(newseq,aeset,outpath)
384 | ff=open(outpath+'contig_corrected_'+ctg+'.fa','w')
385 | ff.write('>'+ctg+'\n'+newseq+'\n')
386 | ff.close()
387 | t1=time.time()
388 | logf=open(outpath+'Inspector_correct.log','a')
389 | logf.write('TIME used for structural error correction of '+ctg+': '+str(t1-t0)+'\n')
390 | logf.close()
391 | return 0
392 |
393 | def error_correction_small(ctg,oldseq,snpset,bamfile,outpath,datatype):
394 | t0=time.time()
395 | #snpset=[c for c in snpset if c.split('\t')[0]==ctg]
396 | (newseq,snpset)=base_correction(oldseq,snpset,ctg)
397 | ff=open(outpath+'contig_corrected_'+ctg+'.fa','w')
398 | ff.write('>'+ctg+'\n'+newseq+'\n')
399 | ff.close()
400 | t1=time.time()
401 | logf=open(outpath+'Inspector_correct.log','a')
402 | logf.write('TIME used for small error correction of '+ctg+':'+str(t1-t0)+'\n')
403 | logf.close()
404 | return 0
405 |
406 |
407 |
408 |
--------------------------------------------------------------------------------
/denovo_plot.py:
--------------------------------------------------------------------------------
1 | import matplotlib
2 | matplotlib.use('Agg')
3 | import pysam
4 | import matplotlib.pyplot as plt
5 |
6 | def plot_n100(outpath,minlen):
7 | ctglen=open(outpath+'contig_length_info','r').read().split('\n')[:-1]
8 | ctglen=[int(c.split('\t')[1]) for c in ctglen if int(c.split('\t')[1]) >= minlen]
9 |
10 | n100=[]
11 | x100=[]
12 | ctglen.sort(reverse=True)
13 | totallen=sum(ctglen)
14 | addlen=0
15 | lastlen=0
16 | for i in range(100):
17 | x100+=[i+1]
18 |
19 | while addlen < (i+1)/100.0*totallen:
20 | try:
21 | lastlen=ctglen.pop(1)
22 | addlen+=lastlen
23 | except:
24 | break
25 | n100+=[lastlen]
26 | plt.plot(x100,n100,linewidth=2)
27 | plt.xlabel('N1-N100')
28 | plt.ylabel('Contig Length /bp')
29 | plt.savefig(outpath+'plot_n1n100.pdf')
30 | print ('end n100')
31 | return 0
32 |
33 |
34 | def plot_na100(outpath):
35 | samfile=pysam.AlignmentFile(outpath+'contig_to_ref.sam','r')
36 | allread=samfile.fetch()
37 | alignlen=[]
38 | for align in allread:
39 | if align.flag==4:
40 | continue
41 | alignlen+=[align.query_alignment_length]
42 | n100=[]
43 | x100=[]
44 | alignlen.sort(reverse=True)
45 | totallen=sum(alignlen)
46 | addlen=0
47 | lastlen=0
48 | for i in range(100):
49 | x100+=[i+1]
50 | while addlen < (i+1)/100.0*totallen:
51 | try:
52 | lastlen=alignlen.pop(1)
53 | addlen+=lastlen
54 | except:
55 | break
56 | n100+=[lastlen]
57 | plt.plot(x100,n100,linewidth=2)
58 | plt.xlabel('NA1-NA100')
59 | plt.ylabel('Contig Length /bp')
60 | plt.savefig(outpath+'plot_na1na100.pdf')
61 | print ('end na100')
62 | return 0
63 |
64 |
65 |
66 | def findpos(c,ctglength,step,startrefpos,ctgstartpos):
67 | temppos=[]
68 | ctgname=c.query_name
69 | refpos=c.reference_start
70 | cigar=c.cigarstring
71 | if 'S' not in cigar.split('M')[0].split('=')[0] and 'H' not in cigar.split('M')[0].split('=')[0]:
72 | ctgpos=0
73 | else:
74 | ctgpos=int(cigar.split('M')[0].split('=')[0].split('S')[0].split('H')[0])
75 | currpos=ctgpos
76 |
77 | num=''
78 |
79 | for m in cigar:
80 | num+=m; continue
81 | if m in 'M=X':
82 | ctgpos+=int(num); refpos+=int(num); num=''
83 | if m=='I':
84 | ctgpos+=int(num);num=''
85 | if m =='D':
86 | refpos+=int(num); num='';continue
87 | if m in 'SH':
88 | if refpos>c.reference_start:
89 | break
90 | else:
91 | num=''
92 | while ctgpos>=currpos+step:
93 | if ctgpos>=currpos+step*2:
94 | temppos+=[[currpos+step,refpos+startrefpos,ctgname]]
95 | else:
96 | temppos+=[[ctgpos,refpos+startrefpos,ctgname]]
97 | currpos+=step
98 | if c.flag in [16,2064]:
99 | temppos=[ [ctglength-mm[0],mm[1],mm[2]] for mm in temppos]
100 | updatestart=[]
101 | for mm in temppos :
102 | updatestart+=[[mm[0]+ctgstartpos,mm[1],mm[2]]]
103 |
104 | return updatestart
105 |
106 |
107 |
108 | def plot_dotplot(outpath):
109 | print ('start dot plot')
110 | samfile=pysam.AlignmentFile(outpath+'contig_to_ref.bam','rb')
111 | allchrom=samfile.references
112 | allchromlen=samfile.lengths
113 | maxreflen=max(allchromlen)
114 | idex=allchromlen.index(maxreflen)
115 | maxchrom=allchrom[idex]
116 | print (maxchrom)
117 | allread=samfile.fetch(maxchrom)
118 | if maxreflen >= 10000000:
119 | step=10000
120 | elif maxreflen>=1000000:
121 | step=1000
122 | else:
123 | step=100
124 |
125 | ctgleninfo=open(outpath+'contig_length_info','r').read().split('\n')[:-1]
126 | ctglen={}
127 | for c in ctgleninfo:
128 | ctglen[c.split('\t')[0]]=int(c.split('\t')[1])
129 | alignedctg={}
130 | for align in allread:
131 | if align.query_name not in alignedctg:
132 | alignedctg[align.query_name]=align.query_alignment_length
133 | else:
134 | alignedctg[align.query_name]+=align.query_alignment_length
135 | longalignctg=[c for c in alignedctg if alignedctg[c] >= maxreflen/100.0]
136 |
137 | print (len(longalignctg))
138 |
139 | startpos=0
140 | contig_startpos={}
141 | for c in longalignctg:
142 | contig_startpos[c]=startpos
143 | startpos+=ctglen[c]
144 | allpos=[]
145 |
146 | allread=samfile.fetch(maxchrom)
147 |
148 | for align in allread:
149 | if align.query_name not in longalignctg:
150 | continue
151 | temppos=[]
152 | ctglength=ctglen[align.query_name]
153 | ctgstart=contig_startpos[align.query_name]
154 | temppos=findpos(align,ctglength,step,0,ctgstart)
155 | allpos+=temppos
156 |
157 |
158 | plotx=[c[0] for c in allpos]
159 | ploty=[c[1] for c in allpos]
160 | plotcolor=[c[2] for c in allpos]
161 |
162 | allcolor=set(plotcolor)
163 | colors={}
164 | i=1
165 | for c in allcolor:
166 | colors[c]=i
167 | i+=10
168 | plotcolor2=[colors[c] for c in plotcolor]
169 | size=[1]*len(plotcolor2)
170 | plt.scatter(plotx,ploty,s=size,c=plotcolor2)
171 | plt.xlabel('Reference Position')
172 | plt.ylabel('Contig Position')
173 | plt.savefig(outpath+'plot_synteny.pdf')
174 | return 0
175 |
176 |
177 |
178 |
179 |
180 |
181 |
182 |
183 |
--------------------------------------------------------------------------------
/denovo_static.py:
--------------------------------------------------------------------------------
1 | import os
2 | import subprocess
3 | import pysam
4 | import sys
5 | import gzip
6 |
7 | def simple(contigfile,outpath,min_size,min_size_assemblyerror):
8 | if len(contigfile)==2:
9 | halp=True
10 | else:
11 | halp=False
12 | halpnum=1
13 | f=open(outpath+'valid_contig.fa','w')
14 | length=[]
15 | maxlen=0
16 | largecontiglength={}
17 | all_contigs=[]
18 | largecontigs=[]
19 | map_contigs=[]
20 | totallength=0
21 | totallength_large=0
22 | contig_length_info=[]
23 |
24 |
25 | for contig in contigfile:
26 | if contig.endswith('.gz'):
27 | contig=gzip.open(contig,'rt')
28 | else:
29 | contig=open(contig,'r')
30 | allcontig=contig.read().split('>')[1:]
31 | for c in allcontig:
32 | c=c.split('\n')[:-1]
33 | contig_name=c[0].split(' ')[0]
34 | if halp:
35 | contig_name='HAP_'+str(halpnum)+'_'+contig_name
36 | all_contigs+=[contig_name]
37 | seq=''
38 | length1=0
39 | for cc in c[1:]:
40 | seq+=cc
41 | length1+=len(cc)
42 | length+=[length1]
43 | contig_length_info+=[contig_name+'\t'+str(length1)]
44 | if length1>maxlen:
45 | maxlen=length1
46 | maxcontig=contig_name
47 | if length1>=min_size:
48 | f.write('>'+contig_name+'\n'+seq+'\n')
49 | totallength+=length1
50 | map_contigs+=[contig_name]
51 | if length1>=min_size_assemblyerror:
52 | largecontigs+=[contig_name]
53 | largecontiglength[contig_name]=length1
54 | totallength_large+=length1
55 |
56 | halpnum+=1
57 |
58 | f.close()
59 |
60 | length.sort(reverse=True)
61 | f=open(outpath+'contig_length_info','w')
62 | for c in contig_length_info:
63 | f.write(c+'\n')
64 | f.close()
65 | f=open(outpath+'summary_statistics','w')
66 | f.write('Statics of contigs:\n')
67 |
68 | iii=0
69 | total=sum(length)//2
70 | for c in length:
71 | iii+=c
72 | if iii>=total:
73 | n50=c; break
74 | length_ae=[c for c in length if c > min_size_assemblyerror]
75 |
76 |
77 | f.write('Number of contigs\t'+str(len(length))+'\n')
78 | f.write('Number of contigs > '+str(min_size)+' bp\t'+str(len(map_contigs))+'\n')
79 | f.write('Number of contigs >'+str(min_size_assemblyerror)+' bp\t'+str(len(length_ae))+'\n')
80 | f.write('Total length\t'+str(sum(length))+'\n')
81 | f.write('Total length of contigs > '+str(min_size)+' bp\t'+str(totallength)+'\n')
82 | f.write('Total length of contigs >'+str(min_size_assemblyerror)+'bp\t'+str(sum(length_ae))+'\n')
83 |
84 | if len(length)==0:
85 | logf=open(outpath+'Inspector.log','a')
86 | logf.write('Error: No contigs found. Check if input file is empty or if --min_contig_length is too high.\n')
87 | f.close()
88 | logf.close()
89 | quit()
90 |
91 | if len(length_ae)==0:
92 | logf=open(outpath+'Inspector.log','a')
93 | logf.write('Warning: No contigs larger than '+str(min_size_assemblyerror)+'bp. No structural errors will be reported. Check if --min_contig_length_assemblyerror is too high.\n')
94 | logf.close()
95 |
96 | f.write('Longest contig\t'+str(length[0])+'\n')
97 | if len(length)>1:
98 | f.write('Second longest contig length\t'+str(length[1])+'\n')
99 | f.write('N50\t'+str(n50)+'\n')
100 |
101 |
102 |
103 | iii=0; total=sum(length)//2; n50=0
104 | for c in length:
105 | iii+=c
106 | if iii>total:
107 | n50=c; break
108 | f.write('N50 of contigs >1Mbp\t'+str(n50)+'\n\n\n')
109 | f.close()
110 | return [all_contigs,map_contigs,largecontigs,totallength,totallength_large,maxcontig,maxlen,largecontiglength]
111 |
112 |
113 | def mapping_info_ctg(outpath,largechrom,smallchrom,contiglength,contiglength_large):
114 |
115 | f=open(outpath+'summary_statistics','a')
116 | f.write('Read to Contig alignment:\n')
117 |
118 | os.system('touch '+outpath+'map_depth/maplength_large_null')
119 | os.system('touch '+outpath+'map_depth/readnum_large_null')
120 | os.system('touch '+outpath+'map_depth/splitread_large_null')
121 |
122 |
123 | os.system('cat '+outpath+'map_depth/maplength_large_* > '+outpath+'map_depth/all_maplength_large')
124 | os.system('cat '+outpath+'map_depth/maplength_* > '+outpath+'map_depth/all_maplength_total')
125 | os.system('cat '+outpath+'map_depth/readnum_large_* > '+outpath+'map_depth/all_readnum_large')
126 | os.system('cat '+outpath+'map_depth/readnum_* > '+outpath+'map_depth/all_readnum_total')
127 | os.system('cat '+outpath+'map_depth/splitread_large_* > '+outpath+'map_depth/all_splitread_large')
128 | os.system('cat '+outpath+'map_depth/splitread_* > '+outpath+'map_depth/all_splitread_total')
129 |
130 | unmapped=int(pysam.AlignmentFile(outpath+'read_to_contig.bam','rb').unmapped)
131 |
132 | info=open(outpath+'map_depth/all_readnum_total','r').read().split('\n')[:-1]
133 | mapped=sum([int(ccc) for ccc in info])
134 | totalread=mapped+unmapped
135 | if totalread==0:
136 | logf=open(outpath+'Inspector.log','a')
137 | logf.write('Warning: No reads found in read_to_contig alignment.\n')
138 | logf.close()
139 | return 0
140 | mapprate=round(10000*float(mapped)/(totalread))/100.0
141 | f.write('Mapping rate /%\t'+str(mapprate)+'\n')
142 |
143 | info=open(outpath+'map_depth/all_splitread_total','r').read().split('\n')[:-1]
144 | splitread=sum([int(ccc) for ccc in info])
145 | splrate=round(10000*float(splitread)/mapped)/100.0
146 | f.write('Split-read rate /%\t'+str(splrate)+'\n')
147 |
148 | info=open(outpath+'map_depth/all_maplength_total','r').read().split('\n')[:-1]
149 | mappedlen=sum([int(ccc) for ccc in info])
150 | cov=round(10000*float(mappedlen)/contiglength)/10000.0
151 | f.write('Depth\t'+str(cov)+'\n')
152 |
153 | try:
154 | info=open(outpath+'map_depth/all_readnum_large','r').read().split('\n')[:-1]
155 | mapped=sum([int(ccc) for ccc in info])
156 | mapprate=round(10000*float(mapped)/(totalread))/100.0
157 | f.write('Mapping rate in large contigs /%\t'+str(mapprate)+'\n')
158 |
159 | info=open(outpath+'map_depth/all_splitread_large','r').read().split('\n')[:-1]
160 | splitread=sum([int(ccc) for ccc in info])
161 | splrate=round(10000*float(splitread)/mapped)/100.0
162 | f.write('Split-read rate in large contigs /%\t'+str(splrate)+'\n')
163 |
164 | info=open(outpath+'map_depth/all_maplength_large','r').read().split('\n')[:-1]
165 | mappedlen=sum([int(ccc) for ccc in info])
166 | cov=round(10000*float(mappedlen)/contiglength_large)/10000.0
167 | f.write('Depth in large conigs\t'+str(cov)+'\n\n\n')
168 | f.close()
169 |
170 | except:
171 | logf=open(outpath+'Inspector.log','a')
172 | logf.write('Warning: Failed to characterize read alignment in large contigs. \n')
173 | logf.close()
174 |
175 | return cov
176 |
177 |
178 |
179 | def sort_sv(a):
180 | return [a.split('\t')[0],int(a.split('\t')[1])]
181 |
182 |
183 | def assembly_info_cluster(outpath,min_size,max_size):
184 | os.system("cat "+outpath+"ae_merge_workspace/del_merged_* > "+outpath+"ae_merge_workspace/deletion-merged")
185 | os.system("cat "+outpath+"ae_merge_workspace/ins_merged_* > "+outpath+"ae_merge_workspace/insertion-merged")
186 | os.system("cat "+outpath+"ae_merge_workspace/inv_merged_* > "+outpath+"ae_merge_workspace/inversion-merged")
187 | f=open(outpath+'assembly_errors.bed','w')
188 | alldel=open(outpath+'ae_merge_workspace/deletion-merged','r').read().split('\n')[:-1]
189 | alldel=[c for c in alldel if min_size<=int(c.split('\t')[2])<=max_size]
190 | for c in alldel:
191 | c=c.split('\t')
192 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+int(c[2]))+'\t'+c[3]+'\tExpansion\tSize='+c[2]+'\t'+c[7]+'\n')
193 | allins=open(outpath+'ae_merge_workspace/insertion-merged','r').read().split('\n')[:-1]
194 | allins=[c for c in allins if min_size<=int(c.split('\t')[2])<=max_size]
195 | for c in allins:
196 | c=c.split('\t')
197 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+1)+'\t'+c[3]+'\tCollapse\tSize='+c[2]+'\t'+c[7]+'\n')
198 | allinv=open(outpath+'ae_merge_workspace/inversion-merged','r').read().split('\n')[:-1]
199 | allinv=[c for c in allinv if min_size<=int(c.split('\t')[2])<=max_size]
200 | for c in allinv:
201 | c=c.split('\t')
202 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+int(c[2]))+'\t'+c[3]+'\tInversion\t'+c[7]+'\n')
203 | f.close()
204 | return 0
205 |
206 |
207 | def assembly_info(outpath):
208 |
209 | os.system("cat "+outpath+"del-merged-* > "+outpath+"deletion-merged")
210 | os.system("cat "+outpath+"ins-merged-* > "+outpath+"insertion-merged")
211 | os.system("cat "+outpath+"dup-merged-* > "+outpath+"duplication-merged")
212 | os.system("cat "+outpath+"inv-merged-* > "+outpath+"inversion-merged")
213 | os.system("rm "+outpath+"*-info-* "+outpath+"*-merged-*")
214 |
215 | f=open(outpath+'deletion-merged','r')
216 | alldel=f.read().split('\n')[:-1]
217 | f.close()
218 | allins=open(outpath+'insertion-merged','r').read().split('\n')[:-1]
219 | alldup=open(outpath+'duplication-merged','r').read().split('\n')[:-1]
220 | allins+=alldup
221 | allsv=alldel+allins
222 | allsv.sort(key=sort_sv)
223 | f=open(outpath+'assembly_errors.bed','w')
224 | for c in allsv:
225 | if 'Del' in c:
226 | c=c.split('\t')
227 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+int(c[2]))+'\t'+c[3]+'\tExpansion\tSize='+c[2]+'\t'+c[6]+'\n')
228 | if 'Ins' in c :
229 | c=c.split('\t')
230 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+1)+'\t'+c[3]+'\tCollapse\tSize='+c[2]+'\t'+c[6]+'\n')
231 | allinv=open(outpath+'inversion-merged','r').read().split('\n')[:-1]
232 | for c in allinv:
233 | c=c.split('\t')
234 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+int(c[2]))+'\t'+c[3]+'\tInversion\t'+c[6]+'\n')
235 | f.close()
236 |
237 | f=open(outpath+'summary_statistics','a')
238 | f.write('Number of assembly collapse\t'+str(len(allins))+'\n')
239 | f.write('Number of assembly expansion\t'+str(len(alldel))+'\n')
240 | f.write('Number of assembly inversion\t'+str(len(allinv))+'\n')
241 | f.close()
242 | return 0
243 |
244 | def assembly_info_ref(outpath):
245 |
246 | os.system("cat "+outpath+"del-merged-* > "+outpath+"deletion-merged_ref")
247 | os.system("cat "+outpath+"ins-merged-* > "+outpath+"insertion-merged_ref")
248 | os.system("cat "+outpath+"dup-merged-* > "+outpath+"duplication-merged_ref")
249 | os.system("cat "+outpath+"inv-merged-* > "+outpath+"inversion-merged_ref")
250 | os.system("rm "+outpath+"*-info-* "+outpath+"*-merged-*")
251 |
252 | f=open(outpath+'deletion-merged_ref','r')
253 | alldel=f.read().split('\n')[:-1]
254 | f.close()
255 | allins=open(outpath+'insertion-merged_ref','r').read().split('\n')[:-1]
256 | alldup=open(outpath+'duplication-merged_ref','r').read().split('\n')[:-1]
257 | allins+=alldup
258 | allsv=alldel+allins
259 | allsv.sort(key=sort_sv)
260 | f=open(outpath+'structural_errors_ref.bed','w')
261 | for c in allsv:
262 | if 'Ins' in c or 'Dup' in c:
263 | c=c.split('\t')
264 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+int(c[2]))+'\tExpansion\n')
265 | if 'Del' in c:
266 | c=c.split('\t')
267 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+1)+'\tCollapse\tSize='+c[2]+'\n')
268 | allinv=open(outpath+'inversion-merged_ref','r').read().split('\n')[:-1]
269 | for c in allinv:
270 | c=c.split('\t')
271 | f.write(c[0]+'\t'+c[1]+'\t'+str(int(c[1])+int(c[2]))+'\tInversion\n')
272 | f.close()
273 |
274 | f=open(outpath+'summary_statistics','a')
275 | f.write('Assembly errors from contig to reference:\n')
276 | f.write('Number of assembly collapse\t'+str(len(alldel))+'\n')
277 | f.write('Number of assembly expansion\t'+str(len(allins))+'\n')
278 | f.write('Number of assembly inversion\t'+str(len(allinv))+'\n\n\n')
279 | f.close()
280 | #os.system("rm "+outpath+"*ion-merged")
281 | return 0
282 |
283 |
284 | def basepair_error(outpath):
285 | f=open(outpath+'assembly_basepair_error.vcf','r')
286 | a=f.readline()
287 | mismatch=0
288 | dels=0
289 | ins=0
290 | snp=0
291 | mnp=0
292 | while a!='':
293 | if a[0]=='#':
294 | a=f.readline(); continue
295 | if 'TYPE=snp' in a:
296 | mismatch+=1
297 | snp+=1
298 | if 'TYPE=ins' in a:
299 | mismatch+=len(a.split('\t')[4])-len(a.split('\t')[3])
300 | ins+=1
301 | if 'TYPE=del' in a:
302 | mismatch+=len(a.split('\t')[3])-len(a.split('\t')[4])
303 | dels+=1
304 | if 'TYPE=mnp' in a:
305 | mismatch+=len(a.split('\t')[4])
306 | mnp+=1
307 | a=f.readline()
308 | accuracy=1-mismatch/100000.0
309 | f=open(outpath+'summary_statistics','a')
310 | f.write('Number of small collapse\t'+str(ins)+'\n')
311 | f.write('Number of small expansion\t'+str(dels)+'\n')
312 | f.write('Number of single basepair error\t'+str(snp)+'\n')
313 | f.write('Number of multiple basepair error\t'+str(mnp)+'\n')
314 | f.write('Base pair accuracy\t'+str(accuracy)+'\n\n\n')
315 | return 0
316 |
317 | def basepair_error_ref(outpath,largestchr):
318 | f=open(outpath+'contig_to_ref.sam','r')
319 | a=f.readline()
320 | mismatch=0
321 | totallength=0
322 | ins=0; dels=0; snp=0
323 | svs=[]
324 | while a!='':
325 | if a[0]=='@':
326 | a=f.readline(); continue
327 | if a.split('\t')[0]!=largestchr or a.split('\t')[1] in ['256','272']:
328 | a=f.readline(); continue
329 | cigar=a.split('\t')[5]
330 | num=''
331 | length=0
332 | chrom=a.split('\t')[2]
333 | pos=int(a.split('\t')[3])
334 | for c in cigar:
335 | if c in '1234567890':
336 | num+=c
337 | if c in 'SH':
338 | num=''
339 | if c in 'M=':
340 | length+=int(num); num=''
341 | if c == 'X':
342 | if int(num)==1:
343 | svs+=[chrom+'\t'+str(pos+length)+'\t'+str(pos+length)+'\tSNP']
344 | else:
345 | svs+=[chrom+'\t'+str(pos+length)+'\t'+str(pos+length+int(num)-1)+'\tMNP\tsize='+num]
346 | length+=int(num); mismatch+=int(num);num=''
347 | snp+=1
348 | if c == 'I':
349 | if int(num)<=10:
350 | svs+=[chrom+'\t'+str(pos+length)+'\t'+str(pos+length+1)+'\tExpansion\tsize='+num]
351 | mismatch+=int(num); ins+=1
352 | num=''
353 | if c == 'D':
354 | if int(num)<=10:
355 | svs+=[chrom+'\t'+str(pos+length)+'\t'+str(pos+length+int(num))+'\tCollapse']
356 | mismatch+=int(num); dels+=1
357 | length+=int(num)
358 | num=''
359 |
360 | totallength+=length
361 | a=f.readline()
362 | accuracy=round((1-mismatch/float(totallength))*10000)/10000.0
363 | f=open(outpath+'summary_statistics','a')
364 | f.write('Base pair accuracy of longest contig from contig to reference:\n')
365 | f.write('Number of small assembly collapse\t'+str(dels)+'\n')
366 | f.write('Number of small assembly extension\t'+str(ins)+'\n')
367 | f.write('Number of single basepair error\t'+str(snp)+'\n')
368 | f.write('Base pair accuracy\t'+str(accuracy)+'\n\n\n')
369 | f.close()
370 | f=open(outpath+'small_scale_error_ref.bed','w')
371 | for c in svs:
372 | f.write(c+'\n')
373 | f.close()
374 | return 0
375 |
376 | def sortblock(a):
377 | return [a[0],a[1]]
378 |
379 | def count_ref_coverage(refcoveredall):
380 | allchrom=list(set([c[0] for c in refcoveredall]))
381 | new=[]
382 | for chrom in allchrom:
383 | refcovered=[c for c in refcoveredall if c[0]==chrom]
384 | refcovered.sort(key=sortblock)
385 | ifovlp=0
386 | while len(refcovered)>1:
387 | if refcovered[0][2]<=refcovered[1][1]:
388 | new+=[refcovered[0]]
389 | refcovered=refcovered[1:]
390 | else:
391 | i=0
392 | ovlpstart=refcovered[i+1][1]; ovlpend=min(refcovered[i][2],refcovered[i+1][2])
393 | newblock=[]
394 | if refcovered[i+1][1] > refcovered[i][1]:
395 | newblock+=[[refcovered[i][0],refcovered[i][1],refcovered[i+1][1],refcovered[i][3]]]
396 | newblock+=[[refcovered[i][0],ovlpstart,ovlpend,refcovered[i][3]+refcovered[i+1][3]]]
397 | if refcovered[i+1][2] > refcovered[i][2]:
398 | newblock+=[[refcovered[i][0],refcovered[i][2],refcovered[i+1][2],refcovered[i+1][3]]]
399 | if refcovered[i+1][2]2]
409 | base1=sum(b1)
410 | base2=sum(b2)
411 | base3=sum(b3)
412 | return (base1,base2,base3)
413 |
414 |
415 | def get_ref_align_info(path,totallength):
416 | f=pysam.AlignmentFile(path+'contig_to_ref.sam','r')
417 | allali=f.fetch()
418 | maplen=[]
419 | refcovered=[]
420 | for c in allali:
421 | if c.flag==4:
422 | continue
423 | readlen=c.query_alignment_length
424 | refcovered+=[[c.reference_name,c.reference_start,c.reference_end,1]]
425 | if c.flag in [0,2048]:
426 | leftclipinfo=c.cigartuples[0]
427 | leftclip = leftclipinfo[1] if leftclipinfo[0]==5 or leftclipinfo[0]==4 else 0
428 | leftclip = leftclipinfo[1] if leftclipinfo[0]==5 or leftclipinfo[0]==4 else 0
429 | maplen+=[[c.query_name,leftclip,leftclip+readlen]]
430 | if c.flag in [16,2064]:
431 | leftclipinfo=c.cigartuples[-1]
432 | leftclip = leftclipinfo[1] if leftclipinfo[0]==5 or leftclipinfo[0]==4 else 0
433 | maplen+=[[c.query_name,leftclip,leftclip+readlen]]
434 |
435 | n50info=[c[2]-c[1] for c in maplen]
436 | n50info.sort(reverse=True)
437 | lenacc=0
438 | na50=0
439 | info=sum(n50info)
440 | for c in n50info:
441 | lenacc+=c
442 | if lenacc>=0.5*info:
443 | na50=c;break
444 |
445 | assembly_maplenratio=float(info)/totallength
446 | (base1,base2,base3)=count_ref_coverage(refcovered)
447 |
448 | totalrefbase=sum(f.lengths)
449 | allrefchrom=list(f.references)
450 |
451 | base0=totalrefbase-base1-base2-base3
452 | f=open(path+'summary_statistics','a')
453 | f.write('\n\n\nReference-based mode:\n')
454 | f.write('Genome Coverage /% '+str(float(base1+base2+base3)/totalrefbase)+'\nReference base with Depth=0 (including Ns): '+str(base0)+';\t'+str(base0/float(totalrefbase)*100)+'%\n')
455 | f.write('Reference base with Depth=1 '+str(base1)+';\t'+str(base1/float(totalrefbase)*100)+'%\n')
456 | f.write('Reference base with Depth=2 '+str(base2)+';\t'+str(base2/float(totalrefbase)*100)+'%\n')
457 | f.write('Reference base with Depth>2 '+str(base3)+';\t'+str(base3/float(totalrefbase)*100)+'%\n')
458 | f.write('Assembly contig mapping ratio (length) /%'+str(assembly_maplenratio)+'\n')
459 | f.write('Assembly contig NA50 '+str(na50)+'\n')
460 | f.close()
461 |
462 | return allrefchrom
463 |
464 |
465 | def get_ref_chroms(outpath):
466 | f=open(outpath+'contig_to_ref.sam','r')
467 | a=f.readline()
468 | chroms=[]
469 | length=0
470 | longestlen=0
471 | longestchr=''
472 | while a[0]=='@':
473 | if a[:3]=='@SQ':
474 | chroms+=[a.split('\t')[1].split(':')[1]]
475 | length+=int(a.split('\t')[2].split(':')[1])
476 | if int(a.split('\t')[2].split(':')[1])>longestlen:
477 | longestlen=int(a.split('\t')[2].split(':')[1])
478 | longestchr=a.split('\t')[1].split(':')[1]
479 | a=f.readline()
480 | a=int(subprocess.check_output("awk \'$3==0\' "+outpath+'contig_to_ref.depth | wc -l',shell=True))
481 | covered=length-a
482 | return (chroms,length,longestchr,longestlen,covered)
483 |
484 |
485 | def check_depth_ref(outpath,ref):
486 | cov0=int(subprocess.check_output("awk \'$3==0\' "+outpath+'contig_to_ref.depth | wc -l',shell=True))
487 | cov1=int(subprocess.check_output("awk \'$3==1\' "+outpath+'contig_to_ref.depth | wc -l',shell=True))
488 | cov2=int(subprocess.check_output("awk \'$3==2\' "+outpath+'contig_to_ref.depth | wc -l',shell=True))
489 | cov3=int(subprocess.check_output("awk \'$3>2\' "+outpath+'contig_to_ref.depth | wc -l',shell=True))
490 |
491 | total=cov0+cov1+cov2+cov3
492 |
493 | f=open(outpath+'summary_statistics','a')
494 | f.write('#BP with cov=0 '+str(cov0)+', '+str(cov0*100.00/total)+'\n')
495 | f.write('#BP with cov=1 '+str(cov1)+', '+str(cov1*100.00/total)+'\n')
496 | f.write('#BP with cov=2 '+str(cov2)+', '+str(cov2*100.00/total)+'\n')
497 | f.write('#BP with cov>2 '+str(cov3)+', '+str(cov3*100.00/total)+'\n')
498 | f.write('Coverage: '+str(1-round(10000*float(cov0)/total)/10000.0)+'\n')
499 | f.close()
500 |
501 | return 0
502 |
503 |
--------------------------------------------------------------------------------
/inspector-correct.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import argparse
3 | import multiprocessing
4 | import sys
5 | import denovo_correct as inspector_correct
6 | import os
7 | from datetime import datetime
8 | import time
9 |
10 |
11 | t0=time.time()
12 | parser=argparse.ArgumentParser(description='Assembly error correction based on Inspector assembly evaluation', usage='inspector-correct.py [-h] -i inspector_out/ --datatype pacbio-raw ')
13 | parser.add_argument('-v','--version', action='version', version='Inspector_correct_v1.0')
14 | parser.add_argument('-i','--inspector',type=str,default=False,help='Inspector evaluation directory. Original file names are required.',required=True)
15 | parser.add_argument('--datatype',type=str,default=False,help='Type of read used for Inspector evaluation. This option is required for structural error correction when performing local assembly with Flye. (pacbio-raw, pacbio-hifi, nano-raw,pacbio-corr, nano-corr)',required=True)
16 | parser.add_argument('-o','--outpath',type=str,default=False,help='output directory')
17 | parser.add_argument('--flyetimeout',type=int,default=1200,help='Maximal runtime for local assembly with Flye. Unit is second. [1200]')
18 | parser.add_argument('--skip_structural',action='store_true',default=False,help='Do not correct structural errors. Local assembly will not be performed.')
19 | parser.add_argument('--skip_baseerror',action='store_true',default=False,help='Do not correct base errors.')
20 | parser.add_argument('-t','--thread',type=int,default=8,help='number of threads')
21 |
22 | if len(sys.argv)==1:
23 | parser.print_help()
24 | sys.exit(1)
25 |
26 | inscor_args=parser.parse_args()
27 | if inscor_args.inspector[-1]!='/':
28 | readpath=inscor_args.inspector+'/'
29 | else:
30 | readpath=inscor_args.inspector
31 | if not inscor_args.outpath:
32 | outpath=readpath
33 | else:
34 | if inscor_args.outpath[-1]!='/':
35 | outpath=inscor_args.outpath+'/'
36 | else:
37 | outpath=inscor_args.outpath
38 | if not os.path.exists(outpath):
39 | os.mkdir(outpath)
40 |
41 |
42 | logf=open(outpath+'Inspector_correct.log','a')
43 | logf.write('Inspector assembly error correction starting... '+datetime.now().strftime("%d/%m/%Y %H:%M:%S")+'\n')
44 |
45 | if not inscor_args.skip_structural and not inscor_args.datatype:
46 | logf.write('Error: No data type (--datatype) given!\nFor Debreak usage, use -h or --help\n')
47 | sys.exit(1)
48 |
49 | if inscor_args.datatype not in ['pacbio-raw','pacbio-hifi', 'pacbio-corr', 'nano-raw','nano-corr']:
50 | logf.write('Error: Data type (--datatype) not valid. Supported read types are: pacbio-raw, pacbio-hifi, pacbio-corr, nano-raw, nano-corr.\n')
51 | sys.exit(1)
52 |
53 |
54 | t1=time.time()
55 | logf.write('TIME for validating parameter'+str(t1-t0)+'\n')
56 |
57 | try:
58 | allctg=open(readpath+'valid_contig.fa','r').read().split('>')[1:]
59 | except:
60 | logf.write('Error: Contig file not valid. Please keep original file name in the inspector output directory.\nCheck if file is valid: '+readpath+'valid_contig.fa\n')
61 | sys.exit(1)
62 | ctginfo={}
63 | for c in allctg:
64 | ctginfo[c.split('\n')[0]]=c.split('\n')[1]
65 |
66 |
67 | t2=time.time()
68 | logf.write('TIME for reading contig and length'+str(t2-t1)+'\n')
69 | newsnplist=[]
70 |
71 | if not inscor_args.skip_baseerror:
72 | try:
73 | allsnplist=open(readpath+'small_scale_error.bed','r').read().split('\n')[1:-1]
74 | except:
75 | logf.write('Warning: small-scale eror bed file not found. Check file \'small_scale_error.bed\' in Inspector evaluation directory. Continue without small-scale error correction.\n')
76 | allsnplist=[]
77 | else:
78 | allsnplist=[]
79 |
80 | snpctg={}
81 |
82 | if not inscor_args.skip_structural:
83 | os.system('mkdir '+outpath+'assemble_workspace/')
84 | try:
85 | allaelist=open(readpath+'structural_error.bed','r').read().split('\n')[1:-1]
86 | except:
87 | logf.write('Warning: structural eror bed file not found. Check file \'structural_error.bed\' in Inspector evaluation directory. Continue without structural error correction.\n')
88 | allaelist=[]
89 | else:
90 | allaelist=[]
91 |
92 | snpctg={}
93 | aectg={}
94 | for ctgname in ctginfo:
95 | snpctg[ctgname]=[]
96 | aectg[ctgname]=[]
97 | for aeinfo in allsnplist:
98 | snpctg[aeinfo.split('\t')[0]]+=[aeinfo]
99 | for aeinfo in allaelist:
100 | aectg[aeinfo.split('\t')[0]]+=[aeinfo]
101 |
102 | allsnplist=[]
103 | allaelist=[]
104 | bamfile=readpath+'read_to_contig.bam'
105 |
106 | t3=time.time()
107 | logf.write('TIME for reading assembly errors'+str(t3-t2)+'\n')
108 | logf.close()
109 |
110 |
111 | for chrominfo in ctginfo:
112 | inspector_correct.error_correction_large(chrominfo,ctginfo[chrominfo],aectg[chrominfo],snpctg[chrominfo],bamfile,outpath,inscor_args.datatype,inscor_args.thread//3,inscor_args.flyetimeout)
113 |
114 | t4=time.time()
115 | logf=open(outpath+'Inspector_correct.log','a')
116 | logf.write('TIME for correcting all contigs'+str(t4-t3)+'\n')
117 | logf.close()
118 |
119 | f=open(outpath+'contig_corrected.fa','w')
120 | for chrominfo in ctginfo:
121 | try:
122 | correctedinfo=open(outpath+'contig_corrected_'+chrominfo+'.fa','r').read()
123 | f.write(correctedinfo)
124 | except:
125 | logf=open(outpath+'Inspector_correct.log','a')
126 | logf.write('Warning: corrected contig ',chrominfo,'not found.\n')
127 | logf.close()
128 | f.close()
129 | t5=time.time()
130 | logf=open(outpath+'Inspector_correct.log','a')
131 | logf.write('TIME for writing corrected contig'+str(t5-t4)+'\n')
132 | os.system('rm '+outpath+'contig_corrected_*fa')
133 | logf.write('Inspector error correction finished. Bye.\n')
134 | logf.close()
135 |
136 |
137 |
--------------------------------------------------------------------------------
/inspector.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import os
3 | import argparse
4 | import denovo_static
5 | import debreak_detect
6 | import debreak_merge_clustering as debreak_cluster
7 | import debreak_merge
8 | import multiprocessing
9 | import math
10 | import time
11 | from datetime import datetime
12 | t0=time.time()
13 |
14 | parser=argparse.ArgumentParser(description='de novo assembly evaluator', usage='inspector.py [-h] -c contig.fa -r raw_reads.fastq -o output_dict/')
15 | parser.add_argument('--version', action='version', version='Inspector_v1.3.1')
16 | parser.add_argument('-c','--contig',action='append', dest='contigfile',default=[],help='assembly contigs in FASTA format',required=True)
17 | parser.add_argument('-r','--read',type=str,default=False,help='sequencing reads in FASTA/FASTQ format',required=True,nargs='+')
18 | parser.add_argument('-d','--datatype',type=str,default='clr',help='Input read type. (clr, hifi, nanopore) [clr]')
19 | parser.add_argument('-o','--outpath',type=str,default='./adenovo_evaluation-out/',help='output directory')
20 | parser.add_argument('--ref',type=str,default=False,help='OPTIONAL reference genome in FASTA format')
21 |
22 | parser.add_argument('-t','--thread',type=int,default=8,help='number of threads. [8]')
23 | parser.add_argument('--min_depth',type=int,default=False,help='minimal read-alignment depth for a contig base to be considered in QV calculation. [20%% of average depth]')
24 | parser.add_argument('--min_contig_length',type=int,default=10000,help='minimal length for a contig to be evaluated. [10000]')
25 | parser.add_argument('--min_contig_length_assemblyerror',type=int,default=1000000,help='minimal contig length for assembly error detection. [1000000]')
26 | parser.add_argument('--min_assembly_error_size',type=int,default=50,help='minimal size for assembly errors. [50]')
27 | parser.add_argument('--max_assembly_error_size',type=int,default=4000000,help='maximal size for assembly errors. [4000000]')
28 | parser.add_argument('--noplot',action='store_true',default=False,help='do not make plots')
29 | parser.add_argument('--skip_read_mapping',action='store_true',default=False,help='skip the step of mapping reads to contig.')
30 | parser.add_argument('--skip_structural_error',action='store_true',default=False,help='skip the step of identifying large structural errors.')
31 | parser.add_argument('--skip_structural_error_detect',action='store_true',default=False,help='skip the step of detecting large structural errors.')
32 | parser.add_argument('--skip_base_error',action='store_true',default=False,help='skip the step of identifying small-scale errors.')
33 | parser.add_argument('--skip_base_error_detect',action='store_true',default=False,help='skip the step of detecting small-scale errors from pileup.')
34 |
35 | denovo_args=parser.parse_args()
36 |
37 | if denovo_args.outpath[-1]!='/':
38 | denovo_args.outpath+='/'
39 | if not os.path.exists(denovo_args.outpath):
40 | os.mkdir(denovo_args.outpath)
41 |
42 | logf=open(denovo_args.outpath+'Inspector.log','a')
43 | logf.write('Inspector starting... '+datetime.now().strftime("%d/%m/%Y %H:%M:%S")+'\n')
44 | logf.write("Start Assembly evaluation with contigs: " + str(denovo_args.contigfile)+'\n')
45 | validate_read=[]
46 | for inputfile in denovo_args.read:
47 | try:
48 | open(inputfile,'r')
49 | validate_read+=[inputfile]
50 | except:
51 | logf.write('Warning: cannot open input file \"'+inputfile+'\". Removed from list.'+'\n')
52 | if len(validate_read)==0:
53 | logf.write('Error: No valida input read file. Abort.\n')
54 | quit()
55 |
56 | if denovo_args.datatype not in ['clr','hifi','nanopore']:
57 | logf.write('Warning: Invalid input datatype (--datatype/-d). Should be one of the following: clr, ccs, nanopore. Use clr as default.\n')
58 | denovo_args.datatype='clr'
59 |
60 | # Check input arguments
61 | if len(denovo_args.contigfile)==1:
62 | singlecontig=True
63 | elif len(denovo_args.contigfile)==2:
64 | singlecontig=False
65 | else:
66 | logf.write('Error: Input contig file should be either 1 fasta file or two halploid.fa files. Check input -c/--contig.\n')
67 | quit()
68 |
69 | if not denovo_args.skip_base_error:
70 | import denovo_baseerror
71 | logf.close()
72 |
73 | # Simple statistics of contigs
74 | contiginfo=denovo_static.simple(denovo_args.contigfile,denovo_args.outpath,denovo_args.min_contig_length,denovo_args.min_contig_length_assemblyerror)
75 | chromosomes=contiginfo[0]
76 | chromosomes_map=contiginfo[1]
77 | chromosomes_large=contiginfo[2]
78 | largecontig_length=contiginfo[7]
79 | chromosomes_small=[mmm for mmm in chromosomes_map if mmm not in chromosomes_large]
80 | totalcontiglen=contiginfo[3]
81 | totalcontiglen_large=contiginfo[4]
82 |
83 |
84 | t1=time.time()
85 | logf=open(denovo_args.outpath+'Inspector.log','a')
86 | logf.write('TIME: Before read mapping '+str(t1-t0)+'\n')
87 | logf.close()
88 |
89 |
90 | # Read alignment
91 | if not denovo_args.skip_read_mapping:
92 | inputfileid=1
93 | for inputfile in validate_read:
94 | os.system("minimap2 -a -Q -N 1 -I 10G -t " + str(denovo_args.thread) + " "+denovo_args.outpath+"valid_contig.fa " + inputfile + " | samtools sort -@ " + str(denovo_args.thread) + " -o "+denovo_args.outpath+"read_to_contig_"+str(inputfileid)+".bam")
95 | inputfileid+=1
96 | if len(validate_read)>1:
97 | os.system("samtools merge "+denovo_args.outpath+"read_to_contig.bam "+denovo_args.outpath+"read_to_contig_*.bam")
98 | os.system("rm "+str(denovo_args.outpath)+"read_to_contig_*.bam")
99 | else:
100 | os.system("mv "+denovo_args.outpath+"read_to_contig_1.bam "+denovo_args.outpath+"read_to_contig.bam")
101 | os.system("samtools index "+str(denovo_args.outpath)+"read_to_contig.bam")
102 | t2=time.time()
103 | logf=open(denovo_args.outpath+'Inspector.log','a')
104 | logf.write('TIME: Read Alignment: '+str(t2-t1)+'\n')
105 | logf.close()
106 |
107 |
108 | # Structural assembly error detection
109 | if not denovo_args.skip_structural_error_detect:
110 | os.system("mkdir "+denovo_args.outpath+"map_depth/")
111 | if not denovo_args.skip_structural_error:
112 | os.system("mkdir "+denovo_args.outpath+"debreak_workspace/")
113 | debreak_det=multiprocessing.Pool(denovo_args.thread)
114 | for i in range(len(chromosomes_large)):
115 | debreak_det.apply_async(debreak_detect.detect_sortbam,args=(denovo_args.outpath,denovo_args.min_assembly_error_size,denovo_args.max_assembly_error_size,chromosomes_large[i]))
116 | for i in range(len(chromosomes_small)):
117 | debreak_det.apply_async(debreak_detect.detect_sortbam_nosv,args=(denovo_args.outpath,chromosomes_small[i],'small'))
118 | debreak_det.close()
119 | debreak_det.join()
120 | os.system("cat "+denovo_args.outpath+"debreak_workspace/read_to_contig_*debreak.temp > "+denovo_args.outpath+"read_to_contig.debreak.temp")
121 | else:
122 | debreak_det=multiprocessing.Pool(denovo_args.thread)
123 | for i in range(len(chromosomes_large)):
124 | debreak_det.apply_async(debreak_detect.detect_sortbam_nosv,args=(denovo_args.outpath,chromosomes_large[i],'large'))
125 | for i in range(len(chromosomes_small)):
126 | debreak_det.apply_async(debreak_detect.detect_sortbam_nosv,args=(denovo_args.outpath,chromosomes_small[i],'small'))
127 | debreak_det.close()
128 | debreak_det.join()
129 |
130 |
131 | cov=denovo_static.mapping_info_ctg(denovo_args.outpath,chromosomes_large,chromosomes_small,totalcontiglen,totalcontiglen_large)
132 | minsupp=max(1,round(cov/10.0))
133 |
134 | t3=time.time()
135 | logf=open(denovo_args.outpath+'Inspector.log','a')
136 | logf.write('TIME: Structural error signal detection: '+str(t3-t2)+'\n')
137 | logf.close()
138 |
139 |
140 | aelen_structuralerror=0
141 | if not denovo_args.skip_structural_error:
142 | os.system('mkdir '+denovo_args.outpath+'ae_merge_workspace')
143 | for chrom in largecontig_length:
144 | contiglength=largecontig_length[chrom]
145 | debreak_cluster.cluster(denovo_args.outpath,chrom,contiglength,minsupp,cov*2)
146 | debreak_cluster.cluster_ins(denovo_args.outpath,chrom,contiglength,minsupp,cov*2,'ins')
147 | debreak_cluster.cluster_ins(denovo_args.outpath,chrom,contiglength,minsupp,cov*2,'inv')
148 | denovo_static.assembly_info_cluster(denovo_args.outpath,denovo_args.min_assembly_error_size,denovo_args.max_assembly_error_size)
149 | debreak_cluster.genotype(cov,denovo_args.outpath)
150 |
151 | aelen_structuralerror=debreak_cluster.filterae(cov,denovo_args.outpath,denovo_args.min_assembly_error_size,denovo_args.datatype)
152 |
153 | t4=time.time()
154 | logf=open(denovo_args.outpath+'Inspector.log','a')
155 | logf.write('TIME: Structural error clustering : '+str(t4-t3)+'\n')
156 | logf.close()
157 |
158 | # SNP & indel detection
159 | aelen_baseerror=0
160 | if not denovo_args.skip_base_error:
161 | if not denovo_args.skip_base_error_detect:
162 | os.system('samtools faidx '+denovo_args.outpath+'valid_contig.fa')
163 | debreak_det=multiprocessing.Pool(denovo_args.thread)
164 | os.system('mkdir '+denovo_args.outpath+'base_error_workspace')
165 | for chrom in chromosomes_map:
166 | debreak_det.apply_async(denovo_baseerror.getsnv,args=(denovo_args.outpath,chrom,cov*2/5,cov*2,denovo_args.min_depth))
167 | debreak_det.close()
168 | debreak_det.join()
169 |
170 | aelen_baseerror=denovo_baseerror.count_baseerrror(denovo_args.outpath,totalcontiglen,denovo_args.datatype,cov)
171 |
172 | t5=time.time()
173 | logf=open(denovo_args.outpath+'Inspector.log','a')
174 | logf.write('TIME: Small-scale error detection: '+str(t5-t4)+'\n')
175 | logf.close()
176 |
177 | #QV
178 | if aelen_structuralerror+aelen_baseerror>0:
179 | try:
180 | allvalidnum=open(denovo_args.outpath+'base_error_workspace/validbase','r').read().split('\n')[:-1]
181 | validctgbase=sum([int(validnum) for validnum in allvalidnum])
182 | except:
183 | validctgbase=totalcontiglen
184 | qv=-10 * math.log10(float(aelen_baseerror+aelen_structuralerror)/validctgbase)
185 |
186 | f=open(denovo_args.outpath+'summary_statistics','a')
187 | f.write('\nQV\t'+str(qv)+'\n')
188 | f.close()
189 |
190 |
191 | t6=time.time()
192 | logf=open(denovo_args.outpath+'Inspector.log','a')
193 | logf.write('TIME: QV calculation: '+str(t6-t5)+'\n')
194 | logf.close()
195 |
196 | # Reference-based evaluation
197 | if denovo_args.ref:
198 | mapinfo=os.system("minimap2 -a -I 10G --eqx -x asm5 -t " + str(denovo_args.thread//2) + " "+denovo_args.ref+" " + denovo_args.outpath + "valid_contig.fa --secondary=no > "+ denovo_args.outpath+"contig_to_ref.sam")
199 | os.system("samtools sort -@ " + str(denovo_args.thread//2) + " "+ denovo_args.outpath+"contig_to_ref.sam -o " + denovo_args.outpath+"contig_to_ref.bam")
200 | os.system("samtools index "+ denovo_args.outpath+"contig_to_ref.bam")
201 | chromosomes=denovo_static.get_ref_align_info(denovo_args.outpath,totalcontiglen)
202 | mapping_info=debreak_detect.detect_sam_ref("contig_to_ref.sam",denovo_args.outpath,denovo_args.outpath,denovo_args.min_assembly_error_size,denovo_args.max_assembly_error_size)
203 |
204 | minsupp=1
205 |
206 | allsvsignal=open(denovo_args.outpath+'contig_to_ref.debreak.temp','r').read().split('\n')[:-1]
207 | rawdelcalls={}; rawinscalls={};rawdupcalls={};rawinvcalls={}
208 | for chrom in chromosomes:
209 | rawdelcalls[chrom]=[c.split('\t')[0]+'\t'+c.split('\t')[1]+'\t'+c.split('\t')[2]+'\t'+c.split('\t')[6]+'\t'+c.split('\t')[4] for c in allsvsignal if 'D-' in c and c.split('\t')[0]==chrom]
210 | rawinscalls[chrom]=[c.split('\t')[0]+'\t'+c.split('\t')[1]+'\t'+c.split('\t')[2]+'\t'+c.split('\t')[6]+'\t'+c.split('\t')[4] for c in allsvsignal if 'I-' in c and c.split('\t')[0]==chrom]
211 | rawdupcalls[chrom]=[c.split('\t')[0]+'\t'+c.split('\t')[1]+'\t'+c.split('\t')[2]+'\t'+c.split('\t')[6]+'\t'+c.split('\t')[4] for c in allsvsignal if 'DUP-' in c and c.split('\t')[0]==chrom]
212 | rawinvcalls[chrom]=[c.split('\t')[0]+'\t'+c.split('\t')[1]+'\t'+c.split('\t')[2]+'\t'+c.split('\t')[6]+'\t'+c.split('\t')[4] for c in allsvsignal if 'INV-' in c and c.split('\t')[0]==chrom]
213 | for chrom in chromosomes:
214 | debreak_merge.merge_insertion(minsupp,0,denovo_args.outpath,rawinscalls[chrom],chrom,'ins',True,)
215 | debreak_merge.merge_deletion(minsupp,0,denovo_args.outpath,rawdelcalls[chrom],chrom,'del',True,)
216 | debreak_merge.merge_deletion(minsupp,0,denovo_args.outpath,rawdupcalls[chrom],chrom,'dup',True,)
217 | debreak_merge.merge_insertion(minsupp,0,denovo_args.outpath,rawinvcalls[chrom],chrom,'inv',True,)
218 |
219 | denovo_static.assembly_info_ref(denovo_args.outpath)
220 |
221 | denovo_static.basepair_error_ref(denovo_args.outpath,contiginfo[5])
222 |
223 | t7=time.time()
224 | logf=open(denovo_args.outpath+'Inspector.log','a')
225 | logf.write('TIME: Reference-based mode: '+str(t7-t6)+'\n')
226 | logf.close()
227 |
228 | # Plots
229 | if not denovo_args.noplot:
230 | try:
231 | import denovo_plot
232 | denovo_plot.plot_n100(denovo_args.outpath,denovo_args.min_contig_length)
233 | except:
234 | logf=open(denovo_args.outpath+'Inspector.log','a')
235 | logf.write('Warning: Failed to plot N1_N100.\n')
236 | logf.close()
237 | if denovo_args.ref:
238 | try:
239 | import denovo_plot
240 | denovo_plot.plot_na100(denovo_args.outpath)
241 | denovo_plot.plot_dotplot(denovo_args.outpath)
242 | except:
243 | logf=open(denovo_args.outpath+'Inspector.log','a')
244 | logf.write('Warning: Failed to plot NA1_NA100 and Dotplots.\n')
245 | logf.close()
246 | t8=time.time()
247 | logf=open(denovo_args.outpath+'Inspector.log','a')
248 | logf.write('TIME: Generate plots: '+str(t8-t7)+'\n')
249 | logf.write('Inspector evaluation finished. Bye.\n')
250 | logf.close()
251 |
252 |
253 |
--------------------------------------------------------------------------------
/testdata/read_test.fastq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChongLab/Inspector/0e08f882181cc0e0e0fa749cd87fb74a278ea0f0/testdata/read_test.fastq.gz
--------------------------------------------------------------------------------