├── LICENSE
├── README.md
└── backmap.pl
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Tilman Schell
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # backmap.pl v0.5
2 |
3 | ## Description
4 | __Automatic read mapping and genome size estimation from coverage.__
5 |
6 | Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an assembly with `bwa mem` or `minimap2`, execution of `qualimap bamqc` and estimation of genome size from mapped nucleotides divided by the mode of the coverage distribution (>0). This method was first published in Schell et al. (2017). To show the accuracy and reliability of this method throughout the tree of life, Pfenninger et al. (2021) published a study comparing different estimators. Currently, only the estimator Nbm/m (number of back-mapped bases divided by the modal value of the sequencing depth distribution) is implemented in this script.
7 | The tools `samtools`, `bwa` and/or `minimap2` need to be in your `$PATH`. The tools `qualimap`, `multiqc`, `bedtools` and `Rscript` are optional but needed to create the mapping quality report, the coverage histogram and genome size estimate, and the plot of the coverage distribution, respectively.
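
The Nbm/m estimator described above can be illustrated with a small sketch. It assumes a two-column histogram (sequencing depth, tab, number of positions), which is the layout of the `.cov-hist` files the script writes. The histogram values are made up, and approximating Nbm as the sum of depth × positions is an assumption of this sketch; the script itself may count back-mapped bases differently.

```shell
# Toy coverage histogram: depth <TAB> number of positions (made-up values).
hist=$(mktemp)
printf '0\t50\n1\t10\n2\t30\n3\t100\n4\t50\n' > "$hist"

# m: modal depth, ignoring positions with zero coverage.
mode=$(awk '$1 != 0 && $2 > max { max = $2; m = $1 } END { print m }' "$hist")

# Nbm: back-mapped bases, approximated here as sum(depth * positions).
nbm=$(awk '{ s += $1 * $2 } END { print s }' "$hist")

# Genome size estimate = Nbm / m.
estimate=$(awk -v n="$nbm" -v m="$mode" 'BEGIN { printf "%d", n / m }')

echo "mode=$mode nbm=$nbm estimate=$estimate"
rm -f "$hist"
```

With these toy numbers the modal depth is 3 and the sketch counts 570 mapped bases, giving an estimate of 190 bp. Positions with zero coverage are excluded when searching for the mode, which corresponds to the ">0" in the description.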
8 |
9 | ## Dependencies
10 |
11 | `backmap.pl` needs the following perl modules and will search for executables in your `$PATH`:
12 |
13 | Mandatory:
14 | - [Number::FormatEng](https://metacpan.org/pod/Number::FormatEng)
15 | - [Parallel::Loops](https://metacpan.org/pod/Parallel::Loops)
16 | - [samtools](https://github.com/samtools/samtools): `samtools`
17 |
18 | Short read mapping:
19 | - [bwa (mem)](https://github.com/lh3/bwa): `bwa`
20 |
21 | Long read mapping:
22 | - [minimap2](https://github.com/lh3/minimap2): `minimap2`
23 |
24 | Optional:
25 | - [Qualimap](http://qualimap.bioinfo.cipf.es/): `qualimap`
26 | - [MultiQC](https://multiqc.info/): `multiqc`
27 | - [bedtools](https://bedtools.readthedocs.io/en/latest/): `bedtools`
28 | - [Rscript](https://www.r-project.org/): `Rscript`
29 |
30 | ## Usage
31 |
32 | ```
33 | backmap.pl [-a {-p , | -u } |
34 | -pb | -hifi | -ont } | -b ]
35 |
36 | Mandatory:
37 | -a STR Assembly to map reads to, in fasta format
38 | AND AT LEAST ONE OF
39 | -p STR Two files with paired Illumina reads, comma separated
40 | -u STR Fastq file with unpaired Illumina reads
41 | -pb STR Fasta or fastq file with PacBio CLR reads
42 | -hifi STR Fasta or fastq file with PacBio HiFi reads
43 | -ont STR Fasta or fastq file with Nanopore reads
44 | OR
45 | -b STR Bam file to calculate coverage from
46 | Skips read mapping
47 | Overrides -nh
48 | Technologies will be recognized correctly if filenames end with
49 | .pb(.sort).bam, .hifi(.sort).bam or .ont(.sort).bam for PacBio CLR,
50 | PacBio HiFi and Nanopore respectively. Otherwise they are assumed to
51 | be from Illumina.
52 |
53 | All mandatory options except -a can be specified multiple times
54 |
55 | Options: [default]
56 | -o STR Output directory [.]
57 | Will be created if not existing
58 | -t INT Number of parallel executed processes [1]
59 | Affects bwa mem, samtools sort/index/view/stats, qualimap bamqc
60 | -pre STR Prefix of output files if -a is used [filename of -a]
61 | -sort Sort the bam file(s) (-b) [off]
62 | -nq Do not run qualimap bamqc [off]
63 | -nh Do not create coverage histogram [off]
64 | Implies -ne
65 | -ne Do not estimate genome size [off]
66 | -kt Keep temporary bam files [off]
67 | -bo STR Options passed to bwa [-a -c 10000]
68 | -mo STR Options passed to minimap [CLR: -H -x map-pb; HiFi: minimap<=2.18
69 | -x asm20 minimap>2.18 -x map-hifi; ONT: -x map-ont]
70 | -qo STR Options passed to qualimap [none]
71 | Pass options with quotes e.g. -bo ""
72 | -v Print executed commands to STDERR [off]
73 | -dry-run Only print commands to STDERR instead of executing [off]
74 |
75 | -h or -help Print this help and exit
76 | -version Print version number and exit
77 | ```
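
Two typical invocations, assembled from the options listed above, might look like the following sketch. All file names, the output directory and the prefix are hypothetical placeholders. The commands are only printed here, because running them requires the external tools to be installed.

```shell
# Hypothetical input names; replace them with your own files.
assembly="assembly.fa"
reads="reads_1.fq,reads_2.fq"   # -p takes both mates as one comma separated argument
clr="clr.fq"

# Map paired Illumina and PacBio CLR reads with 8 processes.
# -dry-run only prints the commands backmap.pl would execute.
map_cmd="backmap.pl -a $assembly -p $reads -pb $clr -o backmap_out -t 8 -pre sample -dry-run"

# Start from an existing unsorted bam instead of reads. After sorting, the name
# ends in .hifi.sort.bam, which marks the file as PacBio HiFi.
# -kt is added so that the unsorted input bam is kept.
bam_cmd="backmap.pl -b sample.hifi.bam -sort -kt -o backmap_out -t 8"

echo "$map_cmd"
echo "$bam_cmd"
```

Two details of v0.5 as written are worth knowing. The argument checks run before any command is printed, so the read files have to exist even with `-dry-run`. With `-b -sort` and without `-kt`, the unsorted input bam appears to be removed together with the other temporary bam files.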
78 |
79 | ## Citation
80 | Pfenninger M, Schönenbeck P & Schell T (2021). ModEst: Accurate estimation of genome size from next generation sequencing data. _Molecular ecology resources_, 00, 1–11.
81 |
82 | Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for _Radix auricularia_ (Gastropoda, Mollusca). _Genome Biology and Evolution_, 9(3):585–592.
83 |
84 | __If you use this tool please cite the dependencies as well:__
85 |
86 | - samtools:
87 | Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009). The Sequence Alignment/Map format and SAMtools. _Bioinformatics_, 25(16):2078–2079.
88 | - bwa mem:
89 | Li H (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. _arXiv preprint arXiv:1303.3997_.
90 | - minimap2:
91 | Li H (2018). Minimap2: pairwise alignment for nucleotide sequences. _Bioinformatics_, 34:3094–3100.
92 | - Qualimap:
93 | Okonechnikov K, Conesa A, García-Alcalde F (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. _Bioinformatics_, 32(2):292–294.
94 | - MultiQC:
95 | Ewels P, Magnusson M, Lundin S, Käller M (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. _Bioinformatics_, 32(19):3047–3048.
96 | - bedtools:
97 | Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. _Bioinformatics_, 26(6):841–842.
98 | - Rscript:
99 | R Core Team (2021). R: A Language and Environment for Statistical Computing.
100 |
--------------------------------------------------------------------------------
/backmap.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env perl
2 |
3 | use strict;
4 | use warnings;
5 | use Cwd 'abs_path';
6 | use IPC::Cmd qw[can_run run];
7 | use Number::FormatEng qw(:all);
8 | use Parallel::Loops;
9 |
10 | my $version = "0.5";
11 |
12 | sub print_help{
13 | print STDOUT "\n";
14 | print STDOUT "backmap.pl v$version\n";
15 | print STDOUT "\n";
16 | print STDOUT "Description:\n";
17 | print STDOUT "\tAutomatic mapping of paired, unpaired, PacBio and Nanopore reads to an\n\tassembly, execution of qualimap bamqc, multiqc and estimation of genome size\n\tfrom mapped nucleotides and peak coverage.\n\tThe tools bwa, minimap2, samtools, qualimap, multiqc, bedtools and Rscript need to be\n\tin your \$PATH.\n";
18 | print STDOUT "\n";
19 | print STDOUT "Usage:\n";
20 | print STDOUT "\tbackmap.pl [-a {-p , | -u |\n";
21 | print STDOUT "\t -pb | -hifi | -ont } | -b ]\n";
22 | print STDOUT "\n";
23 | print STDOUT "Mandatory:\n";
24 | print STDOUT "\t-a STR\t\tAssembly to map reads to, in fasta format\n";
25 | print STDOUT "\tAND AT LEAST ONE OF\n";
26 | print STDOUT "\t-p STR\t\tTwo fastq files with paired Illumina reads, comma separated\n";
27 | print STDOUT "\t-u STR\t\tFastq file with unpaired Illumina reads\n";
28 | print STDOUT "\t-pb STR\t\tFasta or fastq file with PacBio CLR reads\n";
29 | print STDOUT "\t-hifi STR\tFasta or fastq file with PacBio HiFi reads\n";
30 | print STDOUT "\t-ont STR\tFasta or fastq file with Nanopore reads\n";
31 | print STDOUT "\tOR\n";
32 | print STDOUT "\t-b STR\t\tBam file to calculate coverage from\n";
33 | print STDOUT "\t\t\tSkips read mapping\n";
34 | print STDOUT "\t\t\tOverrides -nh\n";
35 | print STDOUT "\t\t\tTechnologies will be recognized correctly if filenames end with\n\t\t\t.pb(.sort).bam, .hifi(.sort).bam or .ont(.sort).bam for PacBio CLR,\n\t\t\tPacBio HiFi and Nanopore respectively. Otherwise they are assumed to\n\t\t\tbe from Illumina.\n";
36 | print STDOUT "\n";
37 | print STDOUT "\tAll mandatory options except -a can be specified multiple times\n";
38 | print STDOUT "\n";
39 | print STDOUT "Options: [default]\n";
40 | print STDOUT "\t-o STR\t\tOutput directory [.]\n";
41 | print STDOUT "\t\t\tWill be created if not existing\n";
42 | print STDOUT "\t-t INT\t\tNumber of parallel executed processes [1]\n";
43 | print STDOUT "\t\t\tAffects bwa mem, samtools sort/index/view/stats, qualimap bamqc\n";
44 | print STDOUT "\t-pre STR\tPrefix of output files if -a is used [filename of -a]\n";
45 | print STDOUT "\t-sort\t\tSort the bam file(s) (-b) [off]\n";
46 | print STDOUT "\t-nq\t\tDo not run qualimap bamqc [off]\n";
47 | print STDOUT "\t-nh\t\tDo not create coverage histogram [off]\n";
48 | print STDOUT "\t\t\tImplies -ne\n";
49 | print STDOUT "\t-ne\t\tDo not estimate genome size [off]\n";
50 | print STDOUT "\t-kt\t\tKeep temporary bam files [off]\n";
51 | print STDOUT "\t-bo STR\t\tOptions passed to bwa [-a -c 10000]\n";
52 | print STDOUT "\t-mo STR\t\tOptions passed to minimap [CLR: -H -x map-pb; HiFi: minimap<=2.18\n\t\t\t-x asm20 minimap>2.18 -x map-hifi; ONT: -x map-ont]\n";
53 | print STDOUT "\t-qo STR\t\tOptions passed to qualimap [none]\n";
54 | print STDOUT "\tPass options with quotes e.g. -bo \"\"\n";
55 | print STDOUT "\t-v\t\tPrint executed commands to STDERR [off]\n";
56 | print STDOUT "\t-dry-run\tOnly print commands to STDERR instead of executing [off]\n";
57 | print STDOUT "\n";
58 | print STDOUT "\t-h or -help\tPrint this help and exit\n";
59 | print STDOUT "\t-version\tPrint version number and exit\n";
60 | exit;
61 | }
62 |
63 | sub exe_cmd{
64 | my ($cmd,$verbose,$dry) = @_;
65 | if($verbose == 1){
66 | print STDERR "CMD\t$cmd\n";
67 | }
68 | if($dry == 0){
69 | system("$cmd") == 0 or die "ERROR\tsystem $cmd failed: $?";
70 | }
71 | }
72 |
73 | sub round_format_pref{
74 | my ($number) = @_;
75 | $number = format_pref($number);
76 | my $format_number = substr($number,0,-1);
77 | $format_number = sprintf("%.2f", $format_number);
78 | my $pref = substr($number,-1);
79 | my $return = $format_number . $pref;
80 | return $return;
81 | }
82 |
83 | my $out_dir = abs_path("./");
84 | my $assembly_path = "";
85 | my $assembly = "";
86 | my @paired = ();
87 | my @unpaired = ();
88 | my @pb = ();
89 | my @hifi = ();
90 | my @ont = ();
91 | my $threads = 1;
92 | my $prefix = "";
93 | my $verbose = 0;
94 | my $bwa_opts = "-a -c 10000 ";
95 | my $minimap_opts = "";
96 | my $qm_opts = "";
97 | my $create_histo_switch = 1;
98 | my $estimate_genome_size_switch = 1;
99 | my $run_bamqc_switch = 1;
100 | my $keep_tmp = 0;
101 | my $dry = 0;
102 | my @bam = ();
103 | my $sort_bam_switch = 0;
104 | my $cmd;
105 | my $mkdir_cmd;
106 | my $home = `echo \$HOME`;
107 | chomp $home;
108 |
109 | my $input_error = 0;
110 |
111 | if(scalar(@ARGV) == 0){
112 | print_help;
113 | }
114 |
115 | for (my $i = 0; $i < scalar(@ARGV);$i++){
116 | if ($ARGV[$i] eq "-o"){
117 | $out_dir = abs_path($ARGV[$i+1]);
118 | }
119 | if ($ARGV[$i] eq "-a"){
120 | if($assembly ne ""){
121 | print STDERR "ERROR\tSpecify -a just once\n";
122 | $input_error = 1;
123 | }
124 | $assembly = (split /\//,$ARGV[$i+1])[-1];
125 | $assembly_path = abs_path($ARGV[$i+1]);
126 | }
127 | if ($ARGV[$i] eq "-p"){
128 | push(@paired,$ARGV[$i+1]);
129 | }
130 | if ($ARGV[$i] eq "-u"){
131 | push(@unpaired,$ARGV[$i+1]);
132 | }
133 | if ($ARGV[$i] eq "-pb"){
134 | push(@pb,$ARGV[$i+1]);
135 | }
136 | if ($ARGV[$i] eq "-hifi"){
137 | push(@hifi,$ARGV[$i+1]);
138 | }
139 | if ($ARGV[$i] eq "-ont"){
140 | push(@ont,$ARGV[$i+1]);
141 | }
142 | if ($ARGV[$i] eq "-b"){
143 | push(@bam,$ARGV[$i+1]);
144 | }
145 | if ($ARGV[$i] eq "-sort"){
146 | $sort_bam_switch = 1;
147 | }
148 | if ($ARGV[$i] eq "-t"){
149 | $threads = $ARGV[$i+1];
150 | }
151 | if ($ARGV[$i] eq "-pre"){
152 | $prefix = $ARGV[$i+1];
153 | }
154 | if ($ARGV[$i] eq "-v"){
155 | $verbose = 1;
156 | }
157 | if ($ARGV[$i] eq "-bo"){
158 | $bwa_opts = $ARGV[$i+1] . " "; #nonsense flags are skipped from bwa
159 | $ARGV[$i+1] = "\'$ARGV[$i+1]\'";
160 | }
161 | if ($ARGV[$i] eq "-mo"){
162 | $minimap_opts = $ARGV[$i+1] . " ";
163 | $ARGV[$i+1] = "\'$ARGV[$i+1]\'";
164 | }
165 | if ($ARGV[$i] eq "-qo"){
166 | $qm_opts = $ARGV[$i+1] . " "; #nonsense flags are skipped from qualimap
167 | $ARGV[$i+1] = "\'$ARGV[$i+1]\'";
168 | }
169 | if ($ARGV[$i] eq "-nq"){
170 | $run_bamqc_switch = 0;
171 | }
172 | if ($ARGV[$i] eq "-nh"){
173 | $create_histo_switch = 0;
174 | $estimate_genome_size_switch = 0;
175 | }
176 | if ($ARGV[$i] eq "-ne"){
177 | $estimate_genome_size_switch = 0;
178 | }
179 | if($ARGV[$i] eq "-kt"){
180 | $keep_tmp = 1;
181 | }
182 | if ($ARGV[$i] eq "-dry-run"){
183 | $dry = 1;
184 | $verbose = 1;
185 | }
186 | if ($ARGV[$i] eq "-h" or $ARGV[$i] eq "-help"){
187 | print_help;
188 | }
189 | if ($ARGV[$i] eq "-version"){
190 | print STDERR $version . "\n";
191 | exit;
192 | }
193 | }
194 |
195 | print STDERR "CMD\t" . $0 . " " . join(" ",@ARGV) . "\n";
196 |
197 | if($assembly_path ne "" and scalar(@bam) > 0){
198 | print STDERR "ERROR\tSpecify either -a or -b\n";
199 | $input_error = 1;
200 | }
201 |
202 | if($assembly_path eq "" and scalar(@bam) == 0){
203 | print STDERR "ERROR\tSpecify either -a or -b\n";
204 | $input_error = 1;
205 | }
206 |
207 | if($assembly_path ne "" and scalar(@bam) == 0){
208 | if(scalar(@paired) > 0 or scalar(@unpaired) > 0){
209 | if(not defined(can_run("bwa"))){
210 | print STDERR "ERROR\tbwa is not in your \$PATH\n";
211 | $input_error = 1;
212 | }
213 | }
214 | if(scalar(@pb) > 0 or scalar(@hifi) > 0 or scalar(@ont) > 0){
215 | if(not defined(can_run("minimap2"))){
216 | print STDERR "ERROR\tminimap2 is not in your \$PATH\n";
217 | $input_error = 1;
218 | }
219 | }
220 | if(not defined(can_run("samtools"))){
221 | print STDERR "ERROR\tsamtools is not in your \$PATH\n";
222 | $input_error = 1;
223 | }
224 | }
225 |
226 | if(not defined(can_run("qualimap")) and $run_bamqc_switch == 1){
227 | print STDERR "INFO\tqualimap is not in your \$PATH and will not be executed\n";
228 | }
229 |
230 | if(not defined(can_run("bedtools"))){
231 | print STDERR "INFO\tbedtools is not in your \$PATH\n";
232 | if($estimate_genome_size_switch == 1){
233 | print STDERR "WARNING\tGenome size estimation not possible\n";
234 | $estimate_genome_size_switch = 0;
235 | }
236 | if(scalar(@bam) > 0){
237 | print STDERR "ERROR\tGenome size estimation not possible\n";
238 | $input_error = 1;
239 | }
240 | }
241 |
242 | if(not defined(can_run("Rscript"))){
243 | print STDERR "INFO\tRscript is not in your \$PATH\n";
244 | if($create_histo_switch == 1){
245 | print STDERR "WARNING\tPlotting not possible\n";
246 | }
247 | }
248 |
249 | if(-f "$out_dir"){
250 | print STDERR "ERROR\tOutput directory $out_dir is already a file!\n";
251 | $input_error = 1;
252 | }
253 |
254 | if($assembly_path ne ""){
255 | if(not -f $assembly_path){
256 | print STDERR "ERROR\tFile $assembly_path does not exist!\n";
257 | $input_error = 1;
258 | }
259 | if(scalar(@paired) == 0 and scalar(@unpaired) == 0 and scalar(@pb) == 0 and scalar(@hifi) == 0 and scalar(@ont) == 0){
260 | print STDERR "ERROR\tNo reads specified!\n";
261 | $input_error = 1;
262 | }
263 | }
264 |
265 | if($threads !~ m/^\d+$/ or $threads < 1){
266 | print STDERR "ERROR\tNumber of threads (-t) must be an integer >= 1!\n";
267 | $input_error = 1;
268 | }
269 |
270 | if ($input_error == 1){
271 | print STDERR "ERROR\tInput error detected!\n";
272 | exit 1;
273 | }
274 |
275 | my $samtools_threads = $threads - 1;
276 |
277 | if(not -d "$out_dir"){
278 | print STDERR "INFO\tCreating output directory $out_dir\n";
279 | $mkdir_cmd="mkdir -p $out_dir";
280 | }
281 |
282 | if($prefix eq ""){
283 | if($assembly ne ""){
284 | $prefix = $assembly;
285 | print STDERR "INFO\tSetting prefix to $prefix\n";
286 | }
287 | else{
288 | print STDERR "INFO\tSetting prefix to name(s) of corresponding bam\n";
289 | }
290 | }
291 |
292 | if(scalar(@bam) > 0){
293 | if(defined(can_run("bedtools"))){
294 | if($create_histo_switch == 0 or $estimate_genome_size_switch == 0){
295 | $create_histo_switch = 1;
296 | print STDERR "INFO\tCreating coverage histogram\n";
297 | }
298 | }
299 | }
300 |
301 | my %paired_filter;
302 |
303 | if($assembly_path ne ""){
304 | foreach(@paired){
305 | my @pair = split(/,/,$_);
306 | foreach(@pair){
307 | if($_ =~ m/^~/){
308 | $_ =~ s/^~/$home/; #bash expands ~ to $HOME only at the start of a word. The second file of a comma separated pair is not expanded by the shell, so a leading "~" is replaced here
309 | }
310 | }
311 | if(scalar(@pair) != 2){
312 | print STDERR "INFO\tNot a pair: $_ - skipping these file(s)\n";
313 | }
314 | else{
315 | my $file_error = 0;
316 | if(not -f "$pair[0]"){
317 | print STDERR "INFO\tNo file $pair[0] - skipping pair $_\n";
318 | $file_error = 1;
319 | }
320 | if(not -f "$pair[1]"){
321 | print STDERR "INFO\tNo file $pair[1] - skipping pair $_\n";
322 | $file_error = 1;
323 | }
324 | if($file_error == 0){
325 | if(exists($paired_filter{abs_path($pair[0]) . "," . abs_path($pair[1])})){
326 | print STDERR "INFO\tPair " . abs_path($pair[0]) . "," . abs_path($pair[1]) . " already specified\n";
327 | }
328 | else{
329 | $paired_filter{abs_path($pair[0]) . "," . abs_path($pair[1])} = 1;
330 | }
331 | }
332 | }
333 | }
334 | }
335 |
336 | my %unpaired_filter;
337 |
338 | if($assembly_path ne ""){
339 | foreach(@unpaired){
340 | if(not -f "$_"){
341 | print STDERR "INFO\tNo file $_ - skipping this file\n";
342 | }
343 | else{
344 | if(exists($unpaired_filter{abs_path($_)})){
345 | print STDERR "INFO\tFile " . abs_path($_) . " already specified\n";
346 | }
347 | else{
348 | $unpaired_filter{abs_path($_)} = 1;
349 | }
350 | }
351 | }
352 | }
353 |
354 | my %pb_filter;
355 |
356 | if($assembly_path ne ""){
357 | foreach(@pb){
358 | if(not -f "$_"){
359 | print STDERR "INFO\tNo file $_ - skipping this file\n";
360 | }
361 | else{
362 | if(exists($pb_filter{abs_path($_)})){
363 | print STDERR "INFO\tFile " . abs_path($_) . " already specified\n";
364 | }
365 | else{
366 | $pb_filter{abs_path($_)} = 1;
367 | }
368 | }
369 | }
370 | }
371 |
372 | my %hifi_filter;
373 |
374 | if($assembly_path ne ""){
375 | foreach(@hifi){
376 | if(not -f "$_"){
377 | print STDERR "INFO\tNo file $_ - skipping this file\n";
378 | }
379 | else{
380 | if(exists($hifi_filter{abs_path($_)})){
381 | print STDERR "INFO\tFile " . abs_path($_) . " already specified\n";
382 | }
383 | else{
384 | $hifi_filter{abs_path($_)} = 1;
385 | }
386 | }
387 | }
388 | }
389 |
390 | my %ont_filter;
391 |
392 | if($assembly_path ne ""){
393 | foreach(@ont){
394 | if(not -f "$_"){
395 | print STDERR "INFO\tNo file $_ - skipping this file\n";
396 | }
397 | else{
398 | if(exists($ont_filter{abs_path($_)})){
399 | print STDERR "INFO\tFile " . abs_path($_) . " already specified\n";
400 | }
401 | else{
402 | $ont_filter{abs_path($_)} = 1;
403 | }
404 | }
405 | }
406 | }
407 |
408 | my %bam_filter;
409 |
410 | if(scalar(@bam) > 0){
411 | foreach(@bam){
412 | if(not -f "$_"){
413 | print STDERR "INFO\tNo file $_ - skipping this file\n";
414 | }
415 | else{
416 | if(exists($bam_filter{abs_path($_)})){
417 | print STDERR "INFO\tFile " . abs_path($_) . " already specified\n";
418 | }
419 | else{
420 | $bam_filter{abs_path($_)} = 1;
421 | }
422 | }
423 | }
424 | }
425 |
426 | if($assembly_path ne ""){
427 | if(scalar(keys(%paired_filter)) == 0 and scalar(keys(%unpaired_filter)) == 0 and scalar(keys(%pb_filter)) == 0 and scalar(keys(%hifi_filter)) == 0 and scalar(keys(%ont_filter)) == 0){
428 | print STDERR "ERROR\tNo existing read files specified!\n";
429 | exit 1;
430 | }
431 | }
432 | else{
433 | if(scalar(keys(%bam_filter)) == 0){
434 | print STDERR "ERROR\tNo existing bam files specified!\n";
435 | exit 1;
436 | }
437 | }
438 |
439 | my $index_path = $out_dir . "/" . $assembly;
440 |
441 | if($assembly_path ne ""){
442 | if(scalar(keys(%paired_filter)) > 0 or scalar(keys(%unpaired_filter)) > 0){
443 | if(not -f $assembly_path . ".amb" or not -f $assembly_path . ".ann" or not -f $assembly_path . ".bwt" or not -f $assembly_path . ".pac" or not -f $assembly_path . ".sa"){
444 | $cmd = "bwa index -p $index_path $assembly_path > $out_dir/$prefix\_bwa_index.log 2> $out_dir/$prefix\_bwa_index.err";
445 | }
446 | else{
447 | print STDERR "INFO\tIndex files already existing\n";
448 | $index_path = $assembly_path;
449 | }
450 | }
451 | }
452 |
453 | my $bwa_version;
454 | if(not defined(can_run("bwa"))){
455 | $bwa_version = "not detected";
456 | }
457 | else{
458 | $bwa_version = `bwa 2>&1 | head -3 | tail -1 | sed 's/^Version: //'`;
459 | chomp $bwa_version;
460 | }
461 |
462 | my $minimap_version;
463 | my $minimap_minor_version;
464 | if(not defined(can_run("minimap2"))){
465 | $minimap_version = "not detected";
466 | }
467 | else{
468 | $minimap_version = `minimap2 --version`;
469 | chomp $minimap_version;
470 | $minimap_minor_version = $minimap_version;
471 | $minimap_minor_version =~ s/-.*//;
472 | $minimap_minor_version =~ s/^.*\.//;
473 | }
474 |
475 | my $samtools_version = `samtools --version | head -1 | sed 's/^samtools //'`;
476 | chomp $samtools_version;
477 |
478 | my $qualimap_version;
479 | if(not defined(can_run("qualimap"))){
480 | $qualimap_version = "not detected";
481 | }
482 | else{
483 | $qualimap_version = `qualimap bamqc 2> /dev/null | head -4 | tail -1 | sed 's/^QualiMap v.//'`;
484 | chomp $qualimap_version;
485 | }
486 |
487 | my $bedtools_version;
488 | if(not defined(can_run("bedtools"))){
489 | $bedtools_version = "not detected";
490 | }
491 | else{
492 | $bedtools_version = `bedtools --version | awk '{print \$2}' | sed 's/^v//'`;
493 | chomp $bedtools_version;
494 | }
495 |
496 | my $rscript_version;
497 | if(not defined(can_run("Rscript"))){
498 | $rscript_version = "not detected";
499 | }
500 | else{
501 | $rscript_version = `Rscript --version 2>&1 | sed 's/^R scripting front-end version //;s/ .*//'`;
502 | chomp $rscript_version;
503 | }
504 |
505 | my $multiqc_version;
506 | if(not defined(can_run("multiqc"))){
507 | $multiqc_version = "not detected";
508 | }
509 | else{
510 | $multiqc_version = `multiqc --version 2> /dev/null | awk '{print \$NF}'`;
511 | chomp $multiqc_version;
512 | }
513 |
514 | my $verbose_word = "No";
515 | if($verbose == 1){
516 | $verbose_word = "Yes";
517 | }
518 |
519 | my $keep_tmp_word = "No";
520 | if($keep_tmp == 1){
521 | $keep_tmp_word = "Yes";
522 | }
523 |
524 | my $run_bamqc_switch_word = "Yes";
525 | if($run_bamqc_switch == 0){
526 | $run_bamqc_switch_word = "No";
527 | }
528 |
529 | my $create_histo_switch_word = "Yes";
530 | if($create_histo_switch == 0){
531 | $create_histo_switch_word = "No";
532 | }
533 |
534 | my $estimate_genome_size_switch_word = "Yes";
535 | if($estimate_genome_size_switch == 0){
536 | $estimate_genome_size_switch_word = "No";
537 | }
538 |
539 | print "\n";
540 | print "backmap.pl v$version\n";
541 | print "\n";
542 | print "Detected tools\n";
543 | print "==============\n";
544 | print "bwa: " . $bwa_version . "\n";
545 | print "minimap2: " . $minimap_version . "\n";
546 | print "samtools: " . $samtools_version . "\n";
547 | print "qualimap: " . $qualimap_version . "\n";
548 | print "bedtools: " . $bedtools_version . "\n";
549 | print "Rscript: " . $rscript_version . "\n";
550 | print "multiqc: " . $multiqc_version . "\n";
551 | print "\n";
552 | print "User defined input\n";
553 | print "==================\n";
554 | print "Output directory: " . $out_dir . "\n";
555 | if($assembly_path ne ""){
556 | print "Assembly: " . $assembly_path . "\n";
557 | print "Paired reads: ";
558 | print join("\n ",keys(%paired_filter)) . "\n";
559 | print "Unpaired reads: ";
560 | print join("\n ",keys(%unpaired_filter)) . "\n";
561 | print "PacBio reads: ";
562 | print join("\n ",keys(%pb_filter)) . "\n";
563 | print "Nanopore reads: ";
564 | print join("\n ",keys(%ont_filter)) . "\n";
565 | }
566 | if(scalar(keys(%bam_filter)) > 0){
567 | print "Bam files: ";
568 | print join("\n ",keys(%bam_filter)) . "\n";
569 | }
570 | print "Number of threads: " . $threads . "\n";
571 | print "Output file prefix: " . $prefix . "\n";
572 | print "Verbose: " . $verbose_word . "\n";
573 | print "Keep temporary files: " . $keep_tmp_word . "\n";
574 | if($assembly_path ne ""){
575 | print "bwa options: " . $bwa_opts . "\n";
576 | print "minimap2 options: " . $minimap_opts . "\n";
577 | }
578 | print "Run qualimap bamqc: " . $run_bamqc_switch_word . "\n";
579 | if($run_bamqc_switch == 1){
580 | print "qualimap options: " . $qm_opts . "\n";
581 | }
582 | print "Create cov histo: " . $create_histo_switch_word . "\n";
583 | print "Estimate genome size: " . $estimate_genome_size_switch_word . "\n";
584 |
585 | if(defined $mkdir_cmd){
586 | exe_cmd($mkdir_cmd,$verbose,$dry);
587 | }
588 |
589 | if(defined $cmd){
590 | exe_cmd($cmd,$verbose,$dry);
591 | }
592 |
593 | #mapping is executed if there are entries in %paired_filter or %unpaired_filter
594 | #this only happens if $assembly_path is set
595 | my $paired_counter = 0;
596 | my @paired_bam = ();
597 | foreach(keys(%paired_filter)){
598 | $paired_counter++;
599 | my ($for,$rev) = split(/,/,$_);
600 | $cmd = "bwa mem -t $threads $bwa_opts$index_path $for $rev 2> $out_dir/$prefix\_bwa_mem_paired$paired_counter.err | samtools view -1 -b - > $out_dir/$prefix.paired$paired_counter.bam";
601 | exe_cmd($cmd,$verbose,$dry);
602 | push(@paired_bam,"$out_dir/$prefix.paired$paired_counter.bam")
603 | }
604 |
605 | my $unpaired_counter = 0;
606 | my @unpaired_bam = ();
607 | foreach(keys(%unpaired_filter)){
608 | $unpaired_counter++;
609 | $cmd = "bwa mem -t $threads $bwa_opts$index_path $_ 2> $out_dir/$prefix\_bwa_mem_unpaired$unpaired_counter.err | samtools view -1 -b - > $out_dir/$prefix.unpaired$unpaired_counter.bam";
610 | exe_cmd($cmd,$verbose,$dry);
611 | push(@unpaired_bam,"$out_dir/$prefix.unpaired$unpaired_counter.bam");
612 | }
613 |
614 | my $pb_counter = 0;
615 | my @pb_bam = ();
616 | foreach(keys(%pb_filter)){
617 | $pb_counter++;
618 | $cmd = "minimap2 $minimap_opts-H -x map-pb -a -t $threads $assembly_path $_ 2> $out_dir/$prefix\_minimap_pb$pb_counter.err | samtools view -1 -b - > $out_dir/$prefix.pb$pb_counter.bam";
619 | exe_cmd($cmd,$verbose,$dry);
620 | push(@pb_bam,"$out_dir/$prefix.pb$pb_counter.bam");
621 | }
622 |
623 | my $hifi_counter = 0;
624 | my @hifi_bam = ();
625 | foreach(keys(%hifi_filter)){
626 | $hifi_counter++;
627 | if($minimap_minor_version <= 18){
628 | $cmd = "minimap2 $minimap_opts-x asm20 -a -t $threads $assembly_path $_ 2> $out_dir/$prefix\_minimap_hifi$hifi_counter.err | samtools view -1 -b - > $out_dir/$prefix.hifi$hifi_counter.bam";
629 | }
630 | else{
631 | $cmd = "minimap2 $minimap_opts-x map-hifi -a -t $threads $assembly_path $_ 2> $out_dir/$prefix\_minimap_hifi$hifi_counter.err | samtools view -1 -b - > $out_dir/$prefix.hifi$hifi_counter.bam";
632 | }
633 | exe_cmd($cmd,$verbose,$dry);
634 | push(@hifi_bam,"$out_dir/$prefix.hifi$hifi_counter.bam");
635 | }
636 |
637 | my $ont_counter = 0;
638 | my @ont_bam = ();
639 | foreach(keys(%ont_filter)){
640 | $ont_counter++;
641 | $cmd = "minimap2 $minimap_opts-x map-ont -a -t $threads $assembly_path $_ 2> $out_dir/$prefix\_minimap_ont$ont_counter.err| samtools view -1 -b - > $out_dir/$prefix.ont$ont_counter.bam";
642 | exe_cmd($cmd,$verbose,$dry);
643 | push(@ont_bam,"$out_dir/$prefix.ont$ont_counter.bam");
644 | }
645 |
646 | my @merged_bam_file = ();
647 | if($assembly_path ne ""){
648 | my $ill_bam_count = scalar(@paired_bam) + scalar(@unpaired_bam);
649 |
650 | my $paired_bam_files = join(" ",@paired_bam);
651 | my $unpaired_bam_files = join(" ",@unpaired_bam);
652 | my $pb_bam_files = join(" ",@pb_bam);
653 | my $hifi_bam_files = join(" ",@hifi_bam);
654 | my $ont_bam_files = join(" ",@ont_bam);
655 |
656 | if($ill_bam_count > 0){
657 | if($ill_bam_count == 1){
658 | my $single_bam = join(" ",@paired_bam,@unpaired_bam);
659 | $single_bam =~ s/^\s+//;
660 | $single_bam =~ s/\s+$//;
661 | $cmd = "ln -fs $single_bam $out_dir/$prefix.bam";
662 | exe_cmd($cmd,$verbose,$dry);
663 | push(@merged_bam_file, "$out_dir/$prefix.bam");
664 | }
665 | else{
666 | $cmd = "samtools merge -@ $samtools_threads $out_dir/$prefix.bam $paired_bam_files $unpaired_bam_files";
667 | exe_cmd($cmd,$verbose,$dry);
668 | push(@merged_bam_file, "$out_dir/$prefix.bam");
669 | }
670 | }
671 |
672 | if(scalar(@pb_bam) > 0){
673 | if(scalar(@pb_bam) == 1){
674 | my $single_bam = $pb_bam[0];
675 | $cmd = "ln -fs $single_bam $out_dir/$prefix.pb.bam";
676 | exe_cmd($cmd,$verbose,$dry);
677 | push(@merged_bam_file, "$out_dir/$prefix.pb.bam");
678 | }
679 | else{
680 | $cmd = "samtools merge -@ $samtools_threads $out_dir/$prefix.pb.bam $pb_bam_files";
681 | exe_cmd($cmd,$verbose,$dry);
682 | push(@merged_bam_file, "$out_dir/$prefix.pb.bam");
683 | }
684 | }
685 |
686 | if(scalar(@hifi_bam) > 0){
687 | if(scalar(@hifi_bam) == 1){
688 | my $single_bam = $hifi_bam[0];
689 | $cmd = "ln -fs $single_bam $out_dir/$prefix.hifi.bam";
690 | exe_cmd($cmd,$verbose,$dry);
691 | push(@merged_bam_file, "$out_dir/$prefix.hifi.bam");
692 | }
693 | else{
694 | $cmd = "samtools merge -@ $samtools_threads $out_dir/$prefix.hifi.bam $hifi_bam_files";
695 | exe_cmd($cmd,$verbose,$dry);
696 | push(@merged_bam_file, "$out_dir/$prefix.hifi.bam");
697 | }
698 | }
699 |
700 | if(scalar(@ont_bam) > 0){
701 | if(scalar(@ont_bam) == 1){
702 | my $single_bam = $ont_bam[0];
703 | $cmd = "ln -fs $single_bam $out_dir/$prefix.ont.bam";
704 | exe_cmd($cmd,$verbose,$dry);
705 | push(@merged_bam_file, "$out_dir/$prefix.ont.bam");
706 | }
707 | else{
708 | $cmd = "samtools merge -@ $samtools_threads $out_dir/$prefix.ont.bam $ont_bam_files";
709 | exe_cmd($cmd,$verbose,$dry);
710 | push(@merged_bam_file, "$out_dir/$prefix.ont.bam");
711 | }
712 | }
713 |
714 | }
715 |
716 | my @sorted_bams;
717 | if(scalar(keys(%bam_filter)) > 0){
718 | if($sort_bam_switch == 0){
719 | @sorted_bams = keys(%bam_filter);
720 | }
721 | if($sort_bam_switch == 1){
722 | @merged_bam_file = keys(%bam_filter);
723 | }
724 | }
725 |
726 | if($assembly_path ne "" or $sort_bam_switch == 1){
727 | foreach(@merged_bam_file){
728 | my $sorted_bam_file = $_;
729 | $sorted_bam_file =~ s/\.bam$/\.sort\.bam/;
730 | $cmd = "samtools sort -l 9 -@ $samtools_threads -T $out_dir/$prefix -o $sorted_bam_file $_";
731 | exe_cmd($cmd,$verbose,$dry);
732 | push(@sorted_bams,$sorted_bam_file);
733 | }
734 |
735 | if($keep_tmp == 0){
736 | my $tmp_bams = join(" ",@paired_bam,@unpaired_bam,@pb_bam,@hifi_bam,@ont_bam,@merged_bam_file);
737 | $cmd = "rm $tmp_bams";
738 | exe_cmd($cmd,$verbose,$dry);
739 | }
740 | }
741 |
742 | foreach(@sorted_bams){
743 | $cmd = "samtools index -@ $samtools_threads $_";
744 | exe_cmd($cmd,$verbose,$dry);
745 | }
746 |
747 | if($run_bamqc_switch == 1){
748 | if(defined(can_run("qualimap"))){
749 | foreach(@sorted_bams){
750 | my $bamqc_out = $_;
751 | $bamqc_out = (split(/\//,$bamqc_out))[-1];
752 | $bamqc_out =~ s/\.bam$/_stats/;
753 | $cmd = "qualimap bamqc $qm_opts-bam $_ -nt $threads -outdir $out_dir/$bamqc_out > $out_dir/$bamqc_out\_bamqc.log 2> $out_dir/$bamqc_out\_bamqc.err";
754 | exe_cmd($cmd,$verbose,$dry);
755 | }
756 | if(defined(can_run("multiqc")) and scalar(@sorted_bams) > 1){
757 | $cmd = "multiqc -s -m qualimap -o $out_dir $out_dir > $out_dir/multiqc.log 2> $out_dir/multiqc.err";
758 | exe_cmd($cmd,$verbose,$dry);
759 | }
760 | }
761 | }
762 |
763 | my %cov_files;
764 | my %peak_cov;
765 | my %n0_all;
766 | my @global_ymax = ();
767 |
768 | my $maxProcs = scalar(@sorted_bams);
769 | if($threads < $maxProcs){
770 | $maxProcs = $threads;
771 | }
772 |
773 | if($create_histo_switch == 1){
774 |
775 | foreach(@sorted_bams){
776 | my $filename = (split(/\//,$_))[-1];
777 | my $cov_hist_file = "$out_dir/$filename.cov-hist";
778 | $cmd = "samtools view -@ $samtools_threads -b -h -F 256 $_ | bedtools genomecov -ibam stdin -d | awk \'{print \$3}\' | sort -g | uniq -c | awk '{print \$2\"\\t\"\$1}' > $cov_hist_file";
779 | exe_cmd($cmd,$verbose,$dry);
780 | }
781 |
782 | my $multiple_histos = Parallel::Loops->new($maxProcs);
783 | $multiple_histos->share(\%cov_files);
784 | $multiple_histos->share(\%peak_cov);
785 | $multiple_histos->share(\%n0_all);
786 | $multiple_histos->share(\@global_ymax);
787 |
788 | $multiple_histos->foreach( \@sorted_bams, sub {
789 |
790 | my $tech = "Illumina";
791 | if($_ =~ m/\.pb\.sort\.bam$/){
792 | $tech = "CLR";
793 | }
794 | if($_ =~ m/\.hifi\.sort\.bam$/){
795 | $tech = "HiFi";
796 | }
797 | if($_ =~ m/\.ont\.sort\.bam$/){
798 | $tech = "Nanopore";
799 | }
800 |
801 | my $filename = (split(/\//,$_))[-1];
802 | my $cov_hist_file = "$out_dir/$filename.cov-hist";
803 |
804 | $cov_files{$tech} = $cov_hist_file;
805 |
806 | if($dry == 0){
807 | my $peak = `sort -rgk2 $cov_hist_file | awk \'\$1!=0{print \$1}\' | head -1`;
808 | chomp $peak;
809 | $peak_cov{$tech} = $peak;
810 | my $n0 = `awk \'\$1==0{print \$2}\' $cov_hist_file`;
811 | chomp $n0;
812 | $n0_all{$tech} = $n0;
813 | my $ymax = `sort -rgk2 $cov_hist_file | awk \'\$1!=0{print \$2}\' | head -1`;
814 | chomp $ymax;
815 | push(@global_ymax,$ymax);
816 |
817 | open(R,'>',$cov_hist_file . ".plot.r") or die "ERROR\tCould not open file " . $cov_hist_file . ".plot.r\n";
818 |
819 | print R "x=read.table(\"$cov_hist_file\")\n";
820 | print R "pdf(\"$cov_hist_file.pdf\")\n";
821 | if($n0 < $ymax){
822 | print R "plot(x[,1],x[,2],log=\"x\",type=\"l\",xlab=\"Coverage\",ylab=\"Count\",main=\"$assembly\\n$tech\")\n";
823 | }
824 | else{
825 | print R "plot(x[,1],x[,2],ylim=c(0,$ymax),log=\"x\",type=\"l\",xlab=\"Coverage\",ylab=\"Count\",main=\"$assembly\\n$tech\")\n";
826 | }
827 | print R "text(2.5,$ymax,\"N(0)=$n0\")\n";
828 | print R "dev.off()\n";
829 |
830 | close R;
831 | }
832 | else{
833 | print STDERR "CMD\tsort -rgk2 $cov_hist_file | awk \'\$1!=0{print \$1}\' | head -1\n";
834 | print STDERR "CMD\tawk \'\$1==0{print \$2}\' $cov_hist_file\n";
835 | print STDERR "I would create $cov_hist_file.plot.r\n";
836 | }
837 |
838 | if(defined(can_run("Rscript"))){
839 | # $cmd = "Rscript $cov_hist_file.plot.r > /dev/null 2> /dev/null";
840 | $cmd = "Rscript $cov_hist_file.plot.r > $cov_hist_file.log 2> $cov_hist_file.err";
841 | exe_cmd($cmd,$verbose,$dry);
842 | }
843 | });
844 |
845 | my @global_ymax = sort( {$b <=> $a} @global_ymax); # sorted local copy (descending); shadows the shared @global_ymax declared above
846 |
847 | if(scalar(@sorted_bams) > 1){
848 | my $rscript = "$out_dir/plot.all.r";
849 | if($prefix ne ""){
850 | $rscript = "$out_dir/$prefix.plot.all.r";
851 | }
852 | if($dry == 0){
853 | my @techs = ("Illumina","CLR","HiFi","Nanopore");
854 |
855 | open(RALL,'>',"$rscript") or die "ERROR\tCould not open file $rscript\n";
856 |
857 | print RALL "xmax <- 0\n";
858 | for(my $i = 0; $i < scalar(@techs); $i++){
859 | if(exists($cov_files{$techs[$i]})){
860 | print RALL "$techs[$i]=read.table(\"$cov_files{$techs[$i]}\")\n";
861 | print RALL "xmax <- max(xmax, $techs[$i]\[,1])\n";
862 | }
863 | }
864 | my $pdf = $rscript;
865 | $pdf =~ s/r$/pdf/;
866 | print RALL "pdf(\"$pdf\")\n";
867 | my @legend = ();
868 | my @lty = ();
869 | my @col = ();
870 | print RALL "plot(NULL,log=\"x\",type=\"l\",xlab=\"Coverage\",ylab=\"Count\",main=\"$assembly\",ylim=c(0,$global_ymax[0]), xlim=c(1,xmax))\n";
871 | for(my $i = 0; $i < scalar(@techs); $i++){
872 | if(exists $cov_files{$techs[$i]}){
873 | push(@legend,"\"$techs[$i] N(0)=$n0_all{$techs[$i]}\"");
874 | push(@lty,"1");
875 | if($i == 0){
876 | print RALL "lines($techs[$i]\[,1],$techs[$i]\[,2],type=\"l\",col=\"black\")\n";
877 | push(@col,"\"black\"");
878 | }
879 | if($i == 1){
880 | print RALL "lines($techs[$i]\[,1],$techs[$i]\[,2],type=\"l\",col=\"blue\")\n";
881 | push(@col,"\"blue\"");
882 | }
883 | if($i == 2){
884 | print RALL "lines($techs[$i]\[,1],$techs[$i]\[,2],type=\"l\",col=\"darkgreen\")\n";
885 | push(@col,"\"darkgreen\"");
886 | }
887 | if($i == 3){
888 | print RALL "lines($techs[$i]\[,1],$techs[$i]\[,2],type=\"l\",col=\"red\")\n";
889 | push(@col,"\"red\"");
890 | }
891 | }
892 | }
893 | print RALL "legend(\"topright\",legend=c(" . join(",",@legend) . "),lty=c(" . join(",",@lty) . "),col=c(" . join(",",@col) . "))\n";
894 | print RALL "dev.off()\n";
895 |
896 | close RALL;
897 | }
898 | else{
899 | print STDERR "I would create $rscript\n";
900 | }
901 |
902 | if(defined(can_run("Rscript"))){
903 | $cmd = "Rscript $rscript > $rscript.log 2> $rscript.err";
904 | exe_cmd($cmd,$verbose,$dry);
905 | }
906 | }
907 |
908 | }
909 |
910 | if($estimate_genome_size_switch == 1){
911 |
912 | my %results;
913 |
914 | foreach(@sorted_bams){
915 | $cmd = "samtools stats -@ $samtools_threads $_ > $_.stats 2> $_.stats.err";
916 | exe_cmd($cmd,$verbose,$dry);
917 | }
918 |
919 | my $multiple_genome_size = Parallel::Loops->new($maxProcs);
920 | $multiple_genome_size->share(\%results);
921 |
922 | $multiple_genome_size->foreach( \@sorted_bams, sub {
923 | if($dry == 1){
924 | print STDERR "CMD\tsort -rgk2 $_.cov-hist | awk \'\$1!=0{print \$1}\' | head -1\n";
925 | }
926 |
927 | # $cmd = "samtools stats $_ > $_.stats 2> $_.stats.err";
928 | # exe_cmd($cmd,$verbose,$dry);
929 |
930 | if($dry == 0){
931 |
932 | my $tech = "Illumina";
933 | if($_ =~ m/\.pb\.sort\.bam$/){
934 | $tech = "CLR";
935 | }
936 | if($_ =~ m/\.hifi\.sort\.bam$/){
937 | $tech = "HiFi";
938 | }
939 | if($_ =~ m/\.ont\.sort\.bam$/){
940 | $tech = "Nanopore";
941 | }
942 |
943 | my $assembly_length = 0;
944 | my $assemly_perc = 0;
945 | if($assembly ne ""){
946 | open(IN,'<',"$assembly_path") or die "ERROR\tCould not open file " . $assembly_path . "\n";
947 | while(my $line = <IN>){
948 | chomp $line;
949 | if($line !~ m/^>/){
950 | $assembly_length = $assembly_length + length($line);
951 | }
952 | }
953 | }
954 | my $total_nucs = `grep "bases mapped (cigar):" $_.stats | awk -F'\\t' '{print \$3}'`;
955 | chomp $total_nucs;
956 |
957 | my $genome_size_estimate = $total_nucs / $peak_cov{$tech};
958 |
959 | $results{$tech} = $tech . " (" . $_ . ")\n" . "Mapped nucleotides: " . round_format_pref($total_nucs) . "b\n" . "Peak coverage: " . $peak_cov{$tech} . "\n" . "Genome size estimate: " . round_format_pref($genome_size_estimate) . "b\n";
960 | if($assembly_length > 0){
961 | $assemly_perc = sprintf("%.2f", ($assembly_length / $genome_size_estimate) * 100);
962 | $results{$tech} = $results{$tech} . "Assembly length: " . round_format_pref($assembly_length) . "b ($assemly_perc% of estimate)\n";
963 | }
964 |
965 | }
966 | else{
967 | print STDERR "CMD\tgrep \"bases mapped (cigar):\" $_.stats | awk -F\'\\t\' \'{print \$3}\'\n";
968 | }
969 | });
970 |
971 | if($dry == 0){
972 | print "\n";
973 | print "Output\n";
974 | print "======\n";
975 |
976 | my @techs = ("Illumina","CLR","HiFi","Nanopore");
977 | for (my $i = 0; $i < scalar(@techs); $i++){
978 | if(exists($results{$techs[$i]})){
979 | print $results{$techs[$i]};
980 | }
981 | }
982 | }
983 | }
984 |
985 | exit;
986 |
--------------------------------------------------------------------------------