├── LICENSE ├── README.pod └── tmerge /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Julien Lagarde 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.pod: -------------------------------------------------------------------------------- 1 | =head1 NAME 2 | 3 | tmerge 4 | 5 | =head1 SYNOPSIS 6 | 7 | Merge transcriptome read-to-genome alignments into non-redundant transcript models. 8 | 9 | C compares transcript structures (or read-to-genome alignments) present in the input and attempts to reduce transcript redundancy, I, merge compatible input transcripts into non-redundant transcript models. The program treats spliced and monoexonic reads separately (I, those are never merged together). 10 | 11 | C is fast and can typically process several millions of aligned long reads in a few minutes. 12 | 13 | =begin HTML 14 | 15 |

tmerge sketch

16 | 17 | =end HTML 18 | 19 | See DESCRIPTION below for more details. 20 | 21 | B: 22 | 23 | C<< tmerge --tmPrefix > >> 24 | 25 | 26 | =head2 INPUT 27 | 28 | GTF file of read-to-genome alignments, sorted by chromosome and start position. 29 | 30 | Only C records are considered. 31 | Read alignments need to be uniquely identified with the C GTF attribute. C is the only mandatory GTF attribute in input records. 32 | 33 | =head2 OPTIONS 34 | 35 | =over 36 | 37 | =item * C (string) = Prefix string for C identifiers in the output 38 | 39 | B: '' (empty string) 40 | 41 | By default, output Cs consist in arbitrary "C" strings. If C is set, its value will prefix all C strings in the GTF output. 42 | 43 | =item * C (integer) = minimum number of times a read alignment (as defined by its exon/intron structure) needs to be present in the input. In other words, when building a transcript model, only the reads fulfilling the following conditions are considered: 44 | 45 | For B, at least C input reads must share a given intron chain and 5' + 3' ends (+/- C bases, see below). 46 | 47 | For B, at least C input reads must share their 5' + 3' ends (+/- C bases, see below). In other words, when C C< = 0> (the default), only monoexonic reads with identical genome coordinates are merged. 48 | 49 | B: 1 50 | 51 | =item * C (positive integer) = Tolerated fuzziness of 5' and 3' ends for two reads to be considered equivalent when calculating read support (see C option above) 52 | 53 | B: 0 (i.e., no fuzziness allowed) 54 | 55 | 56 | =item * C (positive integer) = maximum number of nucleotides of terminal exon overhang allowed within an intron of another transcript during the merging of input reads. See explanation in "DESCRIPTION" below. 57 | 58 | B: 0 (i.e., no exon overhang allowed) 59 | 60 | =back 61 | 62 | =head2 OUTPUT 63 | 64 | C outputs non-redundant transcript models (B) in GTF format. Each TM entry is uniquely identified by its (arbitrary) C attribute. 65 | 66 | The C attribute has the same value as C by convention; it is therefore meaningless. 67 | 68 | The following extra GTF attributes are present in the 9th field, in order: 69 | 70 | =over 71 | 72 | =item * C (string): comma-separated list of input reads (Cs) contained in the TM, sorted by descending genomic size. 73 | 74 | =item * C (integer): number of input reads contained in the TM. 75 | 76 | =item * C<3p_dists_to_3p> (string): comma-separated list of the distances (always positive, in bases on mature RNA, i.e. ignoring introns) of the TM's 3' end to each of the input reads 3' ends it C. The list's order follows that of C. 77 | 78 | =item * C<5p_dists_to_5p> (string): comma-separated list of the distances (always positive, in bases on mature RNA, i.e. ignoring introns) of the TM's 5' end to each of the input reads 5' ends it C. The list's order follows that of C. 79 | 80 | =item * C (float): TM's expression quantification in "Full-Length Reads per Million". This corresponds to C divided by the number of reads (i.e., C's) present in the input. 81 | 82 | =item * C (string): comma-separated list of the longest read(s) (Cs) contained in the TM. This list contains more that one item only in case of length ties. Note that the reads reported do not necessarily cover the entire length of the resulting TM. 83 | 84 | =item * C (string): comma-separated list of input reads that support C over C's full-length (+/- C). 85 | 86 | =item * C (integer): number of input reads that support C over C's full-length (+/- C). 87 | 88 | =item * C (integer): the mature RNA length of the TM (i.e., the sum of the lengths of all its exons) 89 | 90 | =item * C (string): comma-separated list of the distances (comprised between 0 and 1, on mature RNA, i.e. ignoring introns) of the TM's B<5' end> to each of the input reads 3' ends it C, normalized over the TM's mature RNA length. The list's order follows that of C. 91 | 92 | =item * C (string): comma-separated list of the distances (comprised between 0 and 1, on mature RNA, i.e. ignoring introns) of the TM's 5' end to each of the input reads 5' ends it C, normalized over the TM's mature RNA length. The list's order follows that of C. 93 | 94 | =item * C (float): TM's expression quantification in "Reads per Million". This corresponds to C divided by the number of reads (i.e, C's) present in the input. 95 | 96 | =item * C (boolean): specifies if the TM is spliced (1) or monoexonic (0). 97 | 98 | =back 99 | 100 | =head1 DESCRIPTION 101 | 102 | C reduces redundancy in a set of transcriptome read-to-genome alignments. It does so by looking for reads with I> aligned structures in the input, and merging those into I> (B). 103 | 104 | =begin HTML 105 | 106 |

tmerge sketch

107 | 108 | =end HTML 109 | 110 | Pairwise B between aligned structures is evaluated using the following rules: 111 | 112 | =over 113 | 114 | =item * If both structures are B, they are deemed compatible if: 115 | 116 | =over 117 | 118 | =item * 1. at least one of their exons overlap on the same genomic strand, 119 | 120 | =item * 2. either their intron chains are equal, or one is an exact subset of the other, 121 | 122 | and 123 | 124 | =item * 3. there is no overlap between an exon of one structure and an intron of the other. 125 | 126 | =back 127 | 128 | Condition (2) means that C will never artificially extend intron chains: 129 | 130 | =begin HTML 131 | 132 |

tmerge non-merge case

133 | 134 | =end HTML 135 | 136 | =item * If both structures are B, they are considered compatible if they overlap by at least 1 nucleotide on the same genomic strand. 137 | 138 | =item * If one structure is B and the other B, they are not merged. 139 | 140 | =back 141 | 142 | 143 | All pairs of compatible structures are then merged recursively into the longest possible TM. 144 | 145 | =head2 C option and splice sites 146 | 147 | Setting this option to a positive integer can correct mismapped splice junctions that sometimes occur when aligning very short, error-rich terminal read exons: 148 | 149 | 150 | =begin HTML 151 | 152 |

tmerge FalseExonOverhang sketch

153 | 154 | =end HTML 155 | 156 | The setting works as explained below: 157 | 158 | =begin HTML 159 | 160 |

tmerge exonOverhangTolerance sketch

161 | 162 | =end HTML 163 | 164 | 165 | =head1 AUTHOR 166 | 167 | Julien Lagarde, CRG, Barcelona, contact julienlag@gmail.com 168 | 169 | -------------------------------------------------------------------------------- /tmerge: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | 3 | use strict; 4 | use warnings FATAL => 'all'; 5 | use Getopt::Long; 6 | use Pod::Usage; 7 | use Data::Dumper; 8 | no warnings 'recursion'; 9 | 10 | my $message_text = "Error\n"; 11 | my $exit_status = 2; ## The exit status to use 12 | my $verbose_level = 99; ## The verbose level to use 13 | my $filehandle = \*STDERR; ## The filehandle to write to 14 | my $sections = "NAME|SYNOPSIS|DESCRIPTION"; 15 | 16 | 17 | =head1 NAME 18 | 19 | tmerge 20 | 21 | =head1 SYNOPSIS 22 | 23 | Merge transcriptome read-to-genome alignments into non-redundant transcript models. 24 | 25 | C compares transcript structures (or read-to-genome alignments) present in the input and attempts to reduce transcript redundancy, I, merge compatible input transcripts into non-redundant transcript models. The program treats spliced and monoexonic reads separately (I, those are never merged together). 26 | 27 | C is fast and can typically process several millions of aligned long reads in a few minutes. 28 | 29 | =begin HTML 30 | 31 |

tmerge sketch

32 | 33 | =end HTML 34 | 35 | See DESCRIPTION below for more details. 36 | 37 | B: 38 | 39 | C<< tmerge --tmPrefix > >> 40 | 41 | 42 | =head2 INPUT 43 | 44 | GTF file of read-to-genome alignments, sorted by chromosome and start position. 45 | 46 | Only C records are considered. 47 | Read alignments need to be uniquely identified with the C GTF attribute. C is the only mandatory GTF attribute in input records. 48 | 49 | =head2 OPTIONS 50 | 51 | =over 52 | 53 | =item * C (string) = Prefix string for C identifiers in the output 54 | 55 | B: '' (empty string) 56 | 57 | By default, output Cs consist in arbitrary "C" strings. If C is set, its value will prefix all C strings in the GTF output. 58 | 59 | =item * C (integer) = minimum number of times a read alignment (as defined by its exon/intron structure) needs to be present in the input. In other words, when building a transcript model, only the reads fulfilling the following conditions are considered: 60 | 61 | For B, at least C input reads must share a given intron chain and 5' + 3' ends (+/- C bases, see below). 62 | 63 | For B, at least C input reads must share their 5' + 3' ends (+/- C bases, see below). In other words, when C C< = 0> (the default), only monoexonic reads with identical genome coordinates are merged. 64 | 65 | B: 1 66 | 67 | =item * C (positive integer) = Tolerated fuzziness of 5' and 3' ends for two reads to be considered equivalent when calculating read support (see C option above) 68 | 69 | B: 0 (i.e., no fuzziness allowed) 70 | 71 | 72 | =item * C (positive integer) = maximum number of nucleotides of terminal exon overhang allowed within an intron of another transcript during the merging of input reads. See explanation in "DESCRIPTION" below. 73 | 74 | B: 0 (i.e., no exon overhang allowed) 75 | 76 | =back 77 | 78 | =head2 OUTPUT 79 | 80 | C outputs non-redundant transcript models (B) in GTF format. Each TM entry is uniquely identified by its (arbitrary) C attribute. 81 | 82 | The C attribute has the same value as C by convention; it is therefore meaningless. 83 | 84 | The following extra GTF attributes are present in the 9th field, in order: 85 | 86 | =over 87 | 88 | =item * C (string): comma-separated list of input reads (Cs) contained in the TM, sorted by descending genomic size. 89 | 90 | =item * C (integer): number of input reads contained in the TM. 91 | 92 | =item * C<3p_dists_to_3p> (string): comma-separated list of the distances (always positive, in bases on mature RNA, i.e. ignoring introns) of the TM's 3' end to each of the input reads 3' ends it C. The list's order follows that of C. 93 | 94 | =item * C<5p_dists_to_5p> (string): comma-separated list of the distances (always positive, in bases on mature RNA, i.e. ignoring introns) of the TM's 5' end to each of the input reads 5' ends it C. The list's order follows that of C. 95 | 96 | =item * C (float): TM's expression quantification in "Full-Length Reads per Million". This corresponds to C divided by the number of reads (i.e., C's) present in the input. 97 | 98 | =item * C (string): comma-separated list of the longest read(s) (Cs) contained in the TM. This list contains more that one item only in case of length ties. Note that the reads reported do not necessarily cover the entire length of the resulting TM. 99 | 100 | =item * C (string): comma-separated list of input reads that support C over C's full-length (+/- C). 101 | 102 | =item * C (integer): number of input reads that support C over C's full-length (+/- C). 103 | 104 | =item * C (integer): the mature RNA length of the TM (i.e., the sum of the lengths of all its exons) 105 | 106 | =item * C (string): comma-separated list of the distances (comprised between 0 and 1, on mature RNA, i.e. ignoring introns) of the TM's B<5' end> to each of the input reads 3' ends it C, normalized over the TM's mature RNA length. The list's order follows that of C. 107 | 108 | =item * C (string): comma-separated list of the distances (comprised between 0 and 1, on mature RNA, i.e. ignoring introns) of the TM's 5' end to each of the input reads 5' ends it C, normalized over the TM's mature RNA length. The list's order follows that of C. 109 | 110 | =item * C (float): TM's expression quantification in "Reads per Million". This corresponds to C divided by the number of reads (i.e, C's) present in the input. 111 | 112 | =item * C (boolean): specifies if the TM is spliced (1) or monoexonic (0). 113 | 114 | =back 115 | 116 | =head1 DESCRIPTION 117 | 118 | C reduces redundancy in a set of transcriptome read-to-genome alignments. It does so by looking for reads with I> aligned structures in the input, and merging those into I> (B). 119 | 120 | =begin HTML 121 | 122 |

tmerge sketch

123 | 124 | =end HTML 125 | 126 | Pairwise B between aligned structures is evaluated using the following rules: 127 | 128 | =over 129 | 130 | =item * If both structures are B, they are deemed compatible if: 131 | 132 | =over 133 | 134 | =item * 1. at least one of their exons overlap on the same genomic strand, 135 | 136 | =item * 2. either their intron chains are equal, or one is an exact subset of the other, 137 | 138 | and 139 | 140 | =item * 3. there is no overlap between an exon of one structure and an intron of the other. 141 | 142 | =back 143 | 144 | Condition (2) means that C will never artificially extend intron chains: 145 | 146 | =begin HTML 147 | 148 |

tmerge non-merge case

149 | 150 | =end HTML 151 | 152 | =item * If both structures are B, they are considered compatible if they overlap by at least 1 nucleotide on the same genomic strand. 153 | 154 | =item * If one structure is B and the other B, they are not merged. 155 | 156 | =back 157 | 158 | 159 | All pairs of compatible structures are then merged recursively into the longest possible TM. 160 | 161 | =head2 C option and splice sites 162 | 163 | Setting this option to a positive integer can correct mismapped splice junctions that sometimes occur when aligning very short, error-rich terminal read exons: 164 | 165 | 166 | =begin HTML 167 | 168 |

tmerge FalseExonOverhang sketch

169 | 170 | =end HTML 171 | 172 | The setting works as explained below: 173 | 174 | =begin HTML 175 | 176 |

tmerge exonOverhangTolerance sketch

177 | 178 | =end HTML 179 | 180 | 181 | =head1 AUTHOR 182 | 183 | Julien Lagarde, CRG, Barcelona, contact julienlag@gmail.com 184 | 185 | =cut 186 | 187 | 188 | 189 | my $tmPrefix=''; 190 | my $minReadSupport=1; 191 | my $exonOverhangTolerance=0; 192 | my $transcriptEndFuzziness=0; 193 | my $debug=''; 194 | GetOptions ('tmPrefix=s' => \$tmPrefix, 195 | 'minReadSupport=i' => \$minReadSupport, 196 | 'exonOverhangTolerance=i' => \$exonOverhangTolerance, 197 | 'endFuzz=i' => \$transcriptEndFuzziness, 198 | 'debug' => \$debug, 199 | ) 200 | or pod2usage( { -message => "Error in command line arguments", 201 | -exitval => $exit_status , 202 | -verbose => $verbose_level, 203 | -output => $filehandle } ); 204 | 205 | unless (defined $ARGV[0]){ 206 | pod2usage( { -message => "Error in command line arguments: no input provided.", 207 | -exitval => $exit_status , 208 | -verbose => $verbose_level, 209 | -output => $filehandle } ); 210 | } 211 | 212 | die "ERROR: minReadSupport value must be > 0, can't continue.\n" if $minReadSupport<1; 213 | die "ERROR: exonOverhangTolerance value must be >= 0, can't continue.\n" if $exonOverhangTolerance<0; 214 | die "ERROR: endFuzz value must be >= 0, can't continue.\n" if $transcriptEndFuzziness<0; 215 | 216 | print STDERR "###################################################################################################################### 217 | ###################################################################################################################### 218 | ## 219 | "; 220 | 221 | print STDERR "## Parameters information:\n"; 222 | print STDERR "##\n"; 223 | print STDERR "## - TM transcript_id output prefix (--tmPrefix): $tmPrefix\n"; 224 | print STDERR "## - Minimum read support required per merged transcript model (--minReadSupport): $minReadSupport\n"; 225 | print STDERR "## - Exon overhang tolerance when merging (--exonOverhangTolerance): $exonOverhangTolerance bases\n"; 226 | print STDERR "## - End fuzziness tolerance when calculating support (--endFuzz): $transcriptEndFuzziness bases\n"; 227 | print STDERR "##\n"; 228 | print STDERR "###################################################################################################################### 229 | ###################################################################################################################### 230 | "; 231 | 232 | 233 | my $sortedGff; 234 | if(defined($ARGV[0])){ 235 | $sortedGff=$ARGV[0]; 236 | } 237 | else{ 238 | die "ERROR: Need input GTF file name as argument.\n"; 239 | } 240 | 241 | open GFF, "$sortedGff" or die "ERROR: ".$!; 242 | 243 | my %transcript_to_transcript=(); 244 | my %transcript_exons=(); 245 | my %transcript_chr=(); 246 | my %transcript_strand=(); 247 | my $previous_start=-1; 248 | my $previous_chr='Caravaggio'; 249 | my $previous_transcript='Picasso'; 250 | my $superExonStart=-1; 251 | my $superExonStop=-1; 252 | my %transcript_id_index=(); 253 | 254 | print STDERR "Parsing GTF input...\n"; 255 | my $nr_exons=0; 256 | my $read_count=0; 257 | my %transcript_seen=(); 258 | my %transcript_index=(); 259 | my %transcript_rev_index=(); 260 | while (){ 261 | next if ($_=~/^#/); 262 | next unless ($_=~/\texon\t/); 263 | if ($_=~/^(\S+)\t(\S+)\t(\S+)\t(\d+)\t(\d+)\t\S+\t(\S+)\t\S+\t.*transcript_id "(\S+)?";/){ 264 | if ($3 eq "exon"){ 265 | $nr_exons++; 266 | my $GTFtranscript_id=$7; 267 | my $chr=$1; 268 | my $start=$4; 269 | my $stop=$5; 270 | my $str=$6; 271 | my $strand; 272 | if($str eq '+'){ 273 | $strand=1; 274 | } 275 | elsif($str eq '-'){ 276 | $strand=-1; 277 | } 278 | elsif($str eq '.'){ 279 | $strand=0; 280 | } 281 | else{ 282 | die "ERROR: Unrecognized strand value '$str' at line $.\n"; 283 | } 284 | die "ERROR: Corrupt GTF. Start coordinate cannot be greater than stop coordinate at line $. . Can't continue.\n" if($start > $stop); 285 | unless(exists $transcript_seen{$GTFtranscript_id}){ 286 | $read_count++; 287 | $transcript_seen{$GTFtranscript_id}=undef; 288 | $transcript_index{$read_count}=$GTFtranscript_id; 289 | $transcript_rev_index{$GTFtranscript_id}=$read_count; 290 | } 291 | 292 | my $transcript_id=$transcript_rev_index{$GTFtranscript_id}; 293 | #Check for sorted input: 294 | die "ERROR: Unsorted GTF input (line $.). Must be sorted by chr, then start, then stop. Can't continue.\n" if ($chr eq $previous_chr && $start < $previous_start); 295 | if(exists $transcript_strand{$transcript_id} && $transcript_strand{$transcript_id} != $strand){ 296 | die "ERROR: Inconsistent strand for transcript $GTFtranscript_id in input file. Can't continue.\n"; 297 | } 298 | $transcript_strand{$transcript_id}=$strand; 299 | if (exists $transcript_chr{$transcript_id} && $transcript_chr{$transcript_id} ne $chr){ 300 | die "ERROR: Inconsistent chr for transcript $GTFtranscript_id in input file. Can't continue.\n"; 301 | 302 | } 303 | $transcript_chr{$transcript_id}=$chr; 304 | my @exon=($chr, $start, $stop, $strand); 305 | push(@{$transcript_exons{$transcript_id}}, \@exon); 306 | $transcript_to_transcript{$transcript_id}{$transcript_id}=undef; 307 | 308 | ### BEGIN this should be a subroutine but repeated subroutine calls are too expensive in perl 309 | ### my $overlapWithPrevious=overlap($superExonStart, $superExonStop, $start, $stop); 310 | my $start1=$superExonStart; 311 | my $stop1=$superExonStop; 312 | my $start2=$start; 313 | my $stop2=$stop; 314 | my $overlap; 315 | my $start2minusstop1=$start2-$stop1; 316 | my $start2minusstart1=$start2-$start1; 317 | my $stop2minusstart1=$stop2-$start1; 318 | my $stop2minusstop1=$stop2-$stop1; 319 | if( ( $stop2minusstart1>=0 && $stop2minusstop1 <=0 ) || ($start2minusstart1 >=0 && $start2minusstop1 <=0) || ($start2minusstart1 <= 0 && $stop2minusstop1 >= 0)){ 320 | $overlap=1; 321 | } 322 | else{ 323 | if($stop2minusstart1<0){ 324 | $overlap=-1 325 | } 326 | elsif($start2minusstop1>0){ 327 | $overlap=-2 328 | } 329 | } 330 | ### # 1 : overlap 331 | # -1 : 2 upstream of 1 332 | # -2 : 2 downstream of 1 333 | ### END this should be a subroutine but repeated subroutine calls are too expensive in perl 334 | 335 | 336 | 337 | 338 | if ($chr eq $previous_chr && $overlap == 1){ 339 | $transcript_to_transcript{$transcript_id}{$previous_transcript}=undef; 340 | $transcript_to_transcript{$previous_transcript}{$transcript_id}=undef; 341 | if($stop > $superExonStop){ 342 | $superExonStop = $stop; 343 | } 344 | } 345 | 346 | else{ 347 | #re-initialize superExon 348 | $previous_chr=$chr; 349 | $superExonStart=$start; 350 | $superExonStop=$stop; 351 | } 352 | $previous_transcript=$transcript_id; 353 | $previous_start=$start; 354 | } 355 | } 356 | else{ 357 | die "ERROR: line $.: malformed GTF record. Can't continue.\n"; 358 | 359 | } 360 | 361 | } 362 | close GFF; 363 | print STDERR "Done. Found $nr_exons exons and $read_count transcripts.\n"; 364 | my $million_read_count=1000000/($read_count+1); 365 | %transcript_seen=(); 366 | %transcript_rev_index=(); 367 | 368 | print STDERR "Building contigs (sets of overlapping transcripts)...\n"; 369 | # we build contigs to reduce the search space when looking for compatible transcript structures 370 | my $locusNumber = 0; 371 | my %transcript_id_to_locus_id=(); 372 | foreach my $tr1 (keys %transcript_to_transcript){ 373 | buildContig($tr1, \%transcript_to_transcript, \%transcript_id_to_locus_id, $locusNumber); 374 | #generate UUID for contig (this is to write temp files): 375 | $locusNumber++; 376 | } 377 | 378 | %transcript_to_transcript=(); #free up some memory 379 | 380 | my %contig_to_transcripts=(); 381 | foreach my $tr (keys %transcript_id_to_locus_id){ 382 | my $contig=$transcript_id_to_locus_id{$tr}; 383 | push(@{$contig_to_transcripts{$contig}}, $tr); 384 | } 385 | %transcript_id_to_locus_id=(); 386 | my $nrContigs=scalar(keys %contig_to_transcripts); 387 | print STDERR "Done. Built $nrContigs contigs.\n"; 388 | 389 | print STDERR "Comparing transcript structures...\n"; 390 | my %transcript_introns=(); 391 | my %container_original_ends=(); #contains non-adjusted end coordinates for all containers 392 | 393 | CONTIGS: foreach my $contig (keys %contig_to_transcripts){ 394 | my %container_to_transcripts=(); #'value' is a subset of , or equal to, 'key''s intron chain. 395 | my %container_to_supporting_transcripts=(); # same as %container_to_transcripts but more stringent ('key' contains 'value's which consist only in equal transcripts +/- exonOverhangTolerance) 396 | my %transcript_to_container=(); 397 | my @list1=@{$contig_to_transcripts{$contig}}; 398 | #build introns within contig: 399 | my @list1Mono=(); 400 | my @list1Spliced=(); 401 | my %supportCount=(); 402 | # build sets of introns for each transcript, populate @list1Mono and/or @list1Spliced accordingly: 403 | foreach my $tr (@list1){ 404 | $supportCount{$tr}=1; 405 | for (my $i=0; $i< $#{$transcript_exons{$tr}};$i++){ 406 | my $intronChr=${$transcript_exons{$tr}}[$i][0]; 407 | my $intronStrand=${$transcript_exons{$tr}}[$i][3]; 408 | my $intronStart=${$transcript_exons{$tr}}[$i][2]+1; 409 | my $intronStop=${$transcript_exons{$tr}}[$i+1][1]-1; 410 | my @intron=($intronChr, $intronStart, $intronStop, $intronStrand); 411 | push(@{$transcript_introns{$tr}}, \@intron); 412 | 413 | } 414 | if($#{$transcript_exons{$tr}} == 0){ 415 | push (@list1Mono, $tr) 416 | } 417 | else{ 418 | push(@list1Spliced, $tr); 419 | } 420 | } 421 | @list1=(); 422 | 423 | @list1Spliced= sort ({ ${$transcript_introns{$a}}[0][1] <=> ${$transcript_introns{$b}}[0][1] or ${$transcript_exons{$a}}[0][1] <=> ${$transcript_exons{$b}}[0][1] or ${$transcript_exons{$b}}[-1][2] <=> ${$transcript_exons{$a}}[-1][2] or $a cmp $b } @list1Spliced); #sort spliced transcripts by position of first intron. second and third comparisons are necessary so the script is deterministic (otherwise ties are handled randomly) 424 | 425 | # compute read support for each input spliced transcript/read: 426 | print STDERR "Calculating spliced read support...\n" if $debug; 427 | NEXTTR1: for (my $k=0; $k<=$#list1Spliced; $k++){ 428 | my $tr1=$list1Spliced[$k]; 429 | print STDERR "DEBUG: tr1 $transcript_index{$tr1}\n" if $debug; 430 | if(exists ($transcript_to_container{$tr1})){ 431 | #$tr1=$transcript_to_container{$tr1}; 432 | print STDERR "DEBUG: tr1 IS CONTAINED in $transcript_index{$transcript_to_container{$tr1}}\n" if $debug; 433 | next; 434 | } 435 | @{$container_original_ends{$tr1}}=(${$transcript_exons{$tr1}}[0][1], ${$transcript_exons{$tr1}}[-1][2]); 436 | 437 | NEXTTR2: for (my $l=$k+1; $l<=$#list1Spliced; $l++){ 438 | my $tr2=$list1Spliced[$l]; 439 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2}\n" if $debug; 440 | if(exists ($transcript_to_container{$tr2})){ 441 | #$tr2=$transcript_to_container{$tr2}; 442 | next; 443 | } 444 | if(${$transcript_exons{$tr1}}[-1][2] - ${$transcript_exons{$tr2}}[0][1]>0){ 445 | if( ( $transcript_strand{$tr1} == $transcript_strand{$tr2}) 446 | && 447 | ($#{$transcript_introns{$tr2}} == $#{$transcript_introns{$tr1}}) 448 | && 449 | (${$transcript_exons{$tr2}}[0][1] >= ${$transcript_exons{$tr1}}[0][1] - $transcriptEndFuzziness 450 | && 451 | ${$transcript_exons{$tr2}}[0][1] <= ${$transcript_exons{$tr1}}[0][1] + $transcriptEndFuzziness) 452 | && 453 | (${$transcript_exons{$tr2}}[-1][2] >= ${$transcript_exons{$tr1}}[-1][2] - $transcriptEndFuzziness 454 | && 455 | ${$transcript_exons{$tr2}}[-1][2] <= ${$transcript_exons{$tr1}}[-1][2] + $transcriptEndFuzziness) ){ 456 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} overlaps tr2 $transcript_index{$tr2}\n" if $debug; 457 | 458 | for (my $j=0; $j <= $#{$transcript_introns{$tr2}}; $j++){ 459 | for (my $i=$j; $i<= $#{$transcript_introns{$tr1}}; $i++){ 460 | if(${${$transcript_introns{$tr2}}[$j]}[1] == ${${$transcript_introns{$tr1}}[$i]}[1] && ${${$transcript_introns{$tr2}}[$j]}[2] == ${${$transcript_introns{$tr1}}[$i]}[2]){ 461 | print STDERR "DEBUG: tr1 intron $i = tr2 intron $j\n" if $debug; 462 | last; 463 | } 464 | else{ 465 | print STDERR "DEBUG: tr1 intron $i != tr2 intron $j\n" if $debug; 466 | next NEXTTR2; 467 | } 468 | } 469 | } 470 | $container_to_transcripts{$tr1}{$tr2}=undef ; 471 | $transcript_to_container{$tr2}=$tr1; 472 | adjustContainerEnds($tr1,$tr2); 473 | if (exists $container_to_transcripts{$tr2}){ 474 | foreach my $tr3 (keys %{$container_to_transcripts{$tr2}}){ 475 | $transcript_to_container{$tr3}=$tr1; 476 | $container_to_transcripts{$tr1}{$tr3}=undef ; 477 | } 478 | delete($container_to_transcripts{$tr2}); 479 | } 480 | 481 | 482 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} FULLMATCH 2\n" if $debug; 483 | $supportCount{$tr1}++; 484 | $container_to_supporting_transcripts{$tr1}{$tr2}=1; 485 | } 486 | } 487 | else{ #tr2 is downstream of tr1. Since trs are sorted by start position, we can safely skip to next tr1 488 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 10\n" if $debug; 489 | last; 490 | } 491 | } 492 | } 493 | 494 | print STDERR "Merging spliced reads...\n" if $debug; 495 | 496 | NEXTTR1: for (my $k=0; $k<=$#list1Spliced; $k++){ 497 | my $tr1=$list1Spliced[$k]; 498 | print STDERR "DEBUG: tr1 $transcript_index{$tr1}\n" if $debug; 499 | if(exists ($transcript_to_container{$tr1})){ 500 | print STDERR "DEBUG: tr1 IS CONTAINED in $transcript_index{$transcript_to_container{$tr1}}\n" if $debug; 501 | next; 502 | } 503 | @{$container_original_ends{$tr1}}=(${$transcript_exons{$tr1}}[0][1], ${$transcript_exons{$tr1}}[-1][2]); 504 | next unless ($supportCount{$tr1} >= $minReadSupport); 505 | 506 | NEXTTR2: for (my $l=$k+1; $l<=$#list1Spliced; $l++){ 507 | my $tr2=$list1Spliced[$l]; 508 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2}\n" if $debug; 509 | next unless ($supportCount{$tr2} >= $minReadSupport); 510 | if(exists ($transcript_to_container{$tr2})){ 511 | next; 512 | } 513 | 514 | if(${$transcript_exons{$tr1}}[-1][2] - ${$transcript_exons{$tr2}}[0][1]>0){ 515 | if( $transcript_strand{$tr1} == $transcript_strand{$tr2}){ 516 | #are intron chains compatible? (ie is tr2 a subset or equal to tr1? 517 | my $intronTr1Index=0; 518 | my $countIntronsTr2MatchedToTr1=-1; 519 | my $firstTr1IntronMatchedToTr2; 520 | my $lastTr1IntronMatchedToTr2; 521 | for (my $j=0; $j <= $#{$transcript_introns{$tr2}}; $j++){ 522 | print STDERR "DEBUG: intronTr2 #$j: ${${$transcript_introns{$tr2}}[$j]}[1], ${${$transcript_introns{$tr2}}[$j]}[2]\n" if $debug; 523 | for (my $i=$intronTr1Index; $i<= $#{$transcript_introns{$tr1}}; $i++){ 524 | print STDERR "DEBUG: intronTr1 #$i: ${${$transcript_introns{$tr1}}[$i]}[1], ${${$transcript_introns{$tr1}}[$i]}[2]\n" if $debug; 525 | 526 | 527 | ### BEGIN this should be a subroutine but repeated subroutine calls are too expensive in perl 528 | ### my $overlap=overlap(${$intronTr1}[1], ${$intronTr1}[2], ${$intronTr2}[1], ${$intronTr2}[2]); 529 | #my $start1=${${$transcript_introns{$tr1}}[$i]}[1]; 530 | #my $stop1=${${$transcript_introns{$tr1}}[$i]}[2]; 531 | #my $start2=${${$transcript_introns{$tr2}}[$j]}[1]; 532 | #my $stop2=${${$transcript_introns{$tr2}}[$j]}[2]; 533 | my $overlap; 534 | my $start2minusstop1=${${$transcript_introns{$tr2}}[$j]}[1]-${${$transcript_introns{$tr1}}[$i]}[2]; 535 | my $start2minusstart1=${${$transcript_introns{$tr2}}[$j]}[1]-${${$transcript_introns{$tr1}}[$i]}[1]; 536 | my $stop2minusstart1=${${$transcript_introns{$tr2}}[$j]}[2]-${${$transcript_introns{$tr1}}[$i]}[1]; 537 | my $stop2minusstop1=${${$transcript_introns{$tr2}}[$j]}[2]-${${$transcript_introns{$tr1}}[$i]}[2]; 538 | if( ( $stop2minusstart1>=0 && $stop2minusstop1 <=0 ) || ($start2minusstart1 >=0 && $start2minusstop1 <=0) || ($start2minusstart1 <= 0 && $stop2minusstop1 >= 0)){ 539 | $overlap=1; 540 | } 541 | else{ 542 | if($stop2minusstart1<0){ 543 | $overlap=-1 544 | } 545 | elsif($start2minusstop1>0){ 546 | $overlap=-2 547 | } 548 | } 549 | ### # 1 : overlap 550 | # -1 : 2 upstream of 1 551 | # -2 : 2 downstream of 1 552 | ### END this should be a subroutine but repeated subroutine calls are too expensive in perl 553 | 554 | 555 | 556 | print STDERR "DEBUG: overlap $overlap\n" if $debug; 557 | if($overlap == 1){ 558 | if(${${$transcript_introns{$tr2}}[$j]}[1] == ${${$transcript_introns{$tr1}}[$i]}[1] && ${${$transcript_introns{$tr2}}[$j]}[2] == ${${$transcript_introns{$tr1}}[$i]}[2]){ 559 | $countIntronsTr2MatchedToTr1++; 560 | if($j==0){ 561 | $firstTr1IntronMatchedToTr2=$i; 562 | } 563 | if($j==$#{$transcript_introns{$tr2}}){ 564 | $lastTr1IntronMatchedToTr2=$i; 565 | } 566 | $intronTr1Index=$i+1; #skip directly to next $tr1 intron at the next round (next $tr2 intron) 567 | if($intronTr1Index > $#{$transcript_introns{$tr1}} #we've reached the last intron of tr1. 568 | && $j < $#{$transcript_introns{$tr2}} #we've not reached the last intron of tr2. 569 | && ${$transcript_introns{$tr1}}[0][1] == ${$transcript_introns{$tr2}}[0][1] #tr1 and tr2's respective intron chains start at the same coord 570 | && ${$transcript_introns{$tr2}}[$j+1][1] > ${$transcript_exons{$tr1}}[-1][2] - $exonOverhangTolerance){ #tr1's last exon does not overhang too much inside tr2's next intron 571 | #Transfer remaining exons of tr2 to tr1, if they're compatible 572 | $lastTr1IntronMatchedToTr2=$i; 573 | print STDERR "DEBUG: reached last tr1 intron\n" if $debug; 574 | if(checkIntronExonOverlap($tr1,$tr2,$firstTr1IntronMatchedToTr2,$lastTr1IntronMatchedToTr2) ==0){ 575 | 576 | $container_to_transcripts{$tr1}{$tr2}=undef ; 577 | $transcript_to_container{$tr2}=$tr1; 578 | transferRightExonsIntrons($tr1,$tr2,$j); 579 | adjustContainerEnds($tr1,$tr2); 580 | if (exists $container_to_transcripts{$tr2}){ 581 | foreach my $tr3 (keys %{$container_to_transcripts{$tr2}}){ 582 | $transcript_to_container{$tr3}=$tr1; 583 | $container_to_transcripts{$tr1}{$tr3}=undef ; 584 | } 585 | delete($container_to_transcripts{$tr2}); 586 | } 587 | 588 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} MATCH 2\n" if $debug; 589 | next NEXTTR2 590 | 591 | } 592 | else{ 593 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 2\n" if $debug; 594 | next NEXTTR2; 595 | } 596 | } 597 | last; 598 | } 599 | else{ # introns overlap but don't exactly match, give up current tr2 600 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 3\n" if $debug; 601 | next NEXTTR2; 602 | } 603 | } 604 | elsif($overlap == -1){ #intron1 is downstream of intron2 605 | if($countIntronsTr2MatchedToTr1 == $#{$transcript_introns{$tr2}}){ # all tr2 introns have found a match in tr1 606 | print STDERR "DEBUG: YES 1" if $debug; 607 | if(checkIntronExonOverlap($tr1,$tr2,$firstTr1IntronMatchedToTr2,$lastTr1IntronMatchedToTr2) ==0){ 608 | $container_to_transcripts{$tr1}{$tr2}=undef ; 609 | $transcript_to_container{$tr2}=$tr1; 610 | adjustContainerEnds($tr1,$tr2); 611 | if (exists $container_to_transcripts{$tr2}){ 612 | foreach my $tr3 (keys %{$container_to_transcripts{$tr2}}){ 613 | $transcript_to_container{$tr3}=$tr1; 614 | $container_to_transcripts{$tr1}{$tr3}=undef ; 615 | } 616 | delete($container_to_transcripts{$tr2}); 617 | } 618 | 619 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} MATCH 3\n" if $debug; 620 | next NEXTTR2; 621 | } 622 | else{ 623 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 4\n" if $debug; 624 | next NEXTTR2; 625 | } 626 | } 627 | else{ 628 | 629 | if($i==0 && $j==0){ #first intron of tr1 is downstream of first intron of tr2. Since transcripts are sorted by position of first intron, we can skip the rest of the tr2 list 630 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 5\n" if $debug; 631 | next NEXTTR1; 632 | } 633 | else{ 634 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 6\n" if $debug; 635 | next NEXTTR2; 636 | } 637 | } 638 | } 639 | elsif($overlap == -2){ #intron1 is upstream of intron2 640 | if($i == $#{$transcript_introns{$tr1}} || $j > 0 ){ #last intron of tr1 is upstream of first intron of tr2, i.e. tr1 and tr2 are incompatible 641 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 7\n" if $debug; 642 | next NEXTTR2; 643 | } 644 | 645 | } 646 | } 647 | } 648 | if($countIntronsTr2MatchedToTr1 == $#{$transcript_introns{$tr2}}){ #fully identical or contained intron chain 649 | print STDERR "DEBUG: YES 2\n" if $debug; 650 | if(checkIntronExonOverlap($tr1,$tr2,$firstTr1IntronMatchedToTr2,$lastTr1IntronMatchedToTr2) ==0){ 651 | $container_to_transcripts{$tr1}{$tr2}=undef ; 652 | $transcript_to_container{$tr2}=$tr1; 653 | adjustContainerEnds($tr1,$tr2); 654 | if (exists $container_to_transcripts{$tr2}){ 655 | foreach my $tr3 (keys %{$container_to_transcripts{$tr2}}){ 656 | $transcript_to_container{$tr3}=$tr1; 657 | $container_to_transcripts{$tr1}{$tr3}=undef ; 658 | } 659 | delete($container_to_transcripts{$tr2}); 660 | } 661 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} MATCH 4\n" if $debug; 662 | next; 663 | } 664 | } 665 | else{ 666 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 9\n" if $debug; 667 | 668 | } 669 | 670 | } 671 | } 672 | 673 | else{ #tr2 is downstream of tr1. Since trs are sorted by start position, we can safely skip to next tr1 674 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 10\n" if $debug; 675 | last; 676 | } 677 | 678 | } 679 | } 680 | %transcript_introns=(); 681 | 682 | @list1Mono=sort { ${$transcript_exons{$a}}[0][1] <=> ${$transcript_exons{$b}}[0][1] or ${$transcript_exons{$a}}[0][2] <=> ${$transcript_exons{$b}}[0][2] } @list1Mono; 683 | # compute read support for each input monoexonic transcript/read: 684 | print STDERR "Calculating monoexonic read support...\n" if $debug; 685 | NEXTTR1: for (my $k=0; $k<=$#list1Mono; $k++){ 686 | my $tr1=$list1Mono[$k]; 687 | print STDERR "DEBUG: tr1 $transcript_index{$tr1}\n" if $debug; 688 | if(exists ($transcript_to_container{$tr1})){ 689 | print STDERR "DEBUG: tr1 IS CONTAINED in $transcript_index{$transcript_to_container{$tr1}}\n" if $debug; 690 | next; 691 | } 692 | @{$container_original_ends{$tr1}}=(${$transcript_exons{$tr1}}[0][1], ${$transcript_exons{$tr1}}[-1][2]); 693 | NEXTTR2: for (my $l=$k+1; $l<=$#list1Mono; $l++){ 694 | my $tr2=$list1Mono[$l]; 695 | 696 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2}\n" if $debug; 697 | if(exists ($transcript_to_container{$tr2})){ 698 | next; 699 | } 700 | if(${$transcript_exons{$tr1}}[0][2] - ${$transcript_exons{$tr2}}[0][1]>=0){ 701 | if( ($transcript_strand{$tr1} == $transcript_strand{$tr2}) 702 | && 703 | (${$transcript_exons{$tr2}}[0][1] >= ${$transcript_exons{$tr1}}[0][1] - $transcriptEndFuzziness 704 | && 705 | ${$transcript_exons{$tr2}}[0][1] <= ${$transcript_exons{$tr1}}[0][1] + $transcriptEndFuzziness) 706 | && 707 | (${$transcript_exons{$tr2}}[-1][2] >= ${$transcript_exons{$tr1}}[-1][2] - $transcriptEndFuzziness 708 | && 709 | ${$transcript_exons{$tr2}}[-1][2] <= ${$transcript_exons{$tr1}}[-1][2] + $transcriptEndFuzziness) ) { 710 | 711 | $container_to_transcripts{$tr1}{$tr2}=undef ; 712 | $transcript_to_container{$tr2}=$tr1; 713 | adjustContainerEnds($tr1,$tr2); 714 | if (exists $container_to_transcripts{$tr2}){ 715 | foreach my $tr3 (keys %{$container_to_transcripts{$tr2}}){ 716 | $transcript_to_container{$tr3}=$tr1; 717 | $container_to_transcripts{$tr1}{$tr3}=undef ; 718 | } 719 | delete($container_to_transcripts{$tr2}); 720 | } 721 | 722 | $supportCount{$tr1}++; 723 | $container_to_supporting_transcripts{$tr1}{$tr2}=1; 724 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} FULLMATCH 2\n" if $debug; 725 | } 726 | } 727 | else{ #tr2 is downstream of tr1. Since trs are sorted by start position, we can safely skip to next tr1 728 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 10\n" if $debug; 729 | last; 730 | } 731 | 732 | } 733 | 734 | } 735 | 736 | print STDERR "Merging monoexonic reads...\n" if $debug; 737 | NEXTTR1: for (my $k=0; $k<=$#list1Mono; $k++){ 738 | my $tr1=$list1Mono[$k]; 739 | print STDERR "DEBUG: tr1 $transcript_index{$tr1}\n" if $debug; 740 | if(exists ($transcript_to_container{$tr1})){ 741 | print STDERR "DEBUG: tr1 IS CONTAINED in $transcript_index{$transcript_to_container{$tr1}}\n" if $debug; 742 | next; 743 | } 744 | @{$container_original_ends{$tr1}}=(${$transcript_exons{$tr1}}[0][1], ${$transcript_exons{$tr1}}[-1][2]); 745 | print STDERR "DEBUG: tr1 read support: $supportCount{$tr1}\n" if $debug; 746 | next unless ($supportCount{$tr1} >= $minReadSupport); 747 | NEXTTR2: for (my $l=$k+1; $l<=$#list1Mono; $l++){ 748 | my $tr2=$list1Mono[$l]; 749 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2}\n" if $debug; 750 | print STDERR "DEBUG: tr2 read support: $supportCount{$tr2}\n" if $debug; 751 | next unless ($supportCount{$tr2} >= $minReadSupport); 752 | if(exists ($transcript_to_container{$tr2})){ 753 | print STDERR "DEBUG: tr2 is contained in $transcript_to_container{$tr2}\n" if $debug; 754 | next; 755 | } 756 | if(${$transcript_exons{$tr1}}[0][2] - ${$transcript_exons{$tr2}}[0][1]>=0){ 757 | if( $transcript_strand{$tr1} == $transcript_strand{$tr2}){ 758 | $container_to_transcripts{$tr1}{$tr2}=undef ; 759 | $transcript_to_container{$tr2}=$tr1; 760 | adjustContainerEnds($tr1,$tr2); 761 | if (exists $container_to_transcripts{$tr2}){ 762 | foreach my $tr3 (keys %{$container_to_transcripts{$tr2}}){ 763 | $transcript_to_container{$tr3}=$tr1; 764 | $container_to_transcripts{$tr1}{$tr3}=undef ; 765 | } 766 | delete($container_to_transcripts{$tr2}); 767 | } 768 | 769 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} overlaps tr2 $transcript_index{$tr2}\n" if $debug; 770 | } 771 | 772 | } 773 | else{ 774 | print STDERR "DEBUG: tr1 $transcript_index{$tr1} vs tr2 $transcript_index{$tr2} INCOMP 11\n" if $debug; 775 | last; 776 | 777 | } 778 | } 779 | 780 | } 781 | 782 | 783 | ################################ 784 | ################################ 785 | ## print all to GTF ## 786 | ################################ 787 | ################################ 788 | 789 | foreach my $tr (@list1Spliced){ 790 | next unless ($supportCount{$tr} >= $minReadSupport); 791 | if (exists ($container_to_transcripts{$tr})){ # i.e. if tr is a container 792 | my @trList= sort { ${$transcript_exons{$b}}[-1][2] - ${$transcript_exons{$b}}[0][1] <=> ${$transcript_exons{$a}}[-1][2] - ${$transcript_exons{$a}}[0][1] } (keys (%{$container_to_transcripts{$tr}}), $tr); # sort by genomic length 793 | my @longest=(); 794 | my $longestTMlength=0; 795 | for (my $i=0;$i<=$#trList;$i++){ 796 | my $t=$trList[$i]; 797 | if($i==0){ 798 | $longestTMlength=(${$transcript_exons{$t}}[-1][2] - ${$transcript_exons{$t}}[0][1]); 799 | } 800 | if(${$transcript_exons{$t}}[-1][2] - ${$transcript_exons{$t}}[0][1] == $longestTMlength){ 801 | push(@longest, $t); 802 | } 803 | } 804 | my $trMatureLength=0; 805 | for (my $i=0; $i<=$#{$transcript_exons{$tr}}; $i++){ 806 | $trMatureLength+= (${${$transcript_exons{$tr}}[$i]}[2] - ${${$transcript_exons{$tr}}[$i]}[1])+1; 807 | } 808 | #my @transcriptsList=(); 809 | #foreach my $t (@trList){ 810 | # push(@transcriptsList, $t) 811 | #} 812 | my %uniqsupporting_longest=(); 813 | my @supporting_longest=(); 814 | #print STDERR "longest:".join(",", @longest)."\n"; 815 | foreach my $longest (@longest){ 816 | foreach my $supporting_longest (keys %{$container_to_supporting_transcripts{$longest}}, $longest){ 817 | #print STDERR "supporting: $supporting\n"; 818 | $uniqsupporting_longest{$supporting_longest}=1; 819 | } 820 | } 821 | #print STDERR Dumper \%uniqSupporting; 822 | 823 | foreach my $t (keys %uniqsupporting_longest){ 824 | #print STDERR "$t\n"; 825 | push(@supporting_longest, $t); 826 | } 827 | if ($debug){ 828 | print STDERR "print records for ".$transcript_index{$tr}."\n"; 829 | print STDERR "\ttrList:\n"; 830 | foreach my $id (@trList){ 831 | print STDERR "\t ".$transcript_index{$id}."\n"; 832 | } 833 | } 834 | 835 | #compute distance of all contained reads' TSSs/TTSs to TM's TSS/TTS 836 | my ($refDistsToLeft, $refDistsToRight) = endDistances($tr, \@trList); 837 | #print STDERR "Dists to left: ".join(",", @{$refDistsToLeft})."\n"; 838 | #print STDERR "Dists to right: ".join(",", @{$refDistsToRight})."\n"; 839 | 840 | printGTF($tr, 0, \@trList, \@longest, \@supporting_longest, 1, $refDistsToLeft, $refDistsToRight, $trMatureLength); 841 | } 842 | else{ #tr is not a container 843 | unless(exists ($transcript_to_container{$tr})) { # i.e. if tr is not a container, and is not contained 844 | my @trList=($tr); 845 | my @tmp1=('0'); 846 | my $trMatureLength=0; 847 | for (my $i=0; $i<=$#{$transcript_exons{$tr}}; $i++){ 848 | $trMatureLength+= (${${$transcript_exons{$tr}}[$i]}[2] - ${${$transcript_exons{$tr}}[$i]}[1])+1; 849 | } 850 | 851 | #print STDERR "Dists to left: ".join(",", @tmp1)."\n"; 852 | #print STDERR "Dists to right: ".join(",", @tmp1)."\n"; 853 | if ($debug){ 854 | print STDERR "print records for ".$transcript_index{$tr}."\n"; 855 | print STDERR "\ttrList:\n"; 856 | foreach my $id (@trList){ 857 | print STDERR "\t ".$transcript_index{$id}."\n"; 858 | } 859 | } 860 | printGTF($tr, 0, \@trList, \@trList, \@trList, 1, \@tmp1, \@tmp1, $trMatureLength); 861 | } 862 | } 863 | } 864 | 865 | 866 | foreach my $tr (@list1Mono){ 867 | next unless ($supportCount{$tr} >= $minReadSupport); 868 | if (exists ($container_to_transcripts{$tr})){ # i.e. if tr is a container 869 | my @trList= sort { ${$transcript_exons{$b}}[-1][2] - ${$transcript_exons{$b}}[0][1] <=> ${$transcript_exons{$a}}[-1][2] - ${$transcript_exons{$a}}[0][1] } (keys (%{$container_to_transcripts{$tr}}), $tr); # sort by genomic length 870 | if ($#trList>=$minReadSupport-1){ 871 | my @longest=(); 872 | my $longestTMlength=0; 873 | for (my $i=0;$i<=$#trList;$i++){ 874 | my $t=$trList[$i]; 875 | if($i==0){ 876 | $longestTMlength=${$transcript_exons{$t}}[-1][2] - ${$transcript_exons{$t}}[0][1]; 877 | } 878 | if(${$transcript_exons{$t}}[-1][2] - ${$transcript_exons{$t}}[0][1] == $longestTMlength){ 879 | push(@longest, $t); 880 | } 881 | } 882 | my $trMatureLength=0; 883 | for (my $i=0; $i<=$#{$transcript_exons{$tr}}; $i++){ 884 | $trMatureLength+= (${${$transcript_exons{$tr}}[$i]}[2] - ${${$transcript_exons{$tr}}[$i]}[1])+1; 885 | } 886 | 887 | #my @transcriptsList=(); 888 | #foreach my $t (@trList){ 889 | # push(@transcriptsList, $t) 890 | #} 891 | my %uniqsupporting_longest=(); 892 | my @supporting_longest=(); 893 | foreach my $longest (@longest){ 894 | foreach my $supporting_longest (keys %{$container_to_supporting_transcripts{$longest}}, $longest){ 895 | $uniqsupporting_longest{$supporting_longest}=1; 896 | } 897 | } 898 | foreach my $t (keys %uniqsupporting_longest){ 899 | push(@supporting_longest, $t); 900 | } 901 | #compute distance of all contained reads' TSSs/TTSs to TM's TSS/TTS 902 | my ($refDistsToLeft, $refDistsToRight) = endDistances($tr, \@trList); 903 | #print STDERR "Dists to left: ".join(",", @{$refDistsToLeft})."\n"; 904 | #print STDERR "Dists to right: ".join(",", @{$refDistsToRight})."\n"; 905 | printGTF($tr, 0, \@trList, \@longest, \@supporting_longest, 0, $refDistsToLeft, $refDistsToRight, $trMatureLength); 906 | } 907 | 908 | } 909 | else{ #tr is not a container 910 | unless(exists ($transcript_to_container{$tr})) { # i.e. if tr is not a container, and is not contained 911 | my @trList=($tr); 912 | my @tmp1=('0'); 913 | my $trMatureLength=0; 914 | for (my $i=0; $i<=$#{$transcript_exons{$tr}}; $i++){ 915 | $trMatureLength+= (${${$transcript_exons{$tr}}[$i]}[2] - ${${$transcript_exons{$tr}}[$i]}[1])+1; 916 | } 917 | 918 | #print STDERR "Dists to left: ".join(",", @tmp1)."\n"; 919 | #print STDERR "Dists to right: ".join(",", @tmp1)."\n"; 920 | printGTF($tr, 0, \@trList, \@trList, \@trList, 0, \@tmp1, \@tmp1, $trMatureLength); 921 | 922 | } 923 | } 924 | } 925 | 926 | } 927 | 928 | print STDERR "Done.\n"; 929 | 930 | 931 | 932 | sub buildContig{ 933 | my $trA=$_[0]; 934 | my $feature_to_feature=$_[1]; 935 | my $feature_to_contig=$_[2]; 936 | my $contigNumber=$_[3]; 937 | 938 | unless(exists ${$feature_to_contig}{$trA}){ 939 | ${$feature_to_contig}{$trA}=$contigNumber; 940 | } 941 | foreach my $trB (keys %{${$feature_to_feature}{$trA}}){ 942 | unless( exists ${$feature_to_contig}{$trB} ){ 943 | buildContig($trB, $feature_to_feature, $feature_to_contig, $contigNumber); 944 | } 945 | } 946 | } 947 | 948 | sub printGTF{ 949 | my $transcript_id=$_[0]; 950 | my $score=$_[1]; 951 | my @contains=@{$_[2]}; 952 | my @longest=@{$_[3]}; 953 | my @supporting_longest=@{$_[4]}; 954 | my $supporting_longest_count= scalar @supporting_longest; 955 | my $contains_count= scalar @contains; 956 | my $rpm=$contains_count/$million_read_count; 957 | my $flrpm=$supporting_longest_count/$million_read_count; 958 | my $spliced_bool=$_[5]; 959 | my @distsToLeft=@{$_[6]}; 960 | my @distsToRight=@{$_[7]}; 961 | my $length=$_[8]; 962 | my @distsToFivePend=(); 963 | my @distsToThreePend=(); 964 | my @contains_id=(); 965 | foreach my $id (@contains){ 966 | push(@contains_id, $transcript_index{$id}); 967 | } 968 | 969 | my @longest_id=(); 970 | foreach my $id (@longest){ 971 | push(@longest_id, $transcript_index{$id}); 972 | } 973 | 974 | my @supporting_longest_id=(); 975 | foreach my $id (@supporting_longest){ 976 | push(@supporting_longest_id, $transcript_index{$id}); 977 | } 978 | my $contains=join(",", @contains_id); 979 | my $longest=join(",", @longest_id); 980 | my $supporting_longest=join(",", @supporting_longest_id); 981 | my $strand=''; 982 | if(${$transcript_exons{$transcript_id}}[0][3] == -1){ 983 | $strand='-'; 984 | @distsToFivePend=@distsToRight; 985 | @distsToThreePend=@distsToLeft; 986 | } 987 | elsif (${$transcript_exons{$transcript_id}}[0][3] == 1){ 988 | $strand='+'; 989 | @distsToFivePend=@distsToLeft; 990 | @distsToThreePend=@distsToRight; 991 | } 992 | elsif (${$transcript_exons{$transcript_id}}[0][3] == 0){ 993 | $strand='.'; 994 | @distsToFivePend=@distsToLeft; 995 | @distsToThreePend=@distsToRight; 996 | } 997 | else{ 998 | die; 999 | } 1000 | my $distsToFivePendString=join(",", @distsToFivePend); 1001 | my $distsToThreePendString=join(",", @distsToThreePend); 1002 | my @metaDistsToFivePend=(); 1003 | my @metaDistsToThreePend=(); 1004 | foreach my $dist (@distsToFivePend){ 1005 | push(@metaDistsToFivePend, $dist/$length); 1006 | } 1007 | foreach my $dist (@distsToThreePend){ 1008 | push(@metaDistsToThreePend, 1 - ($dist/$length)); 1009 | } 1010 | my $metaDistsToFivePendString=join(",", @metaDistsToFivePend); 1011 | my $metaDistsToThreePendString=join(",", @metaDistsToThreePend); 1012 | 1013 | 1014 | foreach my $exon (@{$transcript_exons{$transcript_id}}){ 1015 | my $tmId=makeTmId($transcript_id); 1016 | print "${$exon}[0]\ttmerge\texon\t${$exon}[1]\t${$exon}[2]\t$score\t$strand\t.\tgene_id \"$tmId\"; transcript_id \"$tmId\"; contains \"$contains\"; contains_count \"$contains_count\"; 3p_dists_to_3p \"$distsToThreePendString\"; 5p_dists_to_5p \"$distsToFivePendString\"; flrpm \"$flrpm\"; longest \"$longest\"; longest_FL_supporters \"$supporting_longest\"; longest_FL_supporters_count \"$supporting_longest_count\"; mature_RNA_length \"$length\"; meta_3p_dists_to_5p \"$metaDistsToThreePendString\"; meta_5p_dists_to_5p \"$metaDistsToFivePendString\"; rpm \"$rpm\"; spliced \"$spliced_bool\";\n"; 1017 | } 1018 | } 1019 | 1020 | sub endDistances{ 1021 | my $container=$_[0]; 1022 | my @contains=@{$_[1]}; 1023 | my @distsToLeft=(); 1024 | my @distsToRight=(); 1025 | #print STDERR $transcript_index{$container}."\n"; 1026 | foreach my $trContained (@contains){ 1027 | #print STDERR "\t$transcript_index{$trContained}\n"; 1028 | my $containedLeftStart; 1029 | my $containedRightEnd; 1030 | if($trContained == $container){ 1031 | $containedLeftStart=${$container_original_ends{$container}}[0]; 1032 | $containedRightEnd=${$container_original_ends{$container}}[1]; 1033 | } 1034 | else{ 1035 | $containedLeftStart=${$transcript_exons{$trContained}}[0][1]; 1036 | $containedRightEnd=${$transcript_exons{$trContained}}[-1][2]; 1037 | } 1038 | #print STDERR "\t$containedLeftStart - $containedRightEnd\n"; 1039 | 1040 | my $distToLeft=0; 1041 | my $intronLeftSubstract=0; 1042 | my $distToRight=0; 1043 | my $intronRightSubstract=0; 1044 | 1045 | #compute distance to left end of container 1046 | for (my $i=0; $i<=$#{$transcript_exons{$container}}; $i++){ 1047 | if($i>0){ 1048 | $intronLeftSubstract+=${${$transcript_exons{$container}}[$i]}[1] - ${${$transcript_exons{$container}}[$i-1]}[2]; 1049 | $distToLeft=(${${$transcript_exons{$container}}[$i]}[1] - ${${$transcript_exons{$container}}[0]}[1]) - $intronLeftSubstract; 1050 | } 1051 | #print STDERR "\t i: $i ${${$transcript_exons{$container}}[$i]}[1] ${${$transcript_exons{$container}}[$i]}[2]\n"; 1052 | #print STDERR "\t i: $i distToLeft: $distToLeft\n"; 1053 | if($containedLeftStart >= (${${$transcript_exons{$container}}[$i]}[1] -1) - $exonOverhangTolerance && $containedLeftStart <= ${${$transcript_exons{$container}}[$i]}[2]){ 1054 | my $lastDist=$containedLeftStart - ${${$transcript_exons{$container}}[$i]}[1] ; 1055 | $lastDist=0 if $lastDist<0; #account for exonOverhangTolerance 1056 | $distToLeft+=$lastDist ; 1057 | last; 1058 | } 1059 | elsif ($containedLeftStart < ${${$transcript_exons{$container}}[$i]}[1]){ 1060 | die "ERROR: Program died due to a bug (in endDistances subroutine: $containedLeftStart < ${${$transcript_exons{$container}}[$i]}[1])\nsorry. Please contact author.\n"; 1061 | } 1062 | } 1063 | 1064 | #compute distance to right end of container 1065 | for (my $i=$#{$transcript_exons{$container}}; $i>=0; $i--){ 1066 | if($i<$#{$transcript_exons{$container}}){ 1067 | $intronRightSubstract+=${${$transcript_exons{$container}}[$i+1]}[1] - ${${$transcript_exons{$container}}[$i]}[2]; 1068 | $distToRight=(${${$transcript_exons{$container}}[$#{$transcript_exons{$container}}]}[2] - ${${$transcript_exons{$container}}[$i]}[2]) - $intronRightSubstract; 1069 | #$distToLeft=$distToLeft + (${${$transcript_exons{$container}}[$i]}[1] - ${${$transcript_exons{$container}}[$i-1]}[1]); 1070 | #$distToLeft= $distToLeft - (${${$transcript_exons{$container}}[$i]}[1] - ${${$transcript_exons{$container}}[$i-1]}[2]); 1071 | } 1072 | #print STDERR "\t i: $i distToRight: $distToRight\n"; 1073 | if($containedRightEnd <= (${${$transcript_exons{$container}}[$i]}[2] +1) + $exonOverhangTolerance && $containedRightEnd >= ${${$transcript_exons{$container}}[$i]}[1]){ 1074 | my $lastDist=${${$transcript_exons{$container}}[$i]}[2] - $containedRightEnd ; 1075 | $lastDist=0 if $lastDist<0; #account for exonOverhangTolerance 1076 | $distToRight+=$lastDist ; 1077 | last; 1078 | } 1079 | elsif ($containedRightEnd > ${${$transcript_exons{$container}}[$i]}[2]){ 1080 | die "ERROR: Program died due to a bug (in endDistances subroutine: $containedRightEnd > ${${$transcript_exons{$container}}[$i]}[2])\n, sorry. Please contact author.\n"; 1081 | } 1082 | } 1083 | #print STDERR "\tFinal distToLeft: $distToLeft\n"; 1084 | #print STDERR "\tFinal distToRight: $distToRight\n"; 1085 | push(@distsToLeft, $distToLeft); 1086 | push(@distsToRight, $distToRight); 1087 | 1088 | } 1089 | return (\@distsToLeft, \@distsToRight); 1090 | } 1091 | 1092 | 1093 | 1094 | sub checkIntronExonOverlap{ 1095 | #verify that when two A and B transcripts have compatible intron chains, there is no terminal exon/intron overlap 1096 | #returns 1 if any overlap found, 0 otherwise 1097 | my $trA=$_[0]; 1098 | my $trB=$_[1]; 1099 | my $firstTrAIntronMatchedToTrB=$_[2]; 1100 | my $lastTrAIntronMatchedToTrB=$_[3]; 1101 | unless (defined $firstTrAIntronMatchedToTrB){ 1102 | die "ERROR: firstTrAIntronMatchedToTrB is undefined for $transcript_index{$trA} / $transcript_index{$trB}\n"; 1103 | } 1104 | unless (defined $lastTrAIntronMatchedToTrB){ 1105 | die "ERROR: lastTrAIntronMatchedToTrB is undefined for $transcript_index{$trA} / $transcript_index{$trB}\n"; 1106 | 1107 | } 1108 | my @exons2=(${$transcript_exons{$trB}}[0],${$transcript_exons{$trB}}[-1]); #only terminal exons 1109 | for (my $i=0; $i<=$#exons2;$i++) { 1110 | my $exon2=$exons2[$i]; 1111 | if($i==0){ #first exon of trB 1112 | unless ($firstTrAIntronMatchedToTrB ==0) { # if the first intron of trA matched to trB is trA[0], no need to look for upstream introns on trA 1113 | 1114 | my $intron1=${$transcript_introns{$trA}}[$firstTrAIntronMatchedToTrB-1]; 1115 | #my $overlap=overlap(${$intron1}[1], ${$intron1}[2], ${$exon2}[1], ${$exon2}[2]); 1116 | if( ${$exon2}[1] <= ${$intron1}[2] - $exonOverhangTolerance){ 1117 | return 1; 1118 | } 1119 | } 1120 | } 1121 | elsif($i==1){ #last exon of trB 1122 | unless ($lastTrAIntronMatchedToTrB == $#{$transcript_introns{$trA}} ){ # if the last intron of trA matched to trB is trA[-1], no need to look for downstream introns on trA 1123 | my $intron1=${$transcript_introns{$trA}}[$lastTrAIntronMatchedToTrB+1]; 1124 | #my $overlap=overlap(${$intron1}[1], ${$intron1}[2], ${$exon2}[1], ${$exon2}[2]); 1125 | if(${$exon2}[2] >= ${$intron1}[1] + $exonOverhangTolerance){ 1126 | return 1; 1127 | } 1128 | } 1129 | } 1130 | else{ 1131 | die; 1132 | } 1133 | 1134 | } 1135 | 1136 | return 0; 1137 | } 1138 | 1139 | 1140 | 1141 | sub makeTmId{ 1142 | my $id=$_[0]; 1143 | my @newId=split("", $id); 1144 | my @prepend=(join("",split("", $tmPrefix)),'T','M','_'); 1145 | my $totalLength=($#prepend+13)-length($id); 1146 | for (my $i=$#prepend+1;$i<$totalLength; $i++){ 1147 | $prepend[$i]=0; 1148 | } 1149 | unshift(@newId, @prepend); 1150 | return(join("",@newId)) 1151 | } 1152 | 1153 | 1154 | sub adjustContainerEnds{ 1155 | my $tr1=$_[0]; 1156 | my $tr2=$_[1]; 1157 | if(${$transcript_exons{$tr2}}[0][1] < ${$transcript_exons{$tr1}}[0][1]){ 1158 | ${$transcript_exons{$tr1}}[0][1] = ${$transcript_exons{$tr1}}[0][1] 1159 | } 1160 | if(${$transcript_exons{$tr2}}[-1][2] > ${$transcript_exons{$tr1}}[-1][2]){ 1161 | ${$transcript_exons{$tr1}}[-1][2] = ${$transcript_exons{$tr2}}[-1][2] 1162 | } 1163 | } 1164 | 1165 | sub transferRightExonsIntrons{ 1166 | my $tr1=$_[0]; #container 1167 | my $tr2=$_[1]; #content 1168 | my $tr2IntronIndex=$_[2]; 1169 | my $leftmostTr2ExonToTransfer=$tr2IntronIndex+1; 1170 | pop @{$transcript_exons{$tr1}}; 1171 | for (my $m=$leftmostTr2ExonToTransfer; $m <= $#{$transcript_exons{$tr2}} ;$m++) { 1172 | push (@{$transcript_exons{$tr1}}, \@{${$transcript_exons{$tr2}}[$m]} ); 1173 | } 1174 | for (my $m=$tr2IntronIndex+1; $m <= $#{$transcript_introns{$tr2}} ;$m++) { 1175 | push (@{$transcript_introns{$tr1}}, \@{${$transcript_introns{$tr2}}[$m]} ); 1176 | } 1177 | 1178 | } 1179 | --------------------------------------------------------------------------------