├── ExTraMapper.py
├── ExTraMapper_Figure.jpg
├── Human-Monkey-Processed-Data
│   ├── README.md
│   ├── config.human-monkey.conf
│   ├── extMpreprocess
│   └── scripts
│       ├── ensemblUtils.py
│       ├── liftOver
│       ├── liftover-withMultiples
│       ├── parseAndPicklePerPair.py
│       └── splitExonsIntoIndividualFiles.py
├── Human-Mouse-Preprocess-Data
│   ├── README.md
│   ├── config.human-mouse.conf
│   ├── extMpreprocess
│   └── scripts
│       ├── ensemblUtils.py
│       ├── liftOver
│       ├── liftover-withMultiples
│       ├── parseAndPicklePerPair.py
│       └── splitExonsIntoIndividualFiles.py
├── LICENSE
├── README.md
├── Result
│   ├── Exon-Pairs
│   │   └── README.md
│   └── Transcript-Pairs
│       ├── ExTraMapper_Transcript_Mapping_ENSMBL102_Genome_Build_Human_vs_Mouse.xlsx
│       ├── ExTraMapper_Transcript_Mapping_ENSMBL102_Human_vs_Monkey.xlsx
│       ├── ExTraMapper_Transcript_Mapping_ENSMBL81_Genome_Build_Human_vs_Mouse.xlsx
│       └── README.md
└── extMsummarise

/ExTraMapper_Figure.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ay-lab/ExTraMapper/ff8bf6399e457c041e10ab8d94c83ae54414b273/ExTraMapper_Figure.jpg
--------------------------------------------------------------------------------
/Human-Monkey-Processed-Data/README.md:
--------------------------------------------------------------------------------
 1 | ## Steps to generate the input files (Human - Monkey)
 2 | The users should run _extMpreprocess_ to generate the input files. All the input files will be generated under the _preprocess/data_ folder. All the required executables and scripts are provided here. _extMpreprocess_ has 7 individual steps and should be run in the following manner.
 3 | 
 4 | ### Run the following steps
 5 | 
 6 | - ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) For help, type
 7 | 
 8 | ```bash
 9 | ./extMpreprocess help
10 | 
11 | This script will download and preprocess the dataset required for exon-pair and transcript-pair finding by ExTraMapper.
12 | Type ./extMpreprocess <config.conf> <step> to execute the script.
13 | Type ./extMpreprocess example to print an example config.conf file.
14 | 
15 | This script will run seven (7) sequential steps to create the inputs for the ExTraMapper program.
16 | Users can provide step numbers (1-7) or all as the argument of this script.
17 | Short description of the individual steps:
18 | Step 1: Download per-organism files, e.g. reference genomes and gene annotation files.
19 | Step 2: Create genomedata archives with the genomes of org1 and org2 (make sure the genomedata package is installed).
20 | Step 3: Create pickle files for each homologous gene pair.
21 | Step 4: Perform coordinate liftOver of exons with multiple mappings (this step requires the bedtools and liftOver executables).
22 | Steps 5-7: Postprocess the liftOver files.
23 | 
24 | example:
25 | 
26 | ./extMpreprocess config.human-monkey.conf all
27 | ```
28 | 
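The steps do not have to be run in one shot: as the help text above states, the script accepts either `all` or a single step number (1-7). A minimal usage sketch, assuming the bundled `config.human-monkey.conf` is used unchanged:

```bash
# run everything end to end
./extMpreprocess config.human-monkey.conf all

# or run one step at a time, e.g. the downloads first, then the genomedata archives
./extMpreprocess config.human-monkey.conf 1
./extMpreprocess config.human-monkey.conf 2
```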
29 |
 30 | - ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) The script requires the genomedata package, which can be installed by running the following command.
31 | 32 | ```bash 33 | $ pip install genomedata --user 34 | ``` 35 |
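Step 2 of _extMpreprocess_ builds the archives by calling the `genomedata-load-seq` and `genomedata-close-data` command-line tools, so it is worth confirming that the package imported correctly and that those tools are on your `PATH` before starting the long run. This is only a sanity check, not part of the pipeline itself:

```bash
# check the Python package is importable (Genome is the class used by the bundled scripts)
python -c "from genomedata import Genome; print('genomedata import OK')"

# check the command-line tools used by step 2 are visible
which genomedata-load-seq genomedata-close-data
```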
36 | 37 |
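Step 4 additionally expects `bedtools` and the UCSC `liftOver` binary (a copy ships in the `scripts/` folder and is also referenced from the config file). A short pre-flight check, assuming you rely on the bundled liftOver copy:

```bash
chmod u+x scripts/liftOver            # as noted in the config file, the binary must be executable
which bedtools                        # bedtools must be on your PATH for step 4
./scripts/liftOver 2>&1 | head -n 5   # liftOver with no arguments prints its usage message
```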
 38 | 
 39 | #### Once finished, the _extMpreprocess_ script should produce the _preprocess_ folder with the following subfolders.
40 | 41 | ```bash 42 | ./preprocess 43 | |-- bin 44 | | `-- liftOver 45 | `-- data 46 | |-- human-rhesus 47 | | |-- GTFsummaries 48 | | | |-- onlyOrthologAndCodingGenes 49 | | | | |-- org1-allExons-GTFparsed.txt 50 | | | | |-- org1-allGenes-GTFparsed.txt 51 | | | | |-- org1-allTranscripts-GTFparsed.txt 52 | | | | |-- org2-allExons-GTFparsed.txt 53 | | | | |-- org2-allGenes-GTFparsed.txt 54 | | | | `-- org2-allTranscripts-GTFparsed.txt 55 | | | |-- org1-allExons-GTFparsed.txt 56 | | | |-- org1-allGenes-GTFparsed.txt 57 | | | |-- org1-allTranscripts-GTFparsed.txt 58 | | | |-- org2-allExons-GTFparsed.txt 59 | | | |-- org2-allGenes-GTFparsed.txt 60 | | | `-- org2-allTranscripts-GTFparsed.txt 61 | | |-- ensemblDownloads 62 | | | |-- org1.gtf 63 | | | |-- org1.gtf.gz 64 | | | |-- org1_homolog_org2.txt 65 | | | |-- org1_homolog_org2.txt.gz 66 | | | |-- org2.gtf 67 | | | |-- org2.gtf.gz 68 | | | |-- org2_homolog_org1.txt 69 | | | `-- org2_homolog_org1.txt.gz 70 | | |-- genePairsSummary-one2one.txt 71 | | |-- genomedataArchives 72 | | | |-- org1 [27 entries exceeds filelimit, not opening dir] 73 | | | `-- org2 [23 entries exceeds filelimit, not opening dir] 74 | | |-- liftoverRelatedFiles [56 entries exceeds filelimit, not opening dir] 75 | | |-- perExonLiftoverCoords 76 | | | |-- org1 [619127 entries exceeds filelimit, not opening dir] 77 | | | `-- org2 [260616 entries exceeds filelimit, not opening dir] 78 | | `-- perGenePairPickledInfo [16150 entries exceeds filelimit, not opening dir] 79 | |-- liftover_chains 80 | | |-- hg38 81 | | | `-- liftOver 82 | | | `-- hg38ToRheMac10.over.chain.gz 83 | | `-- rheMac10 84 | | `-- liftOver 85 | | `-- rheMac10ToHg38.over.chain.gz 86 | `-- reference_genomes 87 | |-- hg38 [27 entries exceeds filelimit, not opening dir] 88 | `-- rheMac10 [24 entries exceeds filelimit, not opening dir] 89 | ``` 90 | 91 | ##### The whole process should take several hours to complete! 92 | ##### [(Check also the Human-Mouse data processing steps)](https://github.com/ay-lab/ExTraMapper/tree/master/Human-Mouse-Preprocess-Data) 93 | -------------------------------------------------------------------------------- /Human-Monkey-Processed-Data/config.human-monkey.conf: -------------------------------------------------------------------------------- 1 | # reference genome versions 2 | ref1=hg38 3 | ref2=rheMac10 4 | 5 | # short names of organisms 6 | org1=human 7 | org2=rhesus 8 | 9 | # Ensembl release version number to be used for both organisms 10 | releaseNo=102 11 | 12 | # Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/ 13 | org1EnsemblName=homo_sapiens 14 | org2EnsemblName=macaca_mulatta 15 | 16 | # Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_81 17 | org1EnsemblMartName=hsapiens 18 | org2EnsemblMartName=mmulatta 19 | org1EnsemblMartNameShort=hsap 20 | org2EnsemblMartNameShort=mmul 21 | 22 | #liftOver executable path (Please make sure it is executable, chmod u+x liftOver) 23 | liftOver=/Human-Monkey-Preprocess-Data/scripts/liftOver 24 | -------------------------------------------------------------------------------- /Human-Monkey-Processed-Data/extMpreprocess: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | ## This script will download and preprocess the dataset required for 4 | ## exon-pair and transcript pair finding by ExTraMapper. 
5 | ## The script requires a config.conf file which will direct this script 6 | ## to download and process the essential data. 7 | 8 | ##################### config.conf file ##################### 9 | ## Example of human-monkey confif.conf file: 10 | ## 11 | ## #Reference genome versions 12 | ## ref1=hg38 13 | ## ref2=rheMac10 14 | ## 15 | ## #Short names of organisms 16 | ## org1=human 17 | ## org2=rhesus 18 | ## 19 | ## #Ensembl release version number to be used for both organisms 20 | ## releaseNo=102 21 | ## 22 | ## #Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/ 23 | ## org1EnsemblName=homo_sapiens 24 | ## org2EnsemblName=macaca_mulatta 25 | ## 26 | ## #Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_102 27 | ## org1EnsemblMartName=hsapiens 28 | ## org2EnsemblMartName=mmulatta 29 | ## org1EnsemblMartNameShort=hsap 30 | ## org2EnsemblMartNameShort=mmul 31 | ## 32 | ## #liftOver executable path (Check here https://hgdownload.cse.ucsc.edu/admin/exe) 33 | ## liftOver=./usr/bin/liftOver 34 | ## 35 | ## 36 | ## Example of human-mouse confif.conf file: 37 | ## 38 | ## #Reference genome versions 39 | ## ref1=hg38 40 | ## ref2=mm10 41 | ## 42 | ## #Short names of organisms 43 | ## org1=human 44 | ## org2=mouse 45 | ## 46 | ## #Ensembl release version number to be used for both organisms 47 | ## releaseNo=102 48 | ## 49 | ## #Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/ 50 | ## org1EnsemblName=homo_sapiens 51 | ## org2EnsemblName=mus_musculus 52 | ## 53 | ## #Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_102 54 | ## org1EnsemblMartName=hsapiens 55 | ## org2EnsemblMartName=mmusculus 56 | ## org1EnsemblMartNameShort=hsap 57 | ## org2EnsemblMartNameShort=mmus 58 | ## 59 | ## #liftOver executable path (Check here https://hgdownload.cse.ucsc.edu/admin/exe) 60 | ## liftOver=./usr/bin/liftOver 61 | ## 62 | ############################################################ 63 | 64 | if ($#ARGV == -1 || $ARGV[0] eq "help") { 65 | print ("\n"); 66 | print ("This script will download and preprocess the dataset required for exon-pair and transcript pair finding by ExTraMapper.\n"); 67 | print ("Type ./extMpreprocess to execute the script.\n"); 68 | print ("Type ./extMpreprocess example to print a example config.conf file.\n\n"); 69 | print ("This script will run seven (7) sequential steps to create the inputs for ExTraMapper program.\n"); 70 | print ("Users can provide step numbers (1-7) or all in the arugemt of this script.\n"); 71 | print ("Short description of the individual scripts:\n"); 72 | print ("Step 1: Download per organism specific files e.g. 
reference genomes, gene annotation files.\n"); 73 | print ("Step 2: Will create genomedata archives with the genomes of org1 and org2 (Make sure to install genomedata package).\n"); 74 | print ("Step 3: Pickle files for each homologous gene pair will be created.\n"); 75 | print ("Step 4: Perform coordinate liftOver of exons with multiple mappings (This step requires bedtools and liftOver executables).\n"); 76 | print ("Step 5-7: postprocessing the liftOver files.\n"); 77 | print ("\n"); 78 | exit(); 79 | } elsif ($ARGV[0] eq "example") { 80 | my @exmpl = "# reference genome versions 81 | ref1=hg38 82 | ref2=mm10 83 | 84 | # short names of organisms 85 | org1=human 86 | org2=mouse 87 | 88 | # Ensembl release version number to be used for both organisms 89 | releaseNo=102 90 | 91 | # Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/ 92 | org1EnsemblName=homo_sapiens 93 | org2EnsemblName=mus_musculus 94 | 95 | # Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_81 96 | org1EnsemblMartName=hsapiens 97 | org2EnsemblMartName=mmusculus 98 | org1EnsemblMartNameShort=hsap 99 | org2EnsemblMartNameShort=mmus 100 | 101 | #liftOver executable path (Check here https://hgdownload.cse.ucsc.edu/admin/exe) 102 | liftOver=/usr/bin/liftOver\n"; 103 | 104 | print (@exmpl); 105 | print ("\n"); 106 | open (out, ">config.human-mouse.conf"); 107 | print out @exmpl; 108 | close out; 109 | print ("The example config.human-mouse.conf file is written\n"); 110 | exit; 111 | } 112 | my ($configfile, $step) = @ARGV; 113 | chomp ($configfile, $step); 114 | 115 | #### File and folder check #### 116 | die "The $configfile does not exists, exit!" 
unless -e "$configfile"; 117 | 118 | 119 | #### Get the environmental variables #### 120 | $ENV{'EXTRAMAPPER_DIR'} = $ENV{'PWD'}; 121 | open(in, $configfile); 122 | while (my $var = ) { 123 | chomp $var; 124 | if ($var =~ /=/) { 125 | $var_n = (split(/=/,$var))[0]; 126 | $var_v = (split(/=/,$var))[1]; 127 | $ENV{$var_n} = $var_v; 128 | } 129 | } 130 | close in; 131 | 132 | #### Set the variable folders and files #### 133 | $dataDir = "$ENV{'EXTRAMAPPER_DIR'}/preprocess/data"; 134 | $dataDirPerPair = "$ENV{'EXTRAMAPPER_DIR'}/preprocess/data/$ENV{'org1'}-$ENV{'org2'}"; 135 | $referenceGenomesDir = "$dataDir/reference_genomes"; 136 | $chainsDir = "$dataDir/liftover_chains"; 137 | $ensemblDir = "$dataDirPerPair/ensemblDownloads"; 138 | $genomedataDir = "$dataDirPerPair/genomedataArchives"; 139 | $GTFsummaryDir = "$dataDirPerPair/GTFsummaries"; 140 | $perGenePairPickleDir= "$dataDirPerPair/perGenePairPickledInfo"; 141 | $liftOverFilesDir = "$dataDirPerPair/liftoverRelatedFiles"; 142 | $perExonLiftoverDir = "$dataDirPerPair/perExonLiftoverCoords"; 143 | 144 | #### Main functions and sub-routines #### 145 | sub getfasta { 146 | my $path = $_[0]; 147 | my $org = $_[1]; 148 | my %chr; 149 | open(chrname,"$path/$org/name_chr.txt"); 150 | while ( ){ 151 | chomp $_; 152 | $chr{$_} = 1; 153 | } 154 | close (chrname); 155 | 156 | my $file = "$path/$org/$org.fa.gz"; 157 | open(in, "zcat $file |"); 158 | while ( ) { 159 | chomp $_; 160 | if ($_ =~ />/) { 161 | $name = $_; 162 | $ckpt = 0; 163 | $name =~ s/>//g; 164 | if ($chr{$name} ne "") { 165 | print ("Extracting $name from $org.fa.gz file\n"); 166 | $ckpt = 1; 167 | open($out,"|gzip -c > $path/$org/$name.fa.gz"); 168 | print $out (">$name\n"); 169 | } else { 170 | close ($out); 171 | } 172 | } else { 173 | if ($ckpt == 1) { 174 | print $out ("$_\n"); 175 | } 176 | } 177 | } 178 | close(in); 179 | system("rm -rf $path/$org/$org.fa.gz"); 180 | print ("Finished extracting chromosomes and writing the individual *.fa.gz files\n"); 181 | print ("Removed $path/$org/$org.fa.gz\n"); 182 | } 183 | 184 | sub downloadrefgenome { 185 | 186 | my $path = $_[0]; 187 | my $org = $_[1]; 188 | if (!-d "$path/$org") { 189 | print ("Creating $path/$org folder\n"); 190 | system("mkdir -p $path/$org"); 191 | print ("Running: wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/chromosomes/* --directory-prefix=$path/$org 2>&1 | grep \"Login incorrect\"\n"); 192 | my $error = `wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/chromosomes/* --directory-prefix=$path/$org 2>&1 | grep "No such directory"`; 193 | if ($error =~ "No such directory") { 194 | print ("There is no chromosome folder for $org. 
So, downloding the bigZip file and extracting them\n"); 195 | print ("Running: wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/bigZips/$org.fa.gz --directory-prefix=$path/$org 2> /dev/null\n"); 196 | system("wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/bigZips/$org.fa.gz --directory-prefix=$path/$org 2> /dev/null"); 197 | print ("Extracting the individual chromosomes\n"); 198 | print ("zcat $path/$org/$org.fa.gz |grep \">\" |grep -v \"_random\" |grep -v \"chrUn\" |sed 's/>//g' > $path/$org/name_chr.txt\n"); 199 | system("zcat $path/$org/$org.fa.gz |grep \">\" |grep -v \"_random\" |grep -v \"chrUn\" |sed 's/>//g' > $path/$org/name_chr.txt"); 200 | getfasta($path, $org); 201 | print "Reference genomes are downloaded in $path/$org\n"; 202 | } else { 203 | system("rm -rf $path/$org/*_random*"); 204 | system("rm -rf $path/$org/chrUn*"); 205 | system("rm -rf $path/$org/*_alt*"); 206 | } 207 | } else { 208 | print ("$path/$org folder already exists, skipping downloading the dataset\n"); 209 | } 210 | } 211 | 212 | sub downloadliftoverfiles { 213 | 214 | my $path = $_[0]; 215 | my $org1 = $_[1]; 216 | my $org2 = $_[2]; 217 | if (!-d "$path/$org1/liftOver") { 218 | print ("Creating $path/$org1/liftOver folder\n"); 219 | system("mkdir -p $path/$org1/liftOver"); 220 | my $ref2Cap =`echo $org2 | python -c "s=input(); print (s[0].upper()+s[1:])"`; 221 | chomp $ref2Cap; 222 | my $chain_name = $org1."To".$ref2Cap; 223 | print ("Running: wget http://hgdownload.cse.ucsc.edu/goldenPath/$org1/liftOver/$chain_name.over.chain.gz --directory-prefix=$path/$org1/liftOver\n"); 224 | system("wget http://hgdownload.cse.ucsc.edu/goldenPath/$org1/liftOver/$chain_name.over.chain.gz --directory-prefix=$path/$org1/liftOver 2> /dev/null"); 225 | print ("LiftOver chain saved to $path/$org1/liftOver/$chain_name.over.chain.gz\n"); 226 | } else { 227 | print ("$path/$org1 folder already exists, skipping download\n"); 228 | } 229 | } 230 | 231 | sub downloadensmblfiles { 232 | 233 | my $path = $_[0]; 234 | my $releaseNo = $_[1]; 235 | my $org1EnsemblName = $_[2]; 236 | my $org1EnsemblMartName = $_[3]; 237 | my $org2EnsemblName = $_[4]; 238 | my $org2EnsemblMartName = $_[5]; 239 | 240 | print ("Downloading GTF files\n"); 241 | if (!-e "$path/org1.gtf.gz") { 242 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org1EnsemblName/*.$releaseNo.gtf.gz -O $path/org1.gtf.gz\n"); 243 | system("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org1EnsemblName/*.$releaseNo.gtf.gz -O $path/org1.gtf.gz 2> /dev/null"); 244 | print ("GTF files downloaded in $path\n"); 245 | } else { 246 | print ("$path/org1.gtf.gz file exists, skipping download\n"); 247 | } 248 | if (!-e "$path/org2.gtf.gz") { 249 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org2EnsemblName/*.$releaseNo.gtf.gz -O $path/org2.gtf.gz\n"); 250 | system("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org2EnsemblName/*.$releaseNo.gtf.gz -O $path/org2.gtf.gz 2> /dev/null"); 251 | print ("GTF files downloaded in $path\n"); 252 | } else { 253 | print ("$path/org2.gtf.gz file exists, skipping download\n"); 254 | } 255 | 256 | print ("Downloading ENSEMBL homologs\n"); 257 | if (!-e "$path/org1_homolog_org2.txt.gz") { 258 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org1EnsemblMartName\_gene_ensembl__homolog_$org2EnsemblMartName\__dm.txt.gz -O $path/org1_homolog_org2.txt.gz\n"); 259 | system("wget 
ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org1EnsemblMartName\_gene_ensembl__homolog_$org2EnsemblMartName\__dm.txt.gz -O $path/org1_homolog_org2.txt.gz 2> /dev/null"); 260 | print ("ENSEMBL homolog downloaded in $path\n"); 261 | } else { 262 | print ("$path/org1_homolog_org2.txt.gz file exists, skipping download\n"); 263 | } 264 | 265 | if (!-e "$path/org2_homolog_org1.txt.gz") { 266 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org2EnsemblMartName\_gene_ensembl__homolog_$org1EnsemblMartName\__dm.txt.gz -O $path/org2_homolog_org1.txt.gz\n"); 267 | system("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org2EnsemblMartName\_gene_ensembl__homolog_$org1EnsemblMartName\__dm.txt.gz -O $path/org2_homolog_org1.txt.gz 2> /dev/null"); 268 | print ("ENSEMBL homolog downloaded in $path\n"); 269 | } else { 270 | print ("$path/org2_homolog_org1.txt.gz file exists, skipping download\n"); 271 | } 272 | 273 | } 274 | 275 | sub ltime { 276 | 277 | my $time = localtime; 278 | return($time); 279 | } 280 | 281 | sub genomedataarchive { 282 | 283 | my $path = $_[0]; 284 | my $org = $_[1]; 285 | my $ref = $_[2]; 286 | my $referenceGenomesDir = $_[3]; 287 | my $old_path = $ENV{'PWD'}; 288 | chdir $path; 289 | if (-e "$ref.fa") { 290 | print ("Deleting the existing $ref.fa\n"); 291 | system("rm -rf $ref.fa"); 292 | } 293 | if (!-d $org) { 294 | print ("Running : zcat $referenceGenomesDir/$ref/*.fa.gz > $ref.fa\n"); 295 | print ("Started at ",ltime(),"\n"); 296 | system("zcat $referenceGenomesDir/$ref/*.fa.gz > $ref.fa"); 297 | print ("Ended at ",ltime(),"\n"); 298 | print ("Running : genomedata-load-seq -d $org $ref.fa\n"); 299 | print ("Started at ",ltime(),"\n"); 300 | system("genomedata-load-seq -d $org $ref.fa"); 301 | system("genomedata-close-data $org"); 302 | print ("Ended at ",ltime(),"\n"); 303 | system("rm -rf $ref.fa"); 304 | } else { 305 | print ("$org genomedata exists, skipping the step\n"); 306 | } 307 | chdir $old_path; 308 | } 309 | 310 | sub parseAndPicklePerPair { 311 | 312 | my $extmapper_path = $_[0]; 313 | my $ensemblDir = $_[1]; 314 | my $dataDirPerPair = $_[2]; 315 | my $GTFsummaryDir = $_[3]; 316 | my $perGenePairPickleDir = $_[4]; 317 | 318 | if (!-e "$ensemblDir/org1.gtf") { 319 | print ("Running : gunzip -k $ensemblDir/org1.gtf.gz\n"); 320 | system("gunzip -k $ensemblDir/org1.gtf.gz"); 321 | } else { 322 | print ("$ensemblDir/org1.gtf file present, skipping gunzip action\n"); 323 | } 324 | if (!-e "$ensemblDir/org2.gtf") { 325 | print ("Running : gunzip -k $ensemblDir/org2.gtf.gz\n"); 326 | system("gunzip -k $ensemblDir/org2.gtf.gz"); 327 | } else { 328 | print ("$ensemblDir/org2.gtf file present, skipping gunzip action\n"); 329 | } 330 | if (!-e "$ensemblDir/org1_homolog_org2.txt") { 331 | print ("Running : gunzip -k $ensemblDir/org1_homolog_org2.txt.gz\n"); 332 | system("gunzip -k $ensemblDir/org1_homolog_org2.txt.gz"); 333 | } else { 334 | print ("$ensemblDir/org1_homolog_org2.txt file present, skipping gunzip action\n"); 335 | } 336 | if (!-e "$ensemblDir/org2_homolog_org1.txt") { 337 | print ("Running : gunzip -k $ensemblDir/org2_homolog_org1.txt.gz\n"); 338 | system("gunzip -k $ensemblDir/org2_homolog_org1.txt.gz"); 339 | } else { 340 | print ("$ensemblDir/org2_homolog_org1.txt file present, skipping gunzip action\n"); 341 | } 342 | 343 | if (!-d $perGenePairPickleDir) { 344 | print ("Running : python $extmapper_path/scripts/parseAndPicklePerPair.py $dataDirPerPair 
$GTFsummaryDir $perGenePairPickleDir\n"); 345 | print ("Started at ",ltime(),"\n"); 346 | system("python $extmapper_path/scripts/parseAndPicklePerPair.py $dataDirPerPair $GTFsummaryDir $perGenePairPickleDir"); 347 | print ("Ended at ",ltime(),"\n"); 348 | system("mv $perGenePairPickleDir/genePairsSummary-one2one.txt $dataDirPerPair/genePairsSummary-one2one.txt"); 349 | } else { 350 | print ("perGenePairPickleDir found, skipping\n"); 351 | } 352 | } 353 | 354 | sub liftoverexonmultiplemapping { 355 | 356 | my $GTFsummaryDir = $_[0]; 357 | my $liftOverFilesDir = $_[1]; 358 | my $chainsDir = $_[2]; 359 | my $ref1 = $_[3]; 360 | my $ref2 = $_[4]; 361 | my $extmapper_path = $_[5]; 362 | 363 | my $indir = "$GTFsummaryDir/onlyOrthologAndCodingGenes"; 364 | 365 | print ("Running : cat $indir/org1-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_allExonsList.bed\n"); 366 | print ("Started at ",ltime(),"\n"); 367 | system("cat $indir/org1-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_allExonsList.bed"); 368 | print ("Ended at ",ltime(),"\n"); 369 | 370 | print ("Running : cat $indir/org2-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allExonsList.bed\n"); 371 | print ("Started at ",ltime(),"\n"); 372 | system("cat $indir/org2-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allExonsList.bed"); 373 | print ("Ended at ",ltime(),"\n"); 374 | 375 | print ("Running : cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_partCodingExonsList.bed\n"); 376 | print ("Started at ",ltime(),"\n"); 377 | system("cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_partCodingExonsList.bed"); 378 | print ("Ended at ",ltime(),"\n"); 379 | 380 | print ("Running : cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_partCodingExonsList.bed\n"); 381 | print ("Started at ",ltime(),"\n"); 382 | system("cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_partCodingExonsList.bed"); 383 | print ("Ended at ",ltime(),"\n"); 384 | 385 | print ("Running : cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org1_f.temp\n"); 386 | print ("Started at ",ltime(),"\n"); 387 | system("cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org1_f.temp"); 388 | print ("Ended at ",ltime(),"\n"); 389 | 390 | print ("Running : cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org2_f.temp\n"); 391 | print ("Started at ",ltime(),"\n"); 392 | system("cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org2_f.temp"); 393 | print ("Ended at ",ltime(),"\n"); 394 | 395 | print ("Running : cat $liftOverFilesDir/org1_partCodingExonsList.bed $liftOverFilesDir/org1_f.temp | sort 
-k1,1 -k2,2n > $liftOverFilesDir/org1_allCodingExonsList.bed\n"); 396 | print ("Started at ",ltime(),"\n"); 397 | system("cat $liftOverFilesDir/org1_partCodingExonsList.bed $liftOverFilesDir/org1_f.temp | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_allCodingExonsList.bed"); 398 | print ("Ended at ",ltime(),"\n"); 399 | 400 | print ("Running : cat $liftOverFilesDir/org2_partCodingExonsList.bed $liftOverFilesDir/org2_f.temp | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allCodingExonsList.bed\n"); 401 | print ("Started at ",ltime(),"\n"); 402 | system("cat $liftOverFilesDir/org2_partCodingExonsList.bed $liftOverFilesDir/org2_f.temp | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allCodingExonsList.bed"); 403 | print ("Ended at ",ltime(),"\n"); 404 | 405 | print ("Running : cat $liftOverFilesDir/org1_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allCodingExonsList.sorted.temp\n"); 406 | print ("Started at ",ltime(),"\n"); 407 | system("cat $liftOverFilesDir/org1_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allCodingExonsList.sorted.temp"); 408 | print ("Ended at ",ltime(),"\n"); 409 | 410 | print ("Running : cat $liftOverFilesDir/org2_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allCodingExonsList.sorted.temp\n"); 411 | print ("Started at ",ltime(),"\n"); 412 | system("cat $liftOverFilesDir/org2_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allCodingExonsList.sorted.temp"); 413 | print ("Ended at ",ltime(),"\n"); 414 | 415 | print ("Running : cat $liftOverFilesDir/org1_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allExonsList.sorted.temp\n"); 416 | print ("Started at ",ltime(),"\n"); 417 | system("cat $liftOverFilesDir/org1_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allExonsList.sorted.temp"); 418 | print ("Ended at ",ltime(),"\n"); 419 | 420 | print ("Running : cat $liftOverFilesDir/org2_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allExonsList.sorted.temp\n"); 421 | print ("Started at ",ltime(),"\n"); 422 | system("cat $liftOverFilesDir/org2_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allExonsList.sorted.temp"); 423 | print ("Ended at ",ltime(),"\n"); 424 | 425 | print ("Running : cat $liftOverFilesDir/org1_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_partCodingExonsList.sorted.temp\n"); 426 | print ("Started at ",ltime(),"\n"); 427 | system("cat $liftOverFilesDir/org1_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_partCodingExonsList.sorted.temp"); 428 | print ("Ended at ",ltime(),"\n"); 429 | 430 | print ("Running : cat $liftOverFilesDir/org2_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_partCodingExonsList.sorted.temp\n"); 431 | print ("Started at ",ltime(),"\n"); 432 | system("cat $liftOverFilesDir/org2_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_partCodingExonsList.sorted.temp"); 433 | print ("Ended at ",ltime(),"\n"); 434 | 435 | my $chain1to2=`ls $chainsDir/$ref1/liftOver/*.over.chain.gz`; 436 | my $chain2to1=`ls $chainsDir/$ref2/liftOver/*.over.chain.gz`; 437 | chomp ($chain1to2, $chain2to1); 438 | 439 | foreach my $minMatch (qw{1 0.95 0.9}) { 440 | print ("Running : $extmapper_path/scripts/liftover-withMultiples 0 $minMatch $chain1to2 
$chain2to1\n"); 441 | print ("Started at ",ltime(),"\n"); 442 | system("$extmapper_path/scripts/liftover-withMultiples 0 $minMatch $chain1to2 $chain2to1"); 443 | print ("Ended at ",ltime(),"\n"); 444 | } 445 | system("rm -rf $liftOverFilesDir/org2_allExonsList.sorted.temp"); 446 | system("rm -rf $liftOverFilesDir/org1_allExonsList.sorted.temp"); 447 | system("rm -rf $liftOverFilesDir/org2_partCodingExonsList.sorted.temp"); 448 | system("rm -rf $liftOverFilesDir/org1_partCodingExonsList.sorted.temp"); 449 | system("rm -rf $liftOverFilesDir/org2_allCodingExonsList.sorted.temp"); 450 | system("rm -rf $liftOverFilesDir/org1_allCodingExonsList.sorted.temp"); 451 | } 452 | 453 | sub liftoverfilesprocess { 454 | 455 | my $indir = $_[0]; 456 | my $outdir = $_[1]; 457 | my $flank = $_[2]; 458 | my $extmapper_path = $_[3]; 459 | 460 | if (-e "oneHugeFile-2to1-partCoding.txt") { 461 | system("rm -rf oneHugeFile-2to1-partCoding.txt"); 462 | } 463 | if (-e "oneHugeFile-1to2-partCoding.txt") { 464 | system("rm -rf oneHugeFile-1to2-partCoding.txt"); 465 | } 466 | 467 | foreach my $minMatch (qw{1 0.95 0.9}) { 468 | $suffix="flank$flank-minMatch$minMatch-multiples-partCoding"; 469 | print ("Running : zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt\n"); 470 | print ("Started at ",ltime(),"\n"); 471 | system("zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt"); 472 | print ("Ended at ",ltime(),"\n"); 473 | 474 | print ("Running : zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt\n"); 475 | print ("Started at ",ltime(),"\n"); 476 | system("zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt"); 477 | print ("Started at ",ltime(),"\n"); 478 | } 479 | 480 | if (-e "oneHugeFile-2to1-others.txt") { 481 | system("rm -rf oneHugeFile-2to1-others.txt"); 482 | } 483 | if (-e "oneHugeFile-1to2-others.txt") { 484 | system("rm -rf oneHugeFile-1to2-others.txt"); 485 | } 486 | 487 | foreach my $minMatch (qw{1 0.95 0.9}) { 488 | $suffix="flank$flank-minMatch$minMatch-multiples"; 489 | print ("Running : zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-others.txt\n"); 490 | print ("Started at ",ltime(),"\n"); 491 | system("zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-others.txt"); 492 | print ("Ended at ",ltime(),"\n"); 493 | 494 | print ("Running : zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-1to2-others.txt\n"); 495 | print ("Started at ",ltime(),"\n"); 496 | system("zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' 
s=$suffix >> oneHugeFile-1to2-others.txt"); 497 | print ("Ended at ",ltime(),"\n"); 498 | } 499 | 500 | print ("Running : cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k10,10 >oneHugeFile-1to2.txt.sorted\n"); 501 | print ("Started at ",ltime(),"\n"); 502 | system("cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k10,10 >oneHugeFile-1to2.txt.sorted"); 503 | print ("Ended at ",ltime(),"\n"); 504 | 505 | print ("Running : cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k10,10 >oneHugeFile-2to1.txt.sorted\n"); 506 | print ("Started at ",ltime(),"\n"); 507 | system("cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k10,10 >oneHugeFile-2to1.txt.sorted"); 508 | print ("Ended at ",ltime(),"\n"); 509 | 510 | system("mkdir -p $outdir/org1 $outdir/org2"); 511 | $whichCol=10; 512 | $fileSuffix="_mapped.txt"; 513 | 514 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix\n"); 515 | print ("Started at ",ltime(),"\n"); 516 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix"); 517 | print ("Ended at ",ltime(),"\n"); 518 | 519 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix\n"); 520 | print ("Started at ",ltime(),"\n"); 521 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix"); 522 | print ("Ended at ",ltime(),"\n"); 523 | 524 | print ("Removing temporary files\n"); 525 | system("rm -rf oneHugeFile*.txt"); 526 | 527 | } 528 | 529 | sub liftoverfilesprocessunmappedexons { 530 | 531 | my $indir = $_[0]; 532 | my $outdir = $_[1]; 533 | my $flank = $_[2]; 534 | my $extmapper_path = $_[3]; 535 | 536 | if (-e "oneHugeFile-2to1-partCoding.txt") { 537 | system("rm -rf oneHugeFile-2to1-partCoding.txt"); 538 | } 539 | if (-e "oneHugeFile-1to2-partCoding.txt") { 540 | system("rm -rf oneHugeFile-1to2-partCoding.txt"); 541 | } 542 | 543 | foreach my $minMatch (qw{1 0.95 0.9}) { 544 | $suffix="flank$flank-minMatch$minMatch-multiples-partCoding"; 545 | print ("Running : zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt\n"); 546 | print ("Started at ",ltime(),"\n"); 547 | system("zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt"); 548 | print ("Ended at ",ltime(),"\n"); 549 | 550 | print ("Running : zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt\n"); 551 | print ("Started at ",ltime(),"\n"); 552 | system("zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt"); 553 | print ("Ended at ",ltime(),"\n"); 554 | } 555 | 556 | if (-e "oneHugeFile-2to1-others.txt") { 557 | system("rm -rf oneHugeFile-2to1-others.txt"); 558 | } 559 | if (-e "oneHugeFile-1to2-others.txt") { 560 | system("rm -rf oneHugeFile-1to2-others.txt"); 561 | } 562 | 563 | foreach my $minMatch (qw{1 0.95 0.9}) { 564 | 
$suffix="flank$flank-minMatch$minMatch-multiples"; 565 | print ("Running : zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt\n"); 566 | print ("Started at ",ltime(),"\n"); 567 | system("zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt"); 568 | print ("Ended at ",ltime(),"\n"); 569 | 570 | print ("Running : zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt\n"); 571 | print ("Started at ",ltime(),"\n"); 572 | system("zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt"); 573 | print ("Ended at ",ltime(),"\n"); 574 | } 575 | 576 | print ("Running : cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted\n"); 577 | print ("Started at ",ltime(),"\n"); 578 | system("cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted"); 579 | print ("Ended at ",ltime(),"\n"); 580 | 581 | print ("Running : cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted\n"); 582 | print ("Started at ",ltime(),"\n"); 583 | system("cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted"); 584 | print ("Ended at ",ltime(),"\n"); 585 | 586 | system("mkdir -p $outdir/org1 $outdir/org2"); 587 | $whichCol=5; 588 | $fileSuffix="_unmapped.txt"; 589 | 590 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix\n"); 591 | print ("Started at ",ltime(),"\n"); 592 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix"); 593 | print ("Ended at ",ltime(),"\n"); 594 | 595 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix\n"); 596 | print ("Started at ",ltime(),"\n"); 597 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix"); 598 | print ("Ended at ",ltime(),"\n"); 599 | } 600 | 601 | sub liftoverfilesprocessmappedexons { 602 | 603 | my $indir = $_[0]; 604 | my $outdir = $_[1]; 605 | my $flank = $_[2]; 606 | my $extmapper_path = $_[3]; 607 | 608 | if (-e "oneHugeFile-2to1-others.txt") { 609 | system("rm -rf oneHugeFile-2to1-others.txt"); 610 | } 611 | if (-e "oneHugeFile-1to2-others.txt") { 612 | system("rm -rf oneHugeFile-1to2-others.txt"); 613 | } 614 | 615 | foreach my $minMatch (qw{1 0.95 0.9}) { 616 | $suffix="flank$flank-minMatch$minMatch-multiples"; 617 | print ("Running : zcat $indir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt\n"); 618 | print ("Started at ",ltime(),"\n"); 619 | system("zcat $indir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt"); 620 | print ("Ended at ",ltime(),"\n"); 621 | 622 | print ("Running : zcat $indir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed.gz |awk '{print 
\$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt\n"); 623 | print ("Started at ",ltime(),"\n"); 624 | system("zcat $indir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt"); 625 | print ("Ended at ",ltime(),"\n"); 626 | } 627 | 628 | print ("Running : cat oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted\n"); 629 | print ("Started at ",ltime(),"\n"); 630 | system("cat oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted"); 631 | print ("Ended at ",ltime(),"\n"); 632 | 633 | print ("Running : cat oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted\n"); 634 | print ("Started at ",ltime(),"\n"); 635 | system("cat oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted"); 636 | print ("Ended at ",ltime(),"\n"); 637 | 638 | system("mkdir -p $outdir/org1 $outdir/org2"); 639 | $whichCol=5; 640 | $fileSuffix="_nonintersecting.txt"; 641 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix\n"); 642 | print ("Started at ",ltime(),"\n"); 643 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix"); 644 | print ("Ended at ",ltime(),"\n"); 645 | 646 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix\n"); 647 | print ("Started at ",ltime(),"\n"); 648 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix"); 649 | print ("Ended at ",ltime(),"\n"); 650 | 651 | print ("Removing temporary files\n"); 652 | system("rm -rf oneHugeFile* dummy.txt"); 653 | } 654 | 655 | sub step { 656 | 657 | my $step = $_[0]; 658 | 659 | if ($step == 1 || $step eq "all" || $step eq "All" || $step eq "ALL") { 660 | 661 | print ("Running step 1:\n"); 662 | print ("Downloading per organism specific files and keep the original organism names for future reuse\n"); 663 | print ("Downloading the two reference genomes from UCSC and get rid of unknown, random and alt contigs\n"); 664 | 665 | system("mkdir -p $referenceGenomesDir"); 666 | downloadrefgenome($referenceGenomesDir, $ENV{'ref1'}); 667 | downloadrefgenome($referenceGenomesDir, $ENV{'ref2'}); 668 | 669 | system("mkdir -p $chainsDir"); 670 | downloadliftoverfiles($chainsDir, $ENV{'ref1'}, $ENV{'ref2'}); 671 | downloadliftoverfiles($chainsDir, $ENV{'ref2'}, $ENV{'ref1'}); 672 | 673 | system("mkdir -p $ensemblDir"); 674 | downloadensmblfiles($ensemblDir, $ENV{'releaseNo'}, $ENV{'org1EnsemblName'}, $ENV{'org1EnsemblMartName'}, $ENV{'org2EnsemblName'}, $ENV{'org2EnsemblMartName'}); 675 | print ("---------------------- Step 1 Finished ----------------------\n"); 676 | } 677 | 678 | if ($step == 2 || $step eq "all" || $step eq "All" || $step eq "ALL") { 679 | 680 | print ("Running step 2:\n"); 681 | print ("Initialize the genomedata archives with the genomes of org1 and org2\n"); 682 | print ("Make sure genomedata is installed first\n"); 683 | print ("Installation: pip install genomedata --user\n"); 684 | system("mkdir -p $genomedataDir"); 685 | 686 | genomedataarchive($genomedataDir, "org1", $ENV{'ref1'}, $referenceGenomesDir); 687 | genomedataarchive($genomedataDir, "org2", $ENV{'ref2'}, $referenceGenomesDir); 688 | print ("---------------------- Step 2 
Finished ----------------------\n"); 689 | } 690 | 691 | if ($step == 3 || $step eq "all" || $step eq "All" || $step eq "ALL") { 692 | print ("Running step 3:\n"); 693 | print ("Creating pickle files\n"); 694 | parseAndPicklePerPair($ENV{'EXTRAMAPPER_DIR'}, $ensemblDir, $dataDirPerPair, $GTFsummaryDir, $perGenePairPickleDir); 695 | print ("---------------------- Step 3 Finished ----------------------\n"); 696 | } 697 | 698 | if ($step == 4 || $step eq "all" || $step eq "All" || $step eq "ALL") { 699 | print ("Running step 4:\n"); 700 | print ("liftOver the exon lists but this time allow multiple mappings and also compute intersections with the other set of exons\n"); 701 | system("mkdir -p $liftOverFilesDir"); 702 | system("mkdir -p preprocess/bin"); 703 | if (!-e "./preprocess/bin/liftOver") { 704 | system("ln -s \$(readlink $ENV{liftOver}) ./preprocess/bin"); 705 | } 706 | liftoverexonmultiplemapping($GTFsummaryDir, $liftOverFilesDir, $chainsDir, $ENV{'ref1'}, $ENV{'ref2'}, $ENV{'EXTRAMAPPER_DIR'}); 707 | print ("---------------------- Step 4 Finished ----------------------\n"); 708 | } 709 | 710 | if ($step == 5 || $step eq "all" || $step eq "All" || $step eq "ALL") { 711 | print ("Running step 5:\n"); 712 | print ("Putting together, sorting, making them uniq and then splitting into one file per exon for all the liftover files created so far\n"); 713 | liftoverfilesprocess($liftOverFilesDir, $perExonLiftoverDir, 0, $ENV{'EXTRAMAPPER_DIR'}); 714 | print ("---------------------- Step 5 Finished ----------------------\n"); 715 | } 716 | 717 | if ($step == 6 || $step eq "all" || $step eq "All" || $step eq "ALL") { 718 | print ("Running step 6:\n"); 719 | print ("Putting together, sorting, making them uniq and then splitting into one file per exon for all the liftover files created for UNMAPPED EXONS so far\n"); 720 | liftoverfilesprocessunmappedexons($liftOverFilesDir, $perExonLiftoverDir, 0, $ENV{'EXTRAMAPPER_DIR'}); 721 | print ("---------------------- Step 6 Finished ----------------------\n"); 722 | } 723 | 724 | if ($step == 7 || $step eq "all" || $step eq "All" || $step eq "ALL") { 725 | print ("Runing step 7:\n"); 726 | print ("Putting together, sorting, making them uniq and then splitting into one file per exon for all the liftover files for MAPPED EXONS that DO NOT INTERSECT WITH AN EXON so far\n"); 727 | liftoverfilesprocessmappedexons($liftOverFilesDir, $perExonLiftoverDir, 0, $ENV{'EXTRAMAPPER_DIR'}); 728 | print ("---------------------- Step 7 Finished ----------------------\n"); 729 | print ("Preporcessing steps finished!\n"); 730 | } 731 | } 732 | 733 | step($step); 734 | -------------------------------------------------------------------------------- /Human-Monkey-Processed-Data/scripts/ensemblUtils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################## 3 | ### To use the functions in this lib simply import this python module using 4 | ### import ensemblUtils 5 | ### Then you'll able able to call functions with the proper arguments using 6 | ### returnVal=ensemblUtils.func1(arg1,arg2) 7 | ############################################################################## 8 | ############################################################################## 9 | import sys 10 | import os 11 | 12 | ############################################################## 13 | # Genomedata is off by one, seq[0] returns you the 1st bp. 
14 | # Have to account for this by subtracting one from each seq coordinate 15 | # before retrieving sequence from Genomedata. 16 | # installation: pip install genomedata --user 17 | from genomedata import Genome 18 | ############################################################## 19 | 20 | import string 21 | import math 22 | import gzip 23 | import _pickle as pickle 24 | import numpy as np 25 | 26 | complement = str.maketrans('atcgn', 'tagcn') 27 | 28 | def parse_ensembl_geneAndProtein_pairings(infilename,proteinToGeneDic,proteinPairsDic): 29 | """ 30 | This function parses a given Ensembl file downloaded from below folders 31 | and parses out the gene and protein pairings. 32 | For the first files, say org1 to org2, it only reports protein pairings. 33 | For the second file it reports back the protein and gene pairings using 34 | the protein pairings from the first proteinPairingsSoFar dictionary. 35 | Fields of these files are (as far as I understand): 36 | someScore chr notsure start end orthologytype 37 | 38 | 0.14740 MT 192515 3307 4262 ortholog_one2one 73.39810 Euarchontoglires 77 39 | ENSG00000198888 ENSP00000354687 ENSMUSP00000080991 77 0 40 | 41 | """ 42 | sys.stderr.write("Parsing gene and protein pairings from file "+infilename+"\n") 43 | 44 | isFirstFile=True 45 | if bool(proteinToGeneDic): # empty dic evaluates to False 46 | isFirstFile=False 47 | 48 | genePairsDic={} 49 | if infilename.endswith(".gz"): 50 | infile=gzip.open(infilename,'rb') 51 | else: 52 | infile=open(infilename,'r') 53 | # 54 | 55 | # it doesn't have a header 56 | lineCount=0 57 | for line in infile: 58 | words=line.rstrip().split() 59 | someScore,chr,notsure,st,en,orthologyType,someOtherScore,phylo,\ 60 | gene1PercentIdentity,gene1,protein1,protein2,genename,gene2PercentIdentity,notsure=words ## genename variable added 61 | 62 | # skip the chromosomes that are not 1..23 or X or Y 63 | if chr=="MT" or len(chr)>2: 64 | continue 65 | 66 | if "ortholog" not in orthologyType: 67 | continue 68 | 69 | proteinToGeneDic[protein1]=gene1 70 | if protein1 not in proteinPairsDic: 71 | proteinPairsDic[protein1]=[] 72 | proteinPairsDic[protein1].append(protein2) 73 | 74 | if not isFirstFile: 75 | if protein2 not in proteinToGeneDic: # second gene is not from a 1,22 or X,Y chr 76 | #print [chr,gene1,protein1,protein2] 77 | continue 78 | gene2=proteinToGeneDic[protein2] 79 | # below checks ensure I only get one entry per g1-g2 pair 80 | if gene1 not in genePairsDic: 81 | genePairsDic[gene1]=[] 82 | genePairsDic[gene1].append([gene2,orthologyType,gene1PercentIdentity,gene2PercentIdentity]) 83 | else: 84 | if gene2 not in [a[0] for a in genePairsDic[gene1]]: 85 | genePairsDic[gene1].append([gene2,orthologyType,gene1PercentIdentity,gene2PercentIdentity]) 86 | # 87 | # 88 | types={} 89 | for g1 in genePairsDic: 90 | type=genePairsDic[g1][0][1] 91 | num=len(genePairsDic[g1]) 92 | if type=="ortholog_one2one" and num>1: 93 | sys.exit("Matching should be one2one but it isn't\t"+g1) 94 | if type not in types: 95 | types[type]=0 96 | types[type]+=1 97 | # 98 | infile.close() 99 | 100 | sys.stderr.write("Gene pair mappings summary: "+repr(types)+"\n\n") 101 | 102 | return proteinToGeneDic,genePairsDic,proteinPairsDic 103 | 104 | def parse_ensembl_gene_pairings(infilename): 105 | """ 106 | This function parses a given Ensembl file (comma seperated) 107 | that matches the genes of the first organism to the second. 
108 | """ 109 | sys.stderr.write("Parsing gene pairings from file "+infilename+"\n") 110 | 111 | genePairsDic={} 112 | if infilename.endswith(".gz"): 113 | infile=gzip.open(infilename,'rb') 114 | else: 115 | infile=open(infilename,'r') 116 | # 117 | ######################################################################### 118 | ## Commented by Abhijit 119 | ## The file header is changed so the following section needs to be recoded 120 | ## 121 | ##columnIndices={"GeneID1" : -1, "GeneID2" : -1, "Homology Type" : -1, "Orthology confidence": -1, "Percent Identity" : -1, "Chromosome Name" : []} 122 | ## parse the information header first 123 | ##columnNames=infile.readline().strip().split(",") 124 | ##i=0 #0-based column indices 125 | ##for c in columnNames: 126 | ## if c=="Ensembl Gene ID": 127 | ## columnIndices["GeneID1"]=i 128 | ## elif c.endswith("Ensembl Gene ID"): 129 | ## columnIndices["GeneID2"]=i 130 | ## elif c=="Homology Type": 131 | ## columnIndices["Homology Type"]=i 132 | ## elif "Orthology confidence" in c: 133 | ## columnIndices["Orthology confidence"]=i 134 | ## elif c.endswith("Identity with respect to query gene"): 135 | ## columnIndices["Percent Identity"]=i 136 | ## elif c.endswith("Chromosome Name"): 137 | ## columnIndices["Chromosome Name"].append(i) 138 | ## i+=1 139 | ######################################################################### 140 | 141 | ######## Rewritten ######## 142 | columnIndices={"GeneID1" : -1, "GeneID2" : -1, "Homology Type" : -1, "Orthology confidence": -1, "Percent Identity" : -1, "Chromosome Name1" : -1, "Chromosome Name2" : -1} 143 | ## parse the information header first 144 | columnNames=infile.readline().strip().split(",") 145 | i=0 #0-based column indices 146 | for c in columnNames: 147 | if c=="GeneID1": 148 | columnIndices["GeneID1"]=i 149 | elif c=="GeneID2": 150 | columnIndices["GeneID2"]=i 151 | elif c=="Homology Type": 152 | columnIndices["Homology Type"]=i 153 | elif c=="Orthology confidence": 154 | columnIndices["Orthology confidence"]=i 155 | elif c=="Percent Identity": 156 | columnIndices["Percent Identity"]=i 157 | elif c=="Chromosome Name1": 158 | columnIndices["Chromosome Name1"]=i 159 | elif c=="Chromosome Name2": 160 | columnIndices["Chromosome Name2"]=i 161 | i+=1 162 | # 163 | lineCount=0 164 | for line in infile: 165 | words=line.rstrip().split(",") 166 | # skip the chromosomes that are not 1..23 or X or Y 167 | ##ch1,ch2 = words[columnIndices["Chromosome Name"][0]],words[columnIndices["Chromosome Name"][1]] 168 | ch1,ch2 = words[columnIndices["Chromosome Name1"]],words[columnIndices["Chromosome Name2"]] 169 | if ch1=="MT" or len(ch1)>2 or ch2=="MT" or len(ch2)>2: 170 | continue 171 | # 172 | gene1,gene2= words[columnIndices["GeneID1"]], words[columnIndices["GeneID2"]] 173 | homologyType= words[columnIndices["Homology Type"]] 174 | orthologyConfidence=words[columnIndices["Orthology confidence"]] 175 | percentIdentity=words[columnIndices["Percent Identity"]] 176 | # below checks ensure I only get one entry per g1-g2 pair 177 | if gene1 not in genePairsDic: 178 | genePairsDic[gene1]=[] 179 | genePairsDic[gene1].append([gene2,homologyType,orthologyConfidence,percentIdentity]) 180 | else: 181 | if gene2 not in [a[0] for a in genePairsDic[gene1]]: 182 | genePairsDic[gene1].append([gene2,homologyType,orthologyConfidence,percentIdentity]) 183 | # 184 | # 185 | lineCount+=1 186 | # 187 | #print len(genePairsDic) 188 | types={} 189 | for g1 in genePairsDic: 190 | type=genePairsDic[g1][0][1] 191 | num=len(genePairsDic[g1]) 192 | 
if type=="ortholog_one2one" and num>1: 193 | sys.exit("Matching should be one2one but it isn't\t"+g1) 194 | if type not in types: 195 | types[type]=0 196 | types[type]+=1 197 | # 198 | infile.close() 199 | 200 | sys.stderr.write("Gene pair mappings summary: "+repr(types)+"\n\n") 201 | 202 | return genePairsDic 203 | 204 | 205 | def parse_organism_GTF(orgID, infilename, outdir): 206 | """ 207 | This function parses a given Ensembl GTF file into the 208 | internal data structure for that organism. 209 | Does not output anything if outdir is "None" 210 | """ 211 | sys.stderr.write("Parsing organism GTF for "+orgID+" from file "+infilename+"\n") 212 | geneDic={} 213 | transcriptDic={} 214 | exonDic={} 215 | infoDic={} # build, version, accession 216 | if infilename.endswith(".gz"): 217 | infile=gzip.open(infilename,'rb') 218 | else: 219 | infile=open(infilename,'r') 220 | # parse the information header first 221 | elemCounts={"CDS" : 0, "exon" : 0, "gene" : 0, "start_codon" : 0, 222 | "stop_codon" : 0, "transcript" : 0, "UTR" : 0, "other" : 0} 223 | lineCount=0 224 | lastReadExon="dummy" 225 | for line in infile: 226 | if line.startswith("#"): 227 | key, item = line.split()[:2] 228 | infoDic[key.split("-")[-1]]=item 229 | else: 230 | elemType=line.rstrip().split("\t")[2] 231 | chrName=line.rstrip().split("\t")[0] 232 | if chrName=="MT" or len(chrName)>2: 233 | #print chrName 234 | continue 235 | if elemType=="gene": 236 | newGene=EnsemblGene(line) 237 | geneDic[newGene.basicInfoDic["gene_id"]]=newGene 238 | elif elemType=="transcript": 239 | newTranscript=EnsemblTranscript(line) 240 | transcriptDic[newTranscript.basicInfoDic["transcript_id"]]=newTranscript 241 | geneId=newTranscript.basicInfoDic["gene_id"] 242 | geneDic[geneId].add_transcript(newTranscript) 243 | elif elemType=="exon": 244 | newExon=EnsemblExon(line) 245 | lastReadExon=newExon.basicInfoDic["exon_id"] 246 | 247 | # DELETEME!! 248 | #print("ALL_EXON_ENTRY\t%s\n" % (newExon.get_summary_string())), 249 | 250 | # Make sure to store certain information about the exon if it appears multiple times 251 | if lastReadExon not in exonDic: 252 | exonDic[lastReadExon]=newExon 253 | else: 254 | # DELETEME!! 255 | #print("DUPLICATE_EXON_ENTRY\t%s\t%s\n" % (exonDic[lastReadExon].get_summary_string(),newExon.get_summary_string())), 256 | exonDic[lastReadExon].add_another_instance(newExon) 257 | # 258 | geneId=newExon.basicInfoDic["gene_id"] 259 | transcriptId=newExon.basicInfoDic["transcript_id"] 260 | ## Add the exon to a gene, don't care about ordering and simply owerwrite if exists 261 | geneDic[geneId].add_exon(newExon) 262 | ## Add to exon to a transcript, make sure this exon insertion is ordered 263 | ## Luckily entrys come ordered! 264 | ## also same exon doesn't appear twice in once transcript so that case is not handled 265 | transcriptDic[transcriptId].add_exon(newExon) 266 | ## no need for below line because Python is Pass-by-object-reference 267 | #geneDic[geneId].transcripts[transcriptId].add_exon(newExon) 268 | elif elemType=="CDS": 269 | #meaning previously read exon is completely/partially coding 270 | newLocus=EnsemblLocus(line) 271 | transcriptDic[transcriptId].handle_CDS(newLocus) 272 | exonDic[lastReadExon].handle_CDS(newLocus) 273 | # DELETEME!! 
274 | #print "handleCDS\t%s\t%s\n" % (lastReadExon,newLocus.get_summary_string()), 275 | elif elemType=="UTR" or elemType=="stop_codon" or elemType=="start_codon": 276 | newLocus=EnsemblLocus(line) 277 | transcriptDic[transcriptId].add_locus(newLocus,elemType) 278 | # 279 | if elemType not in elemCounts: 280 | elemType="other" 281 | elemCounts[elemType]=elemCounts[elemType]+1 282 | lineCount+=1 283 | if lineCount%100000==0: 284 | sys.stderr.write(str(lineCount)+"\t") 285 | # 286 | # 287 | sys.stderr.write("\n") 288 | sys.stderr.write("GTF parsing summary: " +repr(elemCounts)+"\n\n") 289 | infile.close() 290 | if outdir!="None": 291 | print_some_summary(orgID, geneDic,transcriptDic,exonDic,elemCounts, outdir) 292 | return (geneDic,transcriptDic,exonDic,infoDic) 293 | 294 | 295 | def print_some_summary(orgID, geneDic,transcriptDic,exonDic,elemCounts, outdir): 296 | """ 297 | Print a summary for the genes, transcripts and exons in the 298 | read GTF file. 299 | """ 300 | outfile=open(outdir+"/"+orgID+"-allGenes-GTFparsed.txt",'w') 301 | outfile.write("chrName\tstartCoord\tendCoord\tstrand\tgeneID\tgeneName\tgeneType\tnoOfTranscripts\tnoOfExons\telementType\n") 302 | for g in geneDic: 303 | outfile.write(geneDic[g].get_summary_string()+"\n") 304 | #print geneDic[g].get_summary_string() 305 | # 306 | outfile.close() 307 | 308 | totalNumberOfExons=0 309 | outfile=open(outdir+"/"+orgID+"-allTranscripts-GTFparsed.txt",'w') 310 | outfile.write("chrName\tstartCoord\tendCoord\tstrand\ttranscriptID\ttranscriptName\ttranscriptType\tgeneID\tgeneName\texonIDs\texonTypes\texonStarts\texonEnds\tcodingStarts\tcodingEnds\tstartCodon\tstopCodon\tUTRstarts\tUTRends\tproteinID\telementType\n") 311 | for t in transcriptDic: 312 | outfile.write(transcriptDic[t].get_summary_string()+"\n") 313 | totalNumberOfExons+=len(transcriptDic[t].exon_types) 314 | #print transcriptDic[t].get_summary_string() 315 | # 316 | outfile.close() 317 | # print ["totalNumberOfExons", totalNumberOfExons] 318 | 319 | outfile=open(outdir+"/"+orgID+"-allExons-GTFparsed.txt",'w') 320 | outfile.write("chrName\tstartCoord\tendCoord\tstrand\texonID\texonType\tcodingStart\tcodingEnd\ttranscriptIDs\texonNumbers\tgeneID\texonLength\tacceptor2bp\tdonor2bp\tavgCodingConsScore\tavgConsScore\tfirstMidLastCounts\telementType\n") 321 | for e in exonDic: 322 | outfile.write(exonDic[e].get_summary_string()+"\n") 323 | #print exonDic[e].get_summary_string() 324 | # 325 | outfile.close() 326 | 327 | return 328 | 329 | def overlapping_combined( orig_data, reverse = False): 330 | """ 331 | Return list of intervals with overlapping neighbours merged together 332 | Assumes sorted intervals unless reverse is set 333 | 334 | """ 335 | if not orig_data or not len(orig_data): return [] 336 | if len(orig_data) == 1: 337 | return orig_data 338 | 339 | new_data = [] 340 | 341 | if reverse: 342 | data = orig_data[:] 343 | data.reverse() 344 | else: 345 | data = orig_data 346 | 347 | if not data[0][0] <= data[1][0]: 348 | print((data, reverse)) 349 | assert(data[0][0] <= data[1][0]) 350 | 351 | # start with the first interval 352 | prev_beg, prev_end = data[0] 353 | 354 | # check if any subsequent intervals overlap 355 | for beg, end in data[1:]: 356 | if beg - prev_end + 1 > 0: 357 | new_data.append((prev_beg, prev_end)) 358 | prev_beg = beg 359 | prev_end = max(end, prev_end) 360 | 361 | new_data.append((prev_beg, prev_end)) 362 | 363 | if reverse: 364 | new_data.reverse() 365 | return new_data 366 | 367 | 368 | def get_overlap_between_intervals(a, b): 369 | """ 370 
| Finds the overlap between two intervals, end points inclusive.
371 | #Makes sure not to report overlap beyond either interval length.
372 | # a=[10,20]; b=[10,20] --> f(a,b)=11 (end points inclusive)
373 | # a=[10,20]; b=[20,30] --> f(a,b)=1
374 | a=[10,20]; b=[15,30] --> f(a,b)=6
375 | """
376 | #lena=abs(float(a[1])-float(a[0]))
377 | #lenb=abs(float(b[1])-float(b[0]))
378 | overlap=max(0, min(float(a[1]), float(b[1])) - max(float(a[0]), float(b[0]))+1)
379 | #minlen=min(lena,lenb)
380 | #return min(minlen,overlap)
381 | return overlap
382 | 
383 | def sort_by_column(somelist, n):
384 | """
385 | Given a list with 1 or more columns this function sorts it according
386 | to the desired column n [0, len(list)). Does this in-place.
387 | """
388 | somelist[:] = [(x[n], x) for x in somelist]
389 | somelist.sort()
390 | somelist[:] = [val for (key, val) in somelist]
391 | return
392 | 
393 | def chr_name_conversion(chrIn,org):
394 | """
395 | Given an identifier for the chromosome name (str) or a chromosome number (int)
396 | and an organism this function converts the identifier to the other representation.
397 | Example:
398 | converts 'chrX' or 'X' to 23 for human
399 | converts 23 to 'chrX' and 1 to 'chr1' for human
400 | """
401 | if isinstance(chrIn, int): # int to str
402 | if org=='human':
403 | if chrIn<23 and chrIn>0:
404 | chrOut='chr'+str(chrIn)
405 | elif chrIn==23:
406 | chrOut='chrX'
407 | elif chrIn==24:
408 | chrOut='chrY'
409 | else:
410 | return 'problem'
411 | elif org=='mouse':
412 | if chrIn<20 and chrIn>0:
413 | chrOut='chr'+str(chrIn)
414 | elif chrIn==20:
415 | chrOut='chrX'
416 | elif chrIn==21:
417 | chrOut='chrY'
418 | else:
419 | return 'problem'
420 | else:
421 | chrOut='chr'+str(chrIn)
422 | else: # str to int
423 | if 'chr' in chrIn:
424 | chrIn=chrIn[3:] # cut the 'chr' prefix
425 | if org=='human':
426 | if chrIn=='X':
427 | chrOut=23
428 | elif chrIn=='Y':
429 | chrOut=24
430 | else:
431 | chrOut=int(chrIn)
432 | elif org=='mouse':
433 | if chrIn=='X':
434 | chrOut=20
435 | elif chrIn=='Y':
436 | chrOut=21
437 | else:
438 | chrOut=int(chrIn)
439 | return chrOut
440 | 
441 | 
442 | 
443 | ################################# BEGIN ExtendedExon ##################################
444 | ### NOT USED FOR NOW, NOT YET IMPLEMENTED ####
445 | class ExtendedExon:
446 | """
447 | This class is a container for exons that combines input from multiple different
448 | sources/files. Below is a list of these sources:
449 | - Ensembl Exon: This will initiate the instance of the ExtendedExon class
450 | - LiftOver Files:
451 | - Genomedata Archive:
452 | - PhastCons Scores:
453 | - BLAT within species:
454 | """
455 | def __init__(self, ensemblExon):
456 | self.basicInfoDic= ensemblExon.basicInfoDic
457 | ################################# END ExtendedExon ##################################
458 | 
459 | 
460 | 
461 | ################################# BEGIN EnsemblExon ##################################
462 | class EnsemblExon:
463 | """
464 | This class is a container for Ensembl exons
465 | """
466 | def __init__(self, line):
467 | # parse the exon line
468 | chr,d,elemType,start_coord,end_coord,d,strand,d,others=line.rstrip().split("\t")
469 | if elemType!="exon":
470 | sys.exit("Not an exon parsed in class EnsemblExon:\t"+elemType)
471 | # 
472 | #basic information about the exon
473 | self.basicInfoDic={"chromosome" : chr, "start_coord" : int(start_coord), "end_coord" : int(end_coord), "strand" : strand}
474 | 
475 | # there are 13 or more items for exons. We keep only 7 relevant ones. 
476 | # e.g: gene_id "ENSG00000167468"; gene_version "14"; 477 | # transcript_id "ENST00000593032"; transcript_version "3"; exon_number "2"; 478 | # gene_name "GPX4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; 479 | # transcript_name "GPX4-006"; transcript_source "havana"; transcript_biotype "protein_coding"; 480 | # exon_id "ENSE00003420595"; exon_version "1"; tag "seleno"; tag "cds_end_NF"; 481 | items=others.replace('"', '').split(";") 482 | for item in items: 483 | wds=item.lstrip().split() 484 | if len(wds)>1: 485 | key, val = wds[0],wds[1] 486 | if key in ["gene_id", "gene_name", "transcript_id", "transcript_name", "transcript_biotype", "exon_id", "exon_number"]: 487 | self.basicInfoDic[key]=val 488 | # 489 | 490 | self.exon_type = "nonCoding" # by default 491 | self.codingExon = [-1,-1] # the coordinates of below codingExon will change if partialCoding or coding 492 | self.transcriptIds=[self.basicInfoDic["transcript_id"]] 493 | self.exonNumbers=[int(self.basicInfoDic["exon_number"])] 494 | self.acceptor2bp="NN" 495 | self.donor2bp="NN" 496 | self.phastConsScores=[] 497 | self.avgConsScore=0 498 | self.avgCodingConsScore=0 499 | self.firstMidLast=[0,0,0] # appearences of this exon as first, mid and last exons. Single exons are counted as first and last. 500 | # 501 | 502 | def handle_CDS(self,newLocus): 503 | lastSt, lastEn=self.basicInfoDic["start_coord"], self.basicInfoDic["end_coord"] 504 | newSt, newEn= newLocus.basicInfoDic["start_coord"], newLocus.basicInfoDic["end_coord"] 505 | if lastSt==newSt and lastEn==newEn: 506 | self.exon_type="fullCoding" 507 | elif get_overlap_between_intervals([lastSt,lastEn], [newSt,newEn])>0: 508 | self.exon_type="partCoding" 509 | else: 510 | sys.exit("Reached a CDS entry that doesn't overlap with previous exon\t"\ 511 | +newLocus.get_summary_string()+"\n") 512 | # 513 | self.codingExon=[newSt,newEn] 514 | # 515 | def add_another_instance(self,newExon): 516 | # an exon may appear in only one gene but for many different transcripts 517 | self.transcriptIds.append(newExon.basicInfoDic["transcript_id"]) 518 | self.exonNumbers.append(int(newExon.basicInfoDic["exon_number"])) 519 | # 520 | # data containers within this class 521 | __slots__ = ["basicInfoDic", "exon_type", "codingExon", "transcriptIds", "exonNumbers", \ 522 | "acceptor2bp", "donor2bp", "phastConsScores", "avgCodingConsScore", "avgConsScore", "firstMidLast"] 523 | 524 | # get one liner summary of the given instance 525 | def get_summary_string(self): 526 | summary="chr"+self.basicInfoDic["chromosome"]+"\t"+str(self.basicInfoDic["start_coord"])+"\t"+\ 527 | str(self.basicInfoDic["end_coord"])+"\t"+self.basicInfoDic["strand"]+"\t"+self.basicInfoDic["exon_id"]+"\t"+\ 528 | self.exon_type +"\t"+str(self.codingExon[0])+"\t"+str(self.codingExon[1])+"\t"+\ 529 | ",".join(self.transcriptIds)+"\t"+",".join([str(e) for e in self.exonNumbers])+"\t"+\ 530 | self.basicInfoDic["gene_id"]+"\t"+str(abs(self.basicInfoDic["end_coord"]-self.basicInfoDic["start_coord"]))+"\t"+\ 531 | self.acceptor2bp+"\t"+self.donor2bp+"\t"+str(self.avgCodingConsScore)+"\t"+str(self.avgConsScore)+"\t"+\ 532 | ",".join([str(e) for e in self.firstMidLast])+"\texon" 533 | #self.basicInfoDic["transcript_id"]+"\t"+self.basicInfoDic["exon_number"]+"\t"+\ 534 | return summary 535 | ################################# END EnsemblExon ################################## 536 | 537 | ################################# BEGIN EnsemblLocus ################################## 538 | class EnsemblLocus: 539 | """ 540 | This 
class is a container for a basic locus that has chr, start, end, strand 541 | fields. UTRs, start and stop codons from Ensembl are of this type. 542 | """ 543 | def __init__(self, line): 544 | # parse the locus line 545 | chr,d,elemType,start_coord,end_coord,d,strand,d,others=line.rstrip().split("\t") 546 | if elemType!="UTR" and elemType!="stop_codon" and elemType!="start_codon" and elemType!="CDS": 547 | sys.exit("Not a basic locus as intended parsed in from Ensemble line:\t"+line) 548 | # 549 | #basic information about the locus 550 | self.basicInfoDic={"chromosome" : chr, "start_coord" : int(start_coord), \ 551 | "end_coord" : int(end_coord), "strand" : strand, "locus_type" : elemType} 552 | 553 | items=others.replace('"', '').split(";") 554 | for item in items: 555 | wds=item.lstrip().split() 556 | if len(wds)>1: 557 | key, val = wds[0],wds[1] 558 | if key in ["gene_id", "gene_name", "transcript_id", "transcript_name", "transcript_biotype", "protein_id"]: 559 | self.basicInfoDic[key]=val 560 | # 561 | 562 | 563 | # 564 | 565 | __slots__ = ["basicInfoDic"] 566 | 567 | # get one liner summary of the given instance 568 | def get_summary_string(self): 569 | summary="chr"+self.basicInfoDic["chromosome"]+"\t"+str(self.basicInfoDic["start_coord"])+"\t"+\ 570 | str(self.basicInfoDic["end_coord"])+"\t"+self.basicInfoDic["strand"]+"\t"+self.basicInfoDic["locus_type"]+"\t"+\ 571 | self.basicInfoDic["transcript_id"]+"\t"+self.basicInfoDic["transcript_name"]+"\t"+\ 572 | self.basicInfoDic["transcript_biotype"]+"\t"+self.basicInfoDic["gene_id"]+"\t"+self.basicInfoDic["gene_name"]+"\tlocus" 573 | return summary 574 | ################################# END EnsemblLocus ################################## 575 | 576 | 577 | ################################# BEGIN EnsemblTranscript ################################## 578 | class EnsemblTranscript: 579 | """ 580 | This class is a container for Ensembl transcripts 581 | """ 582 | def __init__(self, line): 583 | # parse the transcript line 584 | chr,d,elemType,start_coord,end_coord,d,strand,d,others=line.rstrip().split("\t") 585 | if elemType!="transcript": 586 | sys.exit("Not a transcript parsed in class EnsemblTranscript:\t"+elemType) 587 | # 588 | #basic information about the transcript 589 | self.basicInfoDic={"chromosome" : chr, "start_coord" : int(start_coord), "end_coord" : int(end_coord), "strand" : strand} 590 | 591 | # there are 10 or more items for transcripts. We keep only 5 relevant ones. 
592 | # e.g: gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; 593 | # transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 594 | # transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; 595 | # 596 | 597 | items=others.replace('"', '').split(";") 598 | for item in items: 599 | wds=item.lstrip().split() 600 | if len(wds)>1: 601 | key, val = wds[0],wds[1] 602 | #key, val = (item.lstrip().split())[0:2] 603 | if key in ["gene_id", "gene_name", "transcript_id", "transcript_name", "transcript_biotype"]: 604 | self.basicInfoDic[key]=val 605 | # 606 | if "gene_name" not in self.basicInfoDic: 607 | self.basicInfoDic["gene_name"]=self.basicInfoDic["gene_id"] 608 | if "transcript_name" not in self.basicInfoDic: 609 | self.basicInfoDic["transcript_name"]=self.basicInfoDic["transcript_id"] 610 | 611 | self.start_codon=["cds_start_NF"] # by default make them non-confirmed 612 | self.stop_codon=["cds_stop_NF"] # by default make them non-confirmed 613 | self.exons = [] 614 | self.codingExons = [] 615 | self.exon_types= [] 616 | self.protein_id="None" 617 | self.UTRs=[] 618 | # 619 | 620 | # adding an exon to the list of exons of the transcript "in order" 621 | def add_exon(self,newExon): 622 | exonCountSoFar=len(self.exons) 623 | #print "to add\t"+str(newExon.basicInfoDic["exon_number"])+"\t"+str(newExon.basicInfoDic["exon_id"]) 624 | if int(newExon.basicInfoDic["exon_number"])==exonCountSoFar+1: 625 | exonEntry=[newExon.basicInfoDic["start_coord"],newExon.basicInfoDic["end_coord"], 626 | newExon.basicInfoDic["exon_id"]] 627 | self.exons.append(exonEntry) 628 | # the coordinates of below codingExon will change if partialCoding or coding 629 | self.codingExons.append([-1,-1]) 630 | exonType="nonCoding" # by default 631 | self.exon_types.append(exonType) 632 | else: 633 | sys.exit("Exon entry is being entered out of order to the transcript\t" 634 | +self.basicInfoDic["transcript_id"]) 635 | # 636 | # 637 | def add_locus(self,newLocus,locus_type): 638 | if locus_type=='start_codon': 639 | self.start_codon=[newLocus.basicInfoDic["start_coord"],newLocus.basicInfoDic["end_coord"]] 640 | elif locus_type=='stop_codon': 641 | self.stop_codon=[newLocus.basicInfoDic["start_coord"],newLocus.basicInfoDic["end_coord"]] 642 | elif locus_type=='UTR': 643 | self.UTRs.append([newLocus.basicInfoDic["start_coord"],newLocus.basicInfoDic["end_coord"]]) 644 | else: 645 | sys.exit("Unknow locus type being inserted to transcript\t"\ 646 | +self.basicInfoDic["transcript_id"]) 647 | # 648 | 649 | def handle_CDS(self,newLocus): 650 | exonType="nonCoding" # by default 651 | exonCountSoFar=len(self.exons) 652 | lastAddedExon=self.exons[exonCountSoFar-1] 653 | lastSt,lastEn=self.exons[exonCountSoFar-1][0:2] 654 | newSt, newEn= newLocus.basicInfoDic["start_coord"], newLocus.basicInfoDic["end_coord"] 655 | if lastSt==newSt and lastEn==newEn: 656 | exonType="fullCoding" 657 | elif get_overlap_between_intervals([lastSt,lastEn], [newSt,newEn])>0: 658 | exonType="partCoding" 659 | else: 660 | sys.exit("Reached a CDS entry that doesn't overlap with previous exon\t"\ 661 | +newLocus.get_summary_string()+"\n") 662 | # 663 | self.codingExons[exonCountSoFar-1]=[newSt,newEn] 664 | self.exon_types[exonCountSoFar-1]=exonType # replace with the previous nonCoding tag 665 | self.protein_id=newLocus.basicInfoDic["protein_id"] 666 | # 667 | 668 | # data containers within this class 669 | __slots__ = [ 670 | 
"basicInfoDic", 671 | "start_codon", 672 | "stop_codon", 673 | "exons", 674 | "codingExons", 675 | "exon_types", 676 | "protein_id", 677 | "UTRs" 678 | ] 679 | # get one liner summary of the given instance 680 | def get_summary_string(self): 681 | if len(self.UTRs)==0: 682 | self.UTRs.append(["None","None"]) 683 | # 684 | summary="chr"+self.basicInfoDic["chromosome"]+"\t"+str(self.basicInfoDic["start_coord"])+"\t"+\ 685 | str(self.basicInfoDic["end_coord"])+"\t"+self.basicInfoDic["strand"]+"\t"+self.basicInfoDic["transcript_id"]+"\t"+\ 686 | self.basicInfoDic["transcript_name"]+"\t"+self.basicInfoDic["transcript_biotype"]+"\t"+\ 687 | self.basicInfoDic["gene_id"]+"\t"+self.basicInfoDic["gene_name"]+"\t"+\ 688 | ",".join([e[2] for e in self.exons])+"\t"+",".join(self.exon_types)+"\t"+\ 689 | ",".join([str(e[0]) for e in self.exons])+"\t"+",".join([str(e[1]) for e in self.exons])+"\t"+\ 690 | ",".join([str(e[0]) for e in self.codingExons])+"\t"+",".join([str(e[1]) for e in self.codingExons])+"\t"+\ 691 | ",".join([str(i) for i in self.start_codon])+"\t"+",".join([str(i) for i in self.stop_codon])+"\t"+\ 692 | ",".join([str(e[0]) for e in self.UTRs])+"\t"+",".join([str(e[1]) for e in self.UTRs])+"\t"+\ 693 | self.protein_id+"\t"+"transcript" 694 | return summary 695 | ################################# END EnsemblTranscript ################################## 696 | 697 | 698 | 699 | ################################# BEGIN EnsemblGene ################################## 700 | class EnsemblGene: 701 | """ 702 | This class is a container for Ensembl genes. 703 | """ 704 | def __init__(self, line): 705 | # parse the gene line 706 | chr,d,elemType,start_coord,end_coord,d,strand,d,others=line.rstrip().split("\t") 707 | if elemType!="gene": 708 | sys.exit("Not a gene parsed in class EnsemblGene:\t"+elemType) 709 | # 710 | #basic information about the gene 711 | self.basicInfoDic={"chromosome" : chr, "start_coord" : int(start_coord), "end_coord" : int(end_coord), "strand" : strand} 712 | 713 | # there are 5 items for genes: 714 | # e.g: gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed"; 715 | items=others.replace('"', '').split(";") 716 | for item in items: 717 | if len(item)>1: 718 | key, val = item.lstrip().split() 719 | self.basicInfoDic[key]=val 720 | if "gene_name" not in self.basicInfoDic: 721 | self.basicInfoDic["gene_name"]=self.basicInfoDic["gene_id"] 722 | self.exons = {} 723 | self.transcripts = {} 724 | # 725 | 726 | # data containers within this class 727 | __slots__ = [ 728 | "basicInfoDic", 729 | "exons", 730 | "transcripts"] 731 | 732 | # adding a transcript to the gene 733 | def add_transcript(self,newTranscript): 734 | self.transcripts[newTranscript.basicInfoDic["transcript_id"]]=newTranscript 735 | # adding an exon to the list of exons of the gene 736 | def add_exon(self,newExon): 737 | self.exons[newExon.basicInfoDic["exon_id"]]=\ 738 | [newExon.basicInfoDic["start_coord"],newExon.basicInfoDic["end_coord"]] 739 | # get one liner summary of the given instance 740 | def get_summary_string(self): 741 | summary="chr"+self.basicInfoDic["chromosome"]+"\t"+str(self.basicInfoDic["start_coord"])+"\t"+\ 742 | str(self.basicInfoDic["end_coord"])+"\t"+self.basicInfoDic["strand"]+"\t"+self.basicInfoDic["gene_id"]+"\t"+\ 743 | self.basicInfoDic["gene_name"]+"\t"+self.basicInfoDic["gene_biotype"]+"\t"+\ 744 | str(len(self.transcripts))+"\t"+str(len(self.exons))+"\tgene" 745 | return summary 746 | # 747 | #def gene_wrap_up(): 748 | # 
self.beg = min(self.exons[e][0] for e in self.exons) 749 | # self.end = max(self.exons[e][1] for e in self.exons) 750 | # 751 | 752 | ################################# END EnsemblGene ################################## 753 | 754 | 755 | def convert_UCSC_to_bed_format(l): 756 | """ 757 | Given a locus in UCSC format this function converts it to bed format with 3 fields 758 | chr1:121-21111 --> ['chr1', 121, 21111] 759 | """ 760 | chr=l[:l.find(':')] 761 | st=int(l[l.find(':')+1:l.find('-')]) 762 | en=int(l[l.find('-')+1:]) 763 | return (chr,st,en) 764 | 765 | 766 | def consistency_check(org1TOorg2,org2TOorg1): 767 | """ 768 | Check the consistency between the two matchings (e.g. human-to-mouse, mouse-to-human) 769 | read from separate Ensembl file. This function will do nothing if all is consistent. 770 | """ 771 | for g1 in org1TOorg2: 772 | type=org1TOorg2[g1][0][1] 773 | if type=="ortholog_one2one": 774 | g2=org1TOorg2[g1][0][0] 775 | if g2 not in org2TOorg1: 776 | sys.exit("Reverse entry for a one2one match couldn't be found\t"+g1+"\t"+g2) 777 | elif org2TOorg1[g2][0][0]!=g1: 778 | sys.exit("Reverse entry for a one2one match mismatches with original one\t"+g1+"\t"+g2) 779 | # else good 780 | else: 781 | for oneMatch1 in org1TOorg2[g1]: 782 | g2=oneMatch1[0] 783 | if g2 not in org2TOorg1: 784 | sys.exit("Reverse entry for a NON-one2one match couldn't be found\t"+g1+"\t"+g2) 785 | else: 786 | reverseFound=False 787 | for oneMatch2 in org2TOorg1[g2]: 788 | if oneMatch2[0]==g1: 789 | reverseFound=True 790 | break 791 | # 792 | if reverseFound==False: 793 | sys.exit("Reverse entry for a NON-one2one match mismatches with original one\t"+g1+"\t"+g2) 794 | # else good 795 | # 796 | # 797 | for g1 in org2TOorg1: 798 | type=org2TOorg1[g1][0][1] 799 | if type=="ortholog_one2one": 800 | g2=org2TOorg1[g1][0][0] 801 | if g2 not in org1TOorg2: 802 | sys.exit("Reverse entry for a one2one match couldn't be found\t"+g1+"\t"+g2) 803 | elif org1TOorg2[g2][0][0]!=g1: 804 | sys.exit("Reverse entry for a one2one match mismatches with original one\t"+g1+"\t"+g2) 805 | # else good 806 | else: 807 | for oneMatch1 in org2TOorg1[g1]: 808 | g2=oneMatch1[0] 809 | if g2 not in org1TOorg2: 810 | sys.exit("Reverse entry for a NON-one2one match couldn't be found\t"+g1+"\t"+g2) 811 | else: 812 | reverseFound=False 813 | for oneMatch2 in org1TOorg2[g2]: 814 | if oneMatch2[0]==g1: 815 | reverseFound=True 816 | break 817 | # 818 | if reverseFound==False: 819 | sys.exit("Reverse entry for a NON-one2one match mismatches with original one\t"+g1+"\t"+g2) 820 | # else good 821 | # 822 | # 823 | return 824 | 825 | 826 | def pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir): 827 | """ 828 | Pickle the gene, transcript and exon dictionaries for each pair of ortholog_one2one genes. 829 | There are around 16.5k such genes for human-mouse and 15.8k are protein_coding pairs. 
830 | """ 831 | geneOnlyOrthoDic1,transcriptOnlyOrthoDic1,exonOnlyOrthoDic1={},{},{} 832 | geneOnlyOrthoDic2,transcriptOnlyOrthoDic2,exonOnlyOrthoDic2={},{},{} 833 | 834 | for g1 in genePairsDic: 835 | type=genePairsDic[g1][0][1] 836 | if type=="ortholog_one2one": 837 | g2=genePairsDic[g1][0][0] 838 | else: 839 | continue 840 | 841 | if geneDic1[g1].basicInfoDic["gene_biotype"]!="protein_coding" or geneDic2[g2].basicInfoDic["gene_biotype"]!="protein_coding": 842 | continue 843 | 844 | # small dictionaries that have only the relevant stuff for one gene pair 845 | newGeneDic1={}; newGeneDic2={} 846 | newExonDic1={}; newExonDic2={} 847 | newTranscriptDic1={}; newTranscriptDic2={} 848 | # 849 | newGeneDic1[g1]=geneDic1[g1] 850 | newGeneDic2[g2]=geneDic2[g2] 851 | geneOnlyOrthoDic1[g1]=geneDic1[g1] 852 | geneOnlyOrthoDic2[g2]=geneDic2[g2] 853 | for tId in geneDic1[g1].transcripts: 854 | newTranscriptDic1[tId]=transcriptDic1[tId] 855 | transcriptOnlyOrthoDic1[tId]=transcriptDic1[tId] 856 | for tId in geneDic2[g2].transcripts: 857 | newTranscriptDic2[tId]=transcriptDic2[tId] 858 | transcriptOnlyOrthoDic2[tId]=transcriptDic2[tId] 859 | # 860 | for eId in geneDic1[g1].exons: 861 | newExonDic1[eId]=exonDic1[eId] 862 | exonOnlyOrthoDic1[eId]=exonDic1[eId] 863 | for eId in geneDic2[g2].exons: 864 | newExonDic2[eId]=exonDic2[eId] 865 | exonOnlyOrthoDic2[eId]=exonDic2[eId] 866 | # 867 | os.system("mkdir -p "+ outdir+"/"+g1+"-"+g2) 868 | #print geneDic1[g1].get_summary_string()+"\t"+geneDic2[g2].get_summary_string() 869 | outfilename=outdir+"/"+g1+"-"+g2+"/org1.pickledDictionaries" 870 | pickle.dump((newGeneDic1,newTranscriptDic1,newExonDic1), open(outfilename,"wb")) 871 | outfilename=outdir+"/"+g1+"-"+g2+"/org2.pickledDictionaries" 872 | pickle.dump((newGeneDic2,newTranscriptDic2,newExonDic2), open(outfilename,"wb")) 873 | # to load use: 874 | #geneDic1,transcriptDic1,exonDic1=pickle.load(open("pickled.stuff","rb")) 875 | # 876 | 877 | return (geneOnlyOrthoDic1,transcriptOnlyOrthoDic1,exonOnlyOrthoDic1, geneOnlyOrthoDic2,transcriptOnlyOrthoDic2,exonOnlyOrthoDic2) 878 | 879 | def print_one2one_genePairs(genePairsDic, geneDic1,geneDic2,outfilename): 880 | """ 881 | Print one liner for each pair of genes that match each other one to one. 882 | There are around 16.5k such genes for human-mouse and 15.8k are protein_coding pairs. 883 | """ 884 | outfile=open(outfilename,'w') 885 | outfile.write("chrName1\tstart_coord1\tend_coord1\tstrand1\tgeneID1\tgeneName1\tgeneType1\tnoOfTranscripts1\tnoOfExons1\ttype1\t") 886 | outfile.write("chrName2\tstart_coord2\tend_coord2\tstrand2\tgeneID2\tgeneName2\tgeneType2\tnoOfTranscripts2\tnoOfExons2\ttype2\n") 887 | #print "chrName1\tstart_coord1\tend_coord1\tstrand1\tgeneID1\tgeneName1\tgeneType1\tnoOfTranscripts1\tnoOfExons1\ttype1\t", 888 | #print "chrName2\tstart_coord2\tend_coord2\tstrand2\tgeneID2\tgeneName2\tgeneType2\tnoOfTranscripts2\tnoOfExons2\ttype2" 889 | for g1 in genePairsDic: 890 | type=genePairsDic[g1][0][1] 891 | if type=="ortholog_one2one": 892 | g2=genePairsDic[g1][0][0] 893 | else: 894 | continue 895 | outfile.write(geneDic1[g1].get_summary_string()+"\t"+geneDic2[g2].get_summary_string()+"\n") 896 | #print geneDic1[g1].get_summary_string()+"\t"+geneDic2[g2].get_summary_string() 897 | # 898 | outfile.close() 899 | return 900 | 901 | def print_one2one_transcriptListPairs(genePairsDic, geneDic1,geneDic2,transcriptDic1,transcriptDic2,orgId1,orgId2,outdir): 902 | """ 903 | Print the lists of transcripts for each one to one mapped gene pair. 
904 | There are around 16.5k such genes for human-mouse and 15.8k are protein_coding pairs. 905 | """ 906 | for g1 in genePairsDic: 907 | type=genePairsDic[g1][0][1] 908 | if type=="ortholog_one2one": 909 | g2=genePairsDic[g1][0][0] 910 | else: 911 | continue 912 | # 913 | outdirTemp=outdir+"/"+g1+"-"+g2; os.system("mkdir -p "+outdirTemp) 914 | outfile1=open(outdirTemp+"/"+orgId1+"_transcripts.bed",'w') 915 | outfile2=open(outdirTemp+"/"+orgId2+"_transcripts.bed",'w') 916 | 917 | transcripts1=geneDic1[g1].transcripts 918 | transcripts2=geneDic2[g2].transcripts 919 | for t1 in transcripts1: 920 | outfile1.write(transcriptDic1[t1].get_summary_string()+"\n") 921 | for t2 in transcripts2: 922 | outfile2.write(transcriptDic2[t2].get_summary_string()+"\n") 923 | # 924 | outfile1.close() 925 | outfile2.close() 926 | # 927 | return 928 | 929 | def print_one2one_exonListPairs(genePairsDic, geneDic1,geneDic2,exonDic1,exonDic2,orgId1,orgId2,outdir): 930 | """ 931 | Print the lists of exons for each one to one mapped gene pair. 932 | There are around 16.5k such genes for human-mouse and 15.8k are protein_coding pairs. 933 | """ 934 | for g1 in genePairsDic: 935 | type=genePairsDic[g1][0][1] 936 | if type=="ortholog_one2one": 937 | g2=genePairsDic[g1][0][0] 938 | else: 939 | continue 940 | # 941 | outdirTemp=outdir+"/"+g1+"-"+g2; os.system("mkdir -p "+outdirTemp) 942 | outfile1=open(outdirTemp+"/"+orgId1+"_exons.bed",'w') 943 | outfile2=open(outdirTemp+"/"+orgId2+"_exons.bed",'w') 944 | 945 | exons1=geneDic1[g1].exons 946 | exons2=geneDic2[g2].exons 947 | for e1 in exons1: 948 | outfile1.write(exonDic1[e1].get_summary_string()+"\n") 949 | for e2 in exons2: 950 | outfile2.write(exonDic2[e2].get_summary_string()+"\n") 951 | # 952 | # print exonDic[e].get_summary_string() 953 | 954 | outfile1.close() 955 | outfile2.close() 956 | 957 | # 958 | return 959 | 960 | def extract_fasta_files_for_exons(refGD,exonDic,typ,fivePrimeFlank,threePrimeFlank,outfilename): 961 | """ 962 | With the help of genomedata archive extract the nucleotide sequences 963 | from and around each exon and write them in a .fa file. 964 | refGD is the genomedata archive created for the reference genome. 965 | typ can be one of the following: 966 | "allExon": Extract the sequence of the whole exon. 967 | "allExonPlusMinus": Like allExon but with flanking 5' and 3'. 968 | "intronExon": Extract the sequence from the juction of this 969 | exon and the previous intron. 970 | "exonIntron": Extract the sequence from the juction of this 971 | exon and the next intron. 972 | fivePrimeFlank is the amount to extract extra from the 5' end. 973 | threePrimeFlank is the amount to extract extra from the 3' end. 
974 | 975 | """ 976 | if typ=="allExon": 977 | fivePrimeFlank=0; threePrimeFlank=0 978 | # 979 | outfile=open(outfilename,'w') 980 | with Genome(refGD) as genome: 981 | for id in exonDic: 982 | e=exonDic[id] 983 | ch,st,en="chr"+e.basicInfoDic["chromosome"], e.basicInfoDic["start_coord"], e.basicInfoDic["end_coord"] 984 | strand,id=e.basicInfoDic["strand"], e.basicInfoDic["exon_id"] 985 | # off by one error fix by -1 986 | st=int(st)-1 987 | en=int(en)-1 988 | if strand=="+": 989 | if typ=="intronExon": 990 | en=st # make sure we're around the first bp of exon 991 | st=st-fivePrimeFlank # make sure 5' part is of size fivePrimeFlank including st 992 | en=en+threePrimeFlank # make sure 3' part is of size threePrimeFlank including en 993 | elif typ=="exonIntron": 994 | st=en # make sure we're around the last bp of exon 995 | st=st-fivePrimeFlank+1 996 | en=en+threePrimeFlank+1 997 | elif typ=="allExonPlusMinus" or typ=="allExon": 998 | st=st-fivePrimeFlank 999 | en=en+threePrimeFlank+1 1000 | # 1001 | id=id+"_plusStrand" 1002 | sq=genome[ch].seq[st:en].tostring().lower().upper() 1003 | else: 1004 | if typ=="intronExon": 1005 | st=en # make sure we're around the first bp of exon 1006 | en=en+fivePrimeFlank+1 1007 | st=st-threePrimeFlank+1 1008 | elif typ=="exonIntron": 1009 | en=st # make sure we're around the last bp of exon 1010 | en=en+fivePrimeFlank 1011 | st=st-threePrimeFlank 1012 | elif typ=="allExonPlusMinus" or typ=="allExon": 1013 | st=st-threePrimeFlank 1014 | en=en+fivePrimeFlank+1 1015 | # 1016 | id=id+"_minusStrand" 1017 | sq=genome[ch].seq[st:en].tostring() 1018 | #sq=sq.lower()[::-1].upper() # reverse 1019 | #sq=sq.lower().translate(complement).upper() # complement 1020 | sq=sq.lower().translate(complement)[::-1].upper() # reverse complement 1021 | # 1022 | outfile.write(">"+id+"_"+typ+"\n") 1023 | outfile.write(sq+"\n") 1024 | # 1025 | # 1026 | outfile.close() 1027 | return 1028 | 1029 | def extract_conservation_stats_for_exons(refGD,exonDic,typ,fivePrimeFlank,threePrimeFlank,outfilename): 1030 | """ 1031 | With the help of genomedata archive extract the nucleotide sequences 1032 | from and around each exon and convservation scores and write them to a file. 1033 | refGD is the genomedata archive created for the reference genome. 1034 | typ can be one of the following: 1035 | "allExon": Extract the sequence of the whole exon. 1036 | "allExonPlusMinus": Like allExon but with flanking 5' and 3'. 1037 | "intronExon": Extract the sequence from the juction of this 1038 | exon and the previous intron. 1039 | "exonIntron": Extract the sequence from the juction of this 1040 | exon and the next intron. 1041 | fivePrimeFlank is the amount to extract extra from the 5' end. 1042 | threePrimeFlank is the amount to extract extra from the 3' end. 1043 | IF outfilename is "None" then no output file is written, only 1044 | relevant fields are added to the exon in exonDic. 
1045 | """ 1046 | sys.stderr.write("Extracting conservation stats and acceptor donor sites for exons from genomedata archive\n") 1047 | # this is the trackname for phastCons scores loaded from wig files 1048 | trackName="phastCons" 1049 | # 1050 | 1051 | if typ=="allExon": 1052 | fivePrimeFlank=0; threePrimeFlank=0 1053 | # 1054 | if outfilename!="None": 1055 | outfile=open(outfilename,'w') 1056 | # header line 1057 | outfile.write("CHR\tstart\tend\tstrand\tExonID\tacceptor2bp\tdonor2bp\tpreAcceptorCons\taccepterCons1\taccepterCons2\texon5primeCons\texonMidCons\texon3primeCons\tdonorCons1\tdonorCons2\tpostDonorCons\n") 1058 | # 1059 | with Genome(refGD) as genome: 1060 | lineCount=0 1061 | for id in exonDic: 1062 | print (id) 1063 | e=exonDic[id] 1064 | ch,st,en="chr"+e.basicInfoDic["chromosome"], e.basicInfoDic["start_coord"], e.basicInfoDic["end_coord"] 1065 | if e.exon_type=="partCoding": 1066 | codingSt=min(int(e.codingExon[0])-1,int(e.codingExon[1])-1) 1067 | codingEn=max(int(e.codingExon[0])-1,int(e.codingExon[1])-1) 1068 | # 1069 | stOrig=st; enOrig=en; 1070 | strand,id=e.basicInfoDic["strand"], e.basicInfoDic["exon_id"] 1071 | # off by one error fix by -1 1072 | st=int(st)-1 1073 | en=int(en)-1 1074 | if strand=="+": 1075 | if typ=="intronExon": 1076 | en=st # make sure we're around the first bp of exon 1077 | st=st-fivePrimeFlank # make sure 5' part is of size fivePrimeFlank including st 1078 | en=en+threePrimeFlank # make sure 3' part is of size threePrimeFlank including en 1079 | elif typ=="exonIntron": 1080 | st=en # make sure we're around the last bp of exon 1081 | st=st-fivePrimeFlank+1 1082 | en=en+threePrimeFlank+1 1083 | elif typ=="allExonPlusMinus": 1084 | st=st-fivePrimeFlank 1085 | en=en+threePrimeFlank+1 1086 | # 1087 | #id=id+"_plusStrand" 1088 | sq=genome[ch].seq[st:en].tostring().lower().upper() 1089 | allScores=(genome[ch])[st:en,trackName] 1090 | else: 1091 | if typ=="intronExon": 1092 | st=en # make sure we're around the first bp of exon 1093 | en=en+fivePrimeFlank+1 1094 | st=st-threePrimeFlank+1 1095 | elif typ=="exonIntron": 1096 | en=st # make sure we're around the last bp of exon 1097 | en=en+fivePrimeFlank 1098 | st=st-threePrimeFlank 1099 | elif typ=="allExonPlusMinus": 1100 | st=st-threePrimeFlank 1101 | en=en+fivePrimeFlank+1 1102 | # 1103 | #id=id+"_minusStrand" 1104 | sq=genome[ch].seq[st:en].tostring() 1105 | sq=sq.lower().translate(complement)[::-1].upper() # reverse complement 1106 | allScores=(genome[ch])[st:en,trackName][::-1] 1107 | # 1108 | print (sq) 1109 | print (allScores) 1110 | print (genome[ch].seq[st:en].tostring()) 1111 | if e.exon_type=="partCoding": 1112 | codingScores=(genome[ch])[codingSt:codingEn,trackName][::-1] 1113 | ### Extract all the scores to be written to the output file ### 1114 | acceptor2bp=sq[fivePrimeFlank-2:fivePrimeFlank] 1115 | donor2bp=(sq[-threePrimeFlank:])[0:2] 1116 | # 1117 | x=allScores[:fivePrimeFlank-2] 1118 | preAcceptorCons=np.nanmean(x) 1119 | # 1120 | accepterCons1=allScores[fivePrimeFlank-2] 1121 | accepterCons2=allScores[fivePrimeFlank-1] 1122 | # 1123 | x=allScores[fivePrimeFlank:fivePrimeFlank+(fivePrimeFlank-2)] 1124 | exon5primeCons=np.nanmean(x) 1125 | # 1126 | x=allScores[fivePrimeFlank+(fivePrimeFlank-2):-(threePrimeFlank+(threePrimeFlank-2))] 1127 | exonMidCons=np.nanmean(x) 1128 | # 1129 | x=allScores[-(threePrimeFlank+(threePrimeFlank+2)):-threePrimeFlank] 1130 | exon3primeCons=np.nanmean(x) 1131 | # 1132 | donorCons1=allScores[-threePrimeFlank] 1133 | donorCons2=allScores[-threePrimeFlank+1] 
1134 | # 1135 | x=allScores[-threePrimeFlank+2:] 1136 | postDonorCons=np.nanmean(x) 1137 | # 1138 | #first20bp=allScores[:20] 1139 | #outfile.write("%s\t%d\t%d\t%s\t%s\t%s\t%s\t" % (ch,stOrig,enOrig,strand,id,acceptor2bp,donor2bp)) 1140 | #outfile.write("\t".join([repr(x) for x in first20bp])+"\n") 1141 | exonDic[id].acceptor2bp=acceptor2bp 1142 | exonDic[id].donor2bp=donor2bp 1143 | exonDic[id].phastConsScores=allScores 1144 | exonDic[id].avgConsScore=np.nanmean(allScores[fivePrimeFlank:-threePrimeFlank]) 1145 | if e.exon_type=="partCoding": 1146 | exonDic[id].avgCodingConsScore=np.nanmean(codingScores) 1147 | 1148 | if lineCount%100000==0: 1149 | sys.stderr.write(str(lineCount)+"\t") 1150 | lineCount+=1 1151 | print (sq) 1152 | print ("%s\t%d\t%d\t%s\t%s\t%s\t%s\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n" % \ 1153 | (ch,stOrig,enOrig,strand,id,acceptor2bp,donor2bp, preAcceptorCons, accepterCons1, accepterCons2, exon5primeCons,\ 1154 | exonMidCons, exon3primeCons, donorCons1, donorCons2, postDonorCons, exonDic[id].avgCodingConsScore, exonDic[id].avgConsScore)) 1155 | 1156 | if outfilename!="None": 1157 | outfile.write("%s\t%d\t%d\t%s\t%s\t%s\t%s\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n" % \ 1158 | (ch,stOrig,enOrig,strand,id,acceptor2bp,donor2bp, preAcceptorCons, accepterCons1, accepterCons2, exon5primeCons,\ 1159 | exonMidCons, exon3primeCons, donorCons1, donorCons2, postDonorCons, exonDic[id].avgCodingConsScore, exonDic[id].avgConsScore)) 1160 | # 1161 | ### 1162 | # 1163 | # 1164 | sys.stderr.write("\n\n") 1165 | if outfilename!="None": 1166 | outfile.close() 1167 | return exonDic 1168 | 1169 | def assign_firstMidLast_exon_counts(exonDic,transcriptDic): 1170 | for e in exonDic: 1171 | for i in range(len(exonDic[e].exonNumbers)): 1172 | transcriptLength=len(transcriptDic[exonDic[e].transcriptIds[i]].exons) 1173 | tempExNo=exonDic[e].exonNumbers[i] 1174 | #print [transcriptLength, tempExNo] 1175 | #single exon 1176 | if tempExNo==1 and tempExNo==transcriptLength: 1177 | exonDic[e].firstMidLast[0]+=1 1178 | exonDic[e].firstMidLast[2]+=1 1179 | #first exon 1180 | elif tempExNo==1: 1181 | exonDic[e].firstMidLast[0]+=1 1182 | #last exon 1183 | elif tempExNo==transcriptLength: 1184 | exonDic[e].firstMidLast[2]+=1 1185 | else: 1186 | exonDic[e].firstMidLast[1]+=1 1187 | # 1188 | #if exonDic[e].firstMidLast[0]>0 or exonDic[e].firstMidLast[2]>0: 1189 | #print exonDic[e].get_summary_string() 1190 | # 1191 | return exonDic 1192 | 1193 | # Testing functionalities 1194 | def main(argv): 1195 | orgId1="human"; orgId2="mouse"; 1196 | refGD1="/home/fao150/proj/2015orthoR01/results/2015-03-17_creating-genomedata-archives-for-refs/hg38" 1197 | refGD2="/home/fao150/proj/2015orthoR01/results/2015-03-17_creating-genomedata-archives-for-refs/mm10" 1198 | 1199 | # outdir="GTFsummaries"; 1200 | if len(argv)==1: 1201 | return 1202 | 1203 | outdir=argv[1] 1204 | os.system("mkdir -p "+outdir) 1205 | os.system("mkdir -p "+outdir+"/GTFsummaries") 1206 | #infilename="/projects/b1017/shared/Ensembl-files/Homo_sapiens.GRCh38.78.gtf.gz" 1207 | infilename="Homo_sapiens.GRCh38.102.gtf" 1208 | #geneDic1,transcriptDic1,exonDic1,infoDic1=parse_organism_GTF(orgId1, infilename, outdir+"/GTFsummaries") 1209 | 1210 | #infilename="/projects/b1017/shared/Ensembl-files/Mus_musculus.GRCm38.78.gtf.gz" 1211 | infilename="Mus_musculus.GRCm38.102.gtf" 1212 | #geneDic2,transcriptDic2,exonDic2,infoDic2=parse_organism_GTF(orgId2, infilename, outdir+"/GTFsummaries") 1213 | 1214 | ## these 
two files were downloaded by hand selecting columns from Ensembl's Biomart 1215 | ## I weren't able to redo the same column selections recently so I decided to switch to 1216 | ## parsing the orthology information from readily available Ensembl files like below ones: 1217 | ## ftp://ftp.ensembl.org/pub/release-80/mysql/ensembl_mart_80/ 1218 | ## hsapiens_gene_ensembl__homolog_mmus__dm.txt.gz 1219 | 1220 | #infilename="/projects/b1017/shared/Ensembl-files/Ensembl-human-GRCh38-to-mouse-GRCm38.p3.txt.gz" 1221 | #genePairsHumanToMouse=parse_ensembl_gene_pairings(orgId1,orgId2,infilename) 1222 | #infilename="/projects/b1017/shared/Ensembl-files/Ensembl-mouse-GRCm38.p3-to-human-GRCh38.txt.gz" 1223 | #genePairsHumanToMouse=parse_ensembl_gene_pairings(orgId1,orgId2,infilename) 1224 | 1225 | #### Rewritten: Abhijit #### 1226 | infilename="Ensembl-human-GRCh38-to-mouse-GRCm38.Formatted.txt" 1227 | #genePairsHumanToMouse=parse_ensembl_gene_pairings(infilename) 1228 | 1229 | infilename="Ensembl-mouse-GRCm38-to-human-GRCh38.Formatted.txt" 1230 | #genePairsMouseToHuman=parse_ensembl_gene_pairings(infilename) 1231 | #consistency_check(genePairsHumanToMouse,genePairsMouseToHuman) 1232 | 1233 | ## if consistency check is ok then just use one side. This is OK for one2one mappings. 1234 | #genePairsDic=genePairsHumanToMouse 1235 | #os.system("mkdir -p "+outdir+"/perGenePairPickledInfo") 1236 | #pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir+"/perGenePairPickledInfo") 1237 | 1238 | #infilename="/projects/b1017/shared/Ensembl-files/hsapiens_gene_ensembl__homolog_mmus__dm.txt.gz" 1239 | #infilename="hsapiens_gene_ensembl__homolog_mmusculus__dm.txt" 1240 | #proteinToGeneDic,genePairsDic,proteinPairsDic=parse_ensembl_geneAndProtein_pairings(infilename,{},{}) 1241 | #print (["1",len(proteinToGeneDic),len(genePairsDic),len(proteinPairsDic)]) 1242 | 1243 | #infilename="/projects/b1017/shared/Ensembl-files/mmusculus_gene_ensembl__homolog_hsap__dm.txt.gz" 1244 | #infilename="mmusculus_gene_ensembl__homolog_hsapiens__dm.txt" 1245 | #proteinToGeneDic,genePairsDic,proteinPairsDic=parse_ensembl_geneAndProtein_pairings(infilename,proteinToGeneDic,proteinPairsDic) 1246 | #print (["2",len(proteinToGeneDic),len(genePairsDic),len(proteinPairsDic)]) 1247 | 1248 | 1249 | #exonDic1=assign_firstMidLast_exon_counts(exonDic1,transcriptDic1) 1250 | #exonDic2=assign_firstMidLast_exon_counts(exonDic2,transcriptDic2) 1251 | 1252 | typ="allExonPlusMinus" 1253 | outfilename="None" 1254 | fivePrimeFlank=12; threePrimeFlank=12 1255 | #exonDic1=extract_conservation_stats_for_exons(refGD1,exonDic1,typ,fivePrimeFlank,threePrimeFlank,outfilename) 1256 | #exonDic2=extract_conservation_stats_for_exons(refGD2,exonDic2,typ,fivePrimeFlank,threePrimeFlank,outfilename) 1257 | 1258 | #outdir=argv[1]+"/after" 1259 | #os.system("mkdir -p "+outdir) 1260 | #print_some_summary(orgId1, geneDic1,transcriptDic1,exonDic1,{}, outdir) 1261 | #print_some_summary(orgId2, geneDic2,transcriptDic2,exonDic2,{}, outdir) 1262 | 1263 | # outdir="perGenePairExonLists" 1264 | if len(argv)==2: 1265 | return 1266 | 1267 | #outdir=argv[2] 1268 | #os.system("mkdir -p "+outdir) 1269 | 1270 | #pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir) 1271 | 1272 | # outfilename=outdir+"/genePairsSummary-one2one.txt" 1273 | # print_one2one_genePairs(genePairsDic,geneDic1,geneDic2,outfilename) # either way is ok since one2one 1274 | # 1275 | # 
print_one2one_exonListPairs(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,orgId1,orgId2,outdir) 1276 | # print_one2one_transcriptListPairs(genePairsDic,geneDic1,geneDic2,transcriptDic1,transcriptDic2,orgId1,orgId2,outdir) 1277 | 1278 | return 1279 | 1280 | if __name__ == "__main__": 1281 | main(sys.argv) 1282 | 1283 | 1284 | -------------------------------------------------------------------------------- /Human-Monkey-Processed-Data/scripts/liftOver: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ay-lab/ExTraMapper/ff8bf6399e457c041e10ab8d94c83ae54414b273/Human-Monkey-Processed-Data/scripts/liftOver -------------------------------------------------------------------------------- /Human-Monkey-Processed-Data/scripts/liftover-withMultiples: -------------------------------------------------------------------------------- 1 | #!/bin/bash -ex 2 | set -o pipefail 3 | set -o errexit 4 | 5 | source config.conf 6 | 7 | 8 | dataDir=${EXTRAMAPPER_DIR}/preprocess/data 9 | dataDirPerPair=${EXTRAMAPPER_DIR}/preprocess/data/$org1-$org2 10 | 11 | chainsDir=$dataDir/liftover_chains 12 | ensemblDir=$dataDirPerPair/ensemblDownloads 13 | 14 | liftOverFilesDir=$dataDirPerPair/liftoverRelatedFiles 15 | perExonLiftoverDir=$dataDirPerPair/perExonLiftoverCoords 16 | 17 | outdir=$liftOverFilesDir 18 | flank=$1 19 | minMatch=$2 20 | 21 | chain1to2=$3 22 | chain2to1=$4 23 | 24 | mkdir -p $ensemblDir 25 | 26 | GTFfile1=$ensemblDir/org1.gtf.gz 27 | GTFfile2=$ensemblDir/org2.gtf.gz 28 | org1to2homologFile=$ensemblDir/org1_homolog_org2.txt.gz 29 | org2to1homologFile=$ensemblDir/org2_homolog_org1.txt.gz 30 | refGDdir1=$ensemblDir/org1 # genomedata archive for org1 31 | refGDdir2=$ensemblDir/org2 # genomedata archive for org2 32 | 33 | ########################## need to add 1 to liftedOver coordinates to match UCSC coordinates ################### 34 | ############## HOWEVER, this is only correct if original/lifted strands are same -/- or +/+ #################### 35 | ############## THEREFORE, I account manually for this by checking the strand pairs #################### 36 | 37 | ############## ALSO, liftOver does not CHANGE the strand of original coordinates when used ###################### 38 | ############# without the -multiple option and it DOES with -multiple. ######################### 39 | ############ HENCE, I handle these two cases differently. 
######################## 40 | 41 | ## OLDER AND INCORRECT WAY #1 ########################################################################################################### 42 | # zcat $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed.gz | awk '{print $1"\t"$2+1"\t"$3+1"\t"$4"\t"$5}' \ 43 | # > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp 44 | # 45 | # zcat $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed.gz | awk '{print $1"\t"$2+1"\t"$3+1"\t"$4"\t"$5}' \ 46 | # > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 47 | # 48 | #rm -rf $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 49 | ############################################################################################################################################### 50 | 51 | 52 | # 53 | # first work on the partCoding exons 54 | suffix=flank$flank-minMatch$minMatch-multiples-partCoding 55 | 56 | # fourth fields stays the same, fifth is replaced by multiplicity, sixth will be the new strand after liftover 57 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org1_partCodingExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 58 | $chain1to2 org2_mapped-$suffix.bed org2_unmapped-$suffix.bed -minMatch=$minMatch -multiple 59 | 60 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org2_partCodingExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 61 | $chain2to1 org1_mapped-$suffix.bed org1_unmapped-$suffix.bed -minMatch=$minMatch -multiple 62 | 63 | # chr, start, end, exonId, Multiplicity, strand (after conversion) 64 | cat org1_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed 65 | cat org2_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 66 | 67 | # chr, start, end, exonId, Why unmapped, strand (before conversion) 68 | cat org1_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 69 | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed 70 | cat org2_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 71 | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 72 | 73 | rm -rf org1_mapped-$suffix.bed org2_mapped-$suffix.bed org1_unmapped-$suffix.bed org2_unmapped-$suffix.bed 74 | 75 | # take the intersections 76 | ## NEW AND CORRECT WAY - FOR ONLY liftOver with -multiple OPTION ########################################################################### 77 | cat $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp 78 | join $outdir/org2_partCodingExonsList.sorted.temp mapped.temp | \ 79 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \ 80 | | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp 81 | bedtools intersect -a $outdir/org1_allCodingExonsList.bed \ 82 | -b $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -wao \ 83 | > $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed 84 | 85 | bedtools intersect -b $outdir/org1_allCodingExonsList.bed \ 86 | -a $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -v \ 87 | > $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed 88 | 89 | cat $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed | awk '{print 
$4,$0}' | sort -k1,1 > mapped.temp 90 | join $outdir/org1_partCodingExonsList.sorted.temp mapped.temp | \ 91 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \ 92 | | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 93 | 94 | bedtools intersect -a $outdir/org2_allCodingExonsList.bed \ 95 | -b $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -wao \ 96 | > $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed 97 | 98 | bedtools intersect -b $outdir/org2_allCodingExonsList.bed \ 99 | -a $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -v \ 100 | > $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed 101 | ############################################################################################################################################### 102 | 103 | rm -rf $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp mapped.temp 104 | 105 | gzip $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 106 | gzip $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 107 | gzip $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed 108 | gzip $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed 109 | 110 | # 111 | # now work on all exons including the partCoding, nonCoding and fullCoding ones 112 | 113 | suffix=flank$flank-minMatch$minMatch-multiples 114 | 115 | # fourth fields stays the same, fifth is replaced by multiplicity, sixth will be the new strand after liftover 116 | #liftOver <(cat $outdir/org1_allExonsList.bed | awk '{if ($4=="+") print $1,$2-s,$3+s,$5,$4,$4; else print $1,$2-s,$3+s,$5,$4,$4;}' s=$flank) \ 117 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org1_allExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 118 | $chain1to2 org2_mapped-$suffix.bed org2_unmapped-$suffix.bed -minMatch=$minMatch -multiple 119 | 120 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org2_allExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 121 | $chain2to1 org1_mapped-$suffix.bed org1_unmapped-$suffix.bed -minMatch=$minMatch -multiple 122 | 123 | cat org1_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed 124 | cat org2_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 125 | #cat org1_unmapped-$suffix.bed | awk 'NR%2==1' | sort | uniq - 126 | #cat org2_unmapped-$suffix.bed | awk 'NR%2==1' | sort | uniq - 127 | 128 | cat org1_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 129 | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed 130 | cat org2_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 131 | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 132 | 133 | rm -rf org1_mapped-$suffix.bed org2_mapped-$suffix.bed org1_unmapped-$suffix.bed org2_unmapped-$suffix.bed 134 | 135 | # take the intersections 136 | ## NEW AND CORRECT WAY - FOR ONLY liftOver with -multiple OPTION 
###########################
137 | # This correction in coordinates leads to some exons that do not appear in any mapped, unmapped, or nonintersecting file. ##
138 | # There are only 2 such exons and they will be deemed unmapped (i.e., deleted from the second organism) #
139 | cat $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp
140 | join $outdir/org2_allExonsList.sorted.temp mapped.temp | \
141 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \
142 | | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp
143 | bedtools intersect -a $outdir/org1_allExonsList.bed \
144 | -b $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -wao \
145 | > $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed
146 | bedtools intersect -b $outdir/org1_allExonsList.bed \
147 | -a $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -v \
148 | > $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed
149 | 
150 | cat $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp
151 | join $outdir/org1_allExonsList.sorted.temp mapped.temp | \
152 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \
153 | | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp
154 | 
155 | bedtools intersect -a $outdir/org2_allExonsList.bed \
156 | -b $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -wao \
157 | > $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed
158 | 
159 | bedtools intersect -b $outdir/org2_allExonsList.bed \
160 | -a $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -v \
161 | > $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed
162 | 
163 | rm -rf $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp mapped.temp
164 | 
165 | gzip $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed
166 | gzip $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed
167 | gzip $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed
168 | gzip $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed
169 | 
170 | ###############################################################################################################################################
171 | 
172 | 
--------------------------------------------------------------------------------
/Human-Monkey-Processed-Data/scripts/parseAndPicklePerPair.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | ##############################################################################
3 | ### To use the functions in this lib simply import this python module using
4 | ### import ensemblUtils
5 | ### Then you'll be able to call functions with the proper arguments using
6 | ### returnVal=ensemblUtils.func1(arg1,arg2)
7 | ##############################################################################
8 | ##############################################################################
9 | import sys
10 | import os
11 | 
import string 12 | import math 13 | import gzip 14 | import _pickle as pickle 15 | 16 | # reads from exported environment variable 17 | ExTraMapperPath=os.environ['EXTRAMAPPER_DIR'] 18 | sys.path.append(ExTraMapperPath+"/scripts") 19 | from ensemblUtils import * 20 | 21 | # Testing functionalities 22 | def main(argv): 23 | indir=argv[1] 24 | orgId1="org1"; orgId2="org2"; 25 | refGD1=indir+"/genomedataArchives/org1" 26 | refGD2=indir+"/genomedataArchives/org2" 27 | 28 | 29 | # outdir="GTFsummaries"; 30 | if len(argv)==2: 31 | return 32 | 33 | outdir=argv[2] 34 | os.system("mkdir -p "+outdir) 35 | 36 | #infilename=indir+"/ensemblDownloads/org1.gtf.gz" 37 | infilename=indir+"/ensemblDownloads/org1.gtf" ## Abhijit 38 | geneDic1,transcriptDic1,exonDic1,infoDic1=parse_organism_GTF(orgId1, infilename, outdir) 39 | 40 | #infilename=indir+"/ensemblDownloads/org2.gtf.gz" 41 | infilename=indir+"/ensemblDownloads/org2.gtf" ## Abhijit 42 | geneDic2,transcriptDic2,exonDic2,infoDic2=parse_organism_GTF(orgId2, infilename, outdir) 43 | 44 | ## these two files were downloaded by hand selecting columns from Ensembl's Biomart 45 | ## I weren't able to redo the same column selections recently so I decided to switch to 46 | ## parsing the orthology information from readily available Ensembl files like below ones: 47 | ## ftp://ftp.ensembl.org/pub/release-80/mysql/ensembl_mart_80/ 48 | ## hsapiens_gene_ensembl__homolog_mmus__dm.txt.gz 49 | #infilename="/projects/b1017/shared/Ensembl-files/Ensembl-human-GRCh38-to-mouse-GRCm38.p3.txt.gz" 50 | #genePairsHumanToMouse=parse_ensembl_gene_pairings(infilename) 51 | #infilename="/projects/b1017/shared/Ensembl-files/Ensembl-mouse-GRCm38.p3-to-human-GRCh38.txt.gz" 52 | #genePairsMouseToHuman=parse_ensembl_gene_pairings(infilename) 53 | #consistency_check(genePairsHumanToMouse,genePairsMouseToHuman) 54 | ## if consistency check is ok then just use one side. This is OK for one2one mappings. 
55 | #genePairsDic=genePairsHumanToMouse 56 | #pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir) 57 | 58 | #infilename=indir+"/ensemblDownloads/org1_homolog_org2.txt.gz" 59 | infilename=indir+"/ensemblDownloads/org1_homolog_org2.txt" ## Abhijit 60 | proteinToGeneDic,genePairsDic,proteinPairsDic=parse_ensembl_geneAndProtein_pairings(infilename,{},{}) 61 | print (["1",len(proteinToGeneDic),len(genePairsDic),len(proteinPairsDic)]) 62 | 63 | #infilename=indir+"/ensemblDownloads/org2_homolog_org1.txt.gz" 64 | infilename=indir+"/ensemblDownloads/org2_homolog_org1.txt" ## Abhijit 65 | proteinToGeneDic,genePairsDic,proteinPairsDic=parse_ensembl_geneAndProtein_pairings(infilename,proteinToGeneDic,proteinPairsDic) 66 | print (["2",len(proteinToGeneDic),len(genePairsDic),len(proteinPairsDic)]) 67 | 68 | 69 | exonDic1=assign_firstMidLast_exon_counts(exonDic1,transcriptDic1) 70 | exonDic2=assign_firstMidLast_exon_counts(exonDic2,transcriptDic2) 71 | 72 | typ="allExonPlusMinus" 73 | outfilename="None" 74 | fivePrimeFlank=12; threePrimeFlank=12 75 | 76 | ###### Not required ###### 77 | #exonDic1=extract_conservation_stats_for_exons(refGD1,exonDic1,typ,fivePrimeFlank,threePrimeFlank,outfilename) 78 | #exonDic2=extract_conservation_stats_for_exons(refGD2,exonDic2,typ,fivePrimeFlank,threePrimeFlank,outfilename) 79 | ###### 80 | 81 | outdir=argv[2] # overwrite previous summaries 82 | os.system("mkdir -p "+outdir) 83 | print_some_summary(orgId1, geneDic1,transcriptDic1,exonDic1,{}, outdir) 84 | print_some_summary(orgId2, geneDic2,transcriptDic2,exonDic2,{}, outdir) 85 | 86 | # outdir="perGenePairExonLists" 87 | if len(argv)==3: 88 | return 89 | 90 | outdir=argv[3] 91 | os.system("mkdir -p "+outdir) 92 | 93 | outfilename=outdir+"/genePairsSummary-one2one.txt" 94 | print_one2one_genePairs(genePairsDic,geneDic1,geneDic2,outfilename) # either way is ok since one2one 95 | 96 | geneOnlyOrthoDic1,transcriptOnlyOrthoDic1,exonOnlyOrthoDic1, geneOnlyOrthoDic2,transcriptOnlyOrthoDic2,exonOnlyOrthoDic2=pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir) 97 | 98 | print ([len(geneDic1), len(geneDic2)]) 99 | print (len(geneOnlyOrthoDic1)) 100 | print (len(geneOnlyOrthoDic2)) 101 | 102 | outdir=argv[2]+"/onlyOrthologAndCodingGenes" 103 | os.system("mkdir -p "+outdir) 104 | print (outdir) 105 | print_some_summary(orgId1, geneOnlyOrthoDic1,transcriptOnlyOrthoDic1,exonOnlyOrthoDic1,{}, outdir) 106 | print_some_summary(orgId2, geneOnlyOrthoDic2,transcriptOnlyOrthoDic2,exonOnlyOrthoDic2,{}, outdir) 107 | 108 | # 109 | # print_one2one_exonListPairs(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,orgId1,orgId2,outdir) 110 | # print_one2one_transcriptListPairs(genePairsDic,geneDic1,geneDic2,transcriptDic1,transcriptDic2,orgId1,orgId2,outdir) 111 | 112 | return 113 | 114 | if __name__ == "__main__": 115 | main(sys.argv) 116 | 117 | -------------------------------------------------------------------------------- /Human-Monkey-Processed-Data/scripts/splitExonsIntoIndividualFiles.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | 4 | def main(argv): 5 | infilename=argv[1] 6 | outdir=argv[2] 7 | whichCol=int(argv[3])-1 8 | fileSuffix=argv[4] 9 | infile=open(infilename,'r') 10 | lastExon="dummy" 11 | outfile=open("dummy.txt",'w') 12 | for line in infile: 13 | newExon=line.rstrip().split()[whichCol] # where exon name is 14 | 
if newExon!=lastExon:
15 |             outfile.close()
16 |             outfile=open(outdir+"/"+newExon+fileSuffix,'w')
17 |         #
18 |         outfile.write(line)
19 |         lastExon=newExon
20 |     #
21 |     outfile.close()
22 |     return
23 | 
24 | if __name__ == "__main__":
25 |     main(sys.argv)
26 | #
27 | 
28 | 
--------------------------------------------------------------------------------
/Human-Mouse-Preprocess-Data/README.md:
--------------------------------------------------------------------------------
1 | ## Steps to generate the input files (Human - Mouse)
2 | Users should run the _extMpreprocess_ script to generate the input files. All input files will be generated under the _preprocess/data_ folder. All required executables and scripts are provided here. The _extMpreprocess_ script has 7 individual steps and should be run in the following manner.
3 | 
4 | ### Run the following steps
5 | 
6 | - ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) For help, type<br>
7 | 
8 | ```bash
9 | ./extMpreprocess help
10 | 
11 | This script will download and preprocess the dataset required for exon-pair and transcript pair finding by ExTraMapper.
12 | Type ./extMpreprocess <config.conf> <step> to execute the script.
13 | Type ./extMpreprocess example to print an example config.conf file.
14 | 
15 | This script will run seven (7) sequential steps to create the inputs for the ExTraMapper program.
16 | Users can provide step numbers (1-7) or all as the argument of this script.
17 | Short description of the individual steps:
18 | Step 1: Download per organism specific files e.g. reference genomes, gene annotation files.
19 | Step 2: Will create genomedata archives with the genomes of org1 and org2 (Make sure to install genomedata package).
20 | Step 3: Pickle files for each homologous gene pair will be created.
21 | Step 4: Perform coordinate liftOver of exons with multiple mappings (This step requires bedtools and liftOver executables).
22 | Step 5-7: postprocessing the liftOver files.
23 | 
24 | example:
25 | 
26 | ./extMpreprocess config.human-mouse.conf all
27 | ```
28 | 
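Besides `all`, a single step number (1-7) can be passed as the second argument, which is handy for resuming a run after a failed download. A minimal sketch (assuming the config file shipped in this folder and that each step's prerequisites have already completed):

```bash
# Run the preprocessing step by step instead of passing "all"
./extMpreprocess config.human-mouse.conf 1   # download genomes, liftOver chains, Ensembl GTFs and homolog tables
./extMpreprocess config.human-mouse.conf 2   # build the genomedata archives
./extMpreprocess config.human-mouse.conf 3   # pickle per gene-pair information
```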
29 |
30 | - ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) The script requires the genomedata package, which can be installed by running the following command.<br>
31 | 32 | ```bash 33 | $ pip install genomedata --user 34 | ``` 35 |
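Step 2 of the pipeline calls the `genomedata-load-seq` and `genomedata-close-data` command-line tools, so it is worth checking that they ended up on your `PATH` after installation. A minimal sketch (the `~/.local/bin` location is only the usual default for `pip install --user`):

```bash
# Verify the genomedata command-line tools used in step 2 are reachable
command -v genomedata-load-seq genomedata-close-data \
  || echo "genomedata tools not found; add your pip --user bin directory (e.g. ~/.local/bin) to PATH"
```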
36 | 37 |
38 | 
39 | #### Once finished, the _extMpreprocess_ script should produce the _preprocess_ folder with the following subfolders.<br>
40 | 41 | ```bash 42 | ./preprocess 43 | |-- bin 44 | | `-- liftOver 45 | |-- data 46 | |-- human-mouse 47 | | |-- GTFsummaries 48 | | | |-- onlyOrthologAndCodingGenes 49 | | | | |-- org1-allExons-GTFparsed.txt 50 | | | | |-- org1-allGenes-GTFparsed.txt 51 | | | | |-- org1-allTranscripts-GTFparsed.txt 52 | | | | |-- org2-allExons-GTFparsed.txt 53 | | | | |-- org2-allGenes-GTFparsed.txt 54 | | | | `-- org2-allTranscripts-GTFparsed.txt 55 | | | |-- org1-allExons-GTFparsed.txt 56 | | | |-- org1-allGenes-GTFparsed.txt 57 | | | |-- org1-allTranscripts-GTFparsed.txt 58 | | | |-- org2-allExons-GTFparsed.txt 59 | | | |-- org2-allGenes-GTFparsed.txt 60 | | | `-- org2-allTranscripts-GTFparsed.txt 61 | | |-- ensemblDownloads 62 | | | |-- org1.gtf 63 | | | |-- org1.gtf.gz 64 | | | |-- org1_homolog_org2.txt 65 | | | |-- org1_homolog_org2.txt.gz 66 | | | |-- org2.gtf 67 | | | |-- org2.gtf.gz 68 | | | |-- org2_homolog_org1.txt 69 | | | `-- org2_homolog_org1.txt.gz 70 | | |-- genePairsSummary-one2one.txt 71 | | |-- genomedataArchives 72 | | | |-- org1 [25 entries exceeds filelimit, not opening dir] 73 | | | `-- org2 [22 entries exceeds filelimit, not opening dir] 74 | | |-- liftoverRelatedFiles [56 entries exceeds filelimit, not opening dir] 75 | | |-- perExonLiftoverCoords 76 | | | |-- org1 [654707 entries exceeds filelimit, not opening dir] 77 | | | `-- org2 [484860 entries exceeds filelimit, not opening dir] 78 | | |-- perGenePairPickledInfo [15804 entries exceeds filelimit, not opening dir] 79 | | 80 | |-- liftover_chains 81 | | |-- hg38 82 | | | `-- liftOver 83 | | | `-- hg38ToMm10.over.chain.gz 84 | | `-- mm10 85 | | `-- liftOver 86 | | `-- mm10ToHg38.over.chain.gz 87 | `-- reference_genomes 88 | |-- hg38 [27 entries exceeds filelimit, not opening dir] 89 | `-- mm10 [24 entries exceeds filelimit, not opening dir] 90 | 91 | ``` 92 |
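A quick sanity check after the run is to look at the one-to-one ortholog summary and the parsed GTF summaries listed in the tree above. A minimal sketch, assuming the default output location:

```bash
# Number of one-to-one ortholog gene pairs found between human and mouse
wc -l preprocess/data/human-mouse/genePairsSummary-one2one.txt

# First lines of the parsed exon summary for org1 (human), orthologous coding genes only
head preprocess/data/human-mouse/GTFsummaries/onlyOrthologAndCodingGenes/org1-allExons-GTFparsed.txt
```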
93 | 
94 | ##### The whole process should take several hours to complete!
95 | ##### [(Check also the Human-Monkey data processing steps)](https://github.com/ay-lab/ExTraMapper/tree/master/Human-Monkey-Processed-Data)
96 | 
--------------------------------------------------------------------------------
/Human-Mouse-Preprocess-Data/config.human-mouse.conf:
--------------------------------------------------------------------------------
1 | # reference genome versions
2 | ref1=hg38
3 | ref2=mm10
4 | 
5 | # short names of organisms
6 | org1=human
7 | org2=mouse
8 | 
9 | # Ensembl release version number to be used for both organisms
10 | releaseNo=102
11 | 
12 | # Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/
13 | org1EnsemblName=homo_sapiens
14 | org2EnsemblName=mus_musculus
15 | 
16 | # Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_81
17 | org1EnsemblMartName=hsapiens
18 | org2EnsemblMartName=mmusculus
19 | org1EnsemblMartNameShort=hsap
20 | org2EnsemblMartNameShort=mmus
21 | 
22 | #liftOver executable path (Please make sure it is executable, chmod u+x liftOver)
23 | liftOver=/Human-Mouse-Preprocess-Data/scripts/liftOver
24 | 
--------------------------------------------------------------------------------
/Human-Mouse-Preprocess-Data/extMpreprocess:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 | 
3 | ## This script will download and preprocess the dataset required for
4 | ## exon-pair and transcript pair finding by ExTraMapper.
5 | ## The script requires a config.conf file which will direct this script
6 | ## to download and process the essential data.
7 | 
8 | ##################### config.conf file #####################
9 | ## Example of a human-monkey config.conf file:
10 | ##
11 | ## #Reference genome versions
12 | ## ref1=hg38
13 | ## ref2=rheMac10
14 | ##
15 | ## #Short names of organisms
16 | ## org1=human
17 | ## org2=rhesus
18 | ##
19 | ## #Ensembl release version number to be used for both organisms
20 | ## releaseNo=102
21 | ##
22 | ## #Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/
23 | ## org1EnsemblName=homo_sapiens
24 | ## org2EnsemblName=macaca_mulatta
25 | ##
26 | ## #Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_102
27 | ## org1EnsemblMartName=hsapiens
28 | ## org2EnsemblMartName=mmulatta
29 | ## org1EnsemblMartNameShort=hsap
30 | ## org2EnsemblMartNameShort=mmul
31 | ##
32 | ## #liftOver executable path (Check here https://hgdownload.cse.ucsc.edu/admin/exe)
33 | ## liftOver=/usr/bin/liftOver
34 | ##
35 | ##
36 | ## Example of a human-mouse config.conf file:
37 | ##
38 | ## #Reference genome versions
39 | ## ref1=hg38
40 | ## ref2=mm10
41 | ##
42 | ## #Short names of organisms
43 | ## org1=human
44 | ## org2=mouse
45 | ##
46 | ## #Ensembl release version number to be used for both organisms
47 | ## releaseNo=102
48 | ##
49 | ## #Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/
50 | ## org1EnsemblName=homo_sapiens
51 | ## org2EnsemblName=mus_musculus
52 | ##
53 | ## #Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_102
54 | ## org1EnsemblMartName=hsapiens
55 | ## org2EnsemblMartName=mmusculus
56 | ## org1EnsemblMartNameShort=hsap
57 | ## org2EnsemblMartNameShort=mmus
58 | ##
59 | ## #liftOver executable path (Check here https://hgdownload.cse.ucsc.edu/admin/exe)
60 | ## liftOver=/usr/bin/liftOver
61 | ##
62 | ############################################################
63 | 
64 | if ($#ARGV == -1 || $ARGV[0] eq "help") {
65 |     print ("\n");
66 |     print ("This script will download and preprocess the dataset required for exon-pair and transcript pair finding by ExTraMapper.\n");
67 |     print ("Type ./extMpreprocess <config.conf> <step> to execute the script.\n");
68 |     print ("Type ./extMpreprocess example to print an example config.conf file.\n\n");
69 |     print ("This script will run seven (7) sequential steps to create the inputs for the ExTraMapper program.\n");
70 |     print ("Users can provide step numbers (1-7) or all as the argument of this script.\n");
71 |     print ("Short description of the individual steps:\n");
72 |     print ("Step 1: Download per organism specific files e.g. 
reference genomes, gene annotation files.\n"); 73 | print ("Step 2: Will create genomedata archives with the genomes of org1 and org2 (Make sure to install genomedata package).\n"); 74 | print ("Step 3: Pickle files for each homologous gene pair will be created.\n"); 75 | print ("Step 4: Perform coordinate liftOver of exons with multiple mappings (This step requires bedtools and liftOver executables).\n"); 76 | print ("Step 5-7: postprocessing the liftOver files.\n"); 77 | print ("\n"); 78 | exit(); 79 | } elsif ($ARGV[0] eq "example") { 80 | my @exmpl = "# reference genome versions 81 | ref1=hg38 82 | ref2=mm10 83 | 84 | # short names of organisms 85 | org1=human 86 | org2=mouse 87 | 88 | # Ensembl release version number to be used for both organisms 89 | releaseNo=102 90 | 91 | # Find out the standard Ensembl names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/gtf/ 92 | org1EnsemblName=homo_sapiens 93 | org2EnsemblName=mus_musculus 94 | 95 | # Find out the full and short Ensembl Mart names for your organisms of interest from ftp://ftp.ensembl.org/pub/release-81/mysql/ensembl_mart_81 96 | org1EnsemblMartName=hsapiens 97 | org2EnsemblMartName=mmusculus 98 | org1EnsemblMartNameShort=hsap 99 | org2EnsemblMartNameShort=mmus 100 | 101 | #liftOver executable path (Check here https://hgdownload.cse.ucsc.edu/admin/exe) 102 | liftOver=/usr/bin/liftOver\n"; 103 | 104 | print (@exmpl); 105 | print ("\n"); 106 | open (out, ">config.human-mouse.conf"); 107 | print out @exmpl; 108 | close out; 109 | print ("The example config.human-mouse.conf file is written\n"); 110 | exit; 111 | } 112 | my ($configfile, $step) = @ARGV; 113 | chomp ($configfile, $step); 114 | 115 | #### File and folder check #### 116 | die "The $configfile does not exists, exit!" 
unless -e "$configfile";
117 | 
118 | 
119 | #### Get the environmental variables ####
120 | $ENV{'EXTRAMAPPER_DIR'} = $ENV{'PWD'};
121 | open(in, $configfile);
122 | while (my $var = <in>) {
123 |     chomp $var;
124 |     if ($var =~ /=/) {
125 |         $var_n = (split(/=/,$var))[0];
126 |         $var_v = (split(/=/,$var))[1];
127 |         $ENV{$var_n} = $var_v;
128 |     }
129 | }
130 | close in;
131 | 
132 | #### Set the variable folders and files ####
133 | $dataDir = "$ENV{'EXTRAMAPPER_DIR'}/preprocess/data";
134 | $dataDirPerPair = "$ENV{'EXTRAMAPPER_DIR'}/preprocess/data/$ENV{'org1'}-$ENV{'org2'}";
135 | $referenceGenomesDir = "$dataDir/reference_genomes";
136 | $chainsDir = "$dataDir/liftover_chains";
137 | $ensemblDir = "$dataDirPerPair/ensemblDownloads";
138 | $genomedataDir = "$dataDirPerPair/genomedataArchives";
139 | $GTFsummaryDir = "$dataDirPerPair/GTFsummaries";
140 | $perGenePairPickleDir= "$dataDirPerPair/perGenePairPickledInfo";
141 | $liftOverFilesDir = "$dataDirPerPair/liftoverRelatedFiles";
142 | $perExonLiftoverDir = "$dataDirPerPair/perExonLiftoverCoords";
143 | 
144 | #### Main functions and sub-routines ####
145 | sub getfasta {
146 |     my $path = $_[0];
147 |     my $org = $_[1];
148 |     my %chr;
149 |     open(chrname,"$path/$org/name_chr.txt");
150 |     while (<chrname>){
151 |         chomp $_;
152 |         $chr{$_} = 1;
153 |     }
154 |     close (chrname);
155 | 
156 |     my $file = "$path/$org/$org.fa.gz";
157 |     open(in, "zcat $file |");
158 |     while (<in>) {
159 |         chomp $_;
160 |         if ($_ =~ />/) {
161 |             $name = $_;
162 |             $ckpt = 0;
163 |             $name =~ s/>//g;
164 |             if ($chr{$name} ne "") {
165 |                 print ("Extracting $name from $org.fa.gz file\n");
166 |                 $ckpt = 1;
167 |                 open($out,"|gzip -c > $path/$org/$name.fa.gz");
168 |                 print $out (">$name\n");
169 |             } else {
170 |                 close ($out);
171 |             }
172 |         } else {
173 |             if ($ckpt == 1) {
174 |                 print $out ("$_\n");
175 |             }
176 |         }
177 |     }
178 |     close(in);
179 |     system("rm -rf $path/$org/$org.fa.gz");
180 |     print ("Finished extracting chromosomes and writing the individual *.fa.gz files\n");
181 |     print ("Removed $path/$org/$org.fa.gz\n");
182 | }
183 | 
184 | sub downloadrefgenome {
185 | 
186 |     my $path = $_[0];
187 |     my $org = $_[1];
188 |     if (!-d "$path/$org") {
189 |         print ("Creating $path/$org folder\n");
190 |         system("mkdir -p $path/$org");
191 |         print ("Running: wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/chromosomes/* --directory-prefix=$path/$org 2>&1 | grep \"No such directory\"\n");
192 |         my $error = `wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/chromosomes/* --directory-prefix=$path/$org 2>&1 | grep "No such directory"`;
193 |         if ($error =~ "No such directory") {
194 |             print ("There is no chromosome folder for $org. 
So, downloding the bigZip file and extracting them\n"); 195 | print ("Running: wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/bigZips/$org.fa.gz --directory-prefix=$path/$org 2> /dev/null\n"); 196 | system("wget --timestamping ftp://hgdownload.cse.ucsc.edu/goldenPath/$org/bigZips/$org.fa.gz --directory-prefix=$path/$org 2> /dev/null"); 197 | print ("Extracting the individual chromosomes\n"); 198 | print ("zcat $path/$org/$org.fa.gz |grep \">\" |grep -v \"_random\" |grep -v \"chrUn\" |sed 's/>//g' > $path/$org/name_chr.txt\n"); 199 | system("zcat $path/$org/$org.fa.gz |grep \">\" |grep -v \"_random\" |grep -v \"chrUn\" |sed 's/>//g' > $path/$org/name_chr.txt"); 200 | getfasta($path, $org); 201 | print "Reference genomes are downloaded in $path/$org\n"; 202 | } else { 203 | system("rm -rf $path/$org/*_random*"); 204 | system("rm -rf $path/$org/chrUn*"); 205 | system("rm -rf $path/$org/*_alt*"); 206 | } 207 | } else { 208 | print ("$path/$org folder already exists, skipping downloading the dataset\n"); 209 | } 210 | } 211 | 212 | sub downloadliftoverfiles { 213 | 214 | my $path = $_[0]; 215 | my $org1 = $_[1]; 216 | my $org2 = $_[2]; 217 | if (!-d "$path/$org1/liftOver") { 218 | print ("Creating $path/$org1/liftOver folder\n"); 219 | system("mkdir -p $path/$org1/liftOver"); 220 | my $ref2Cap =`echo $org2 | python -c "s=input(); print (s[0].upper()+s[1:])"`; 221 | chomp $ref2Cap; 222 | my $chain_name = $org1."To".$ref2Cap; 223 | print ("Running: wget http://hgdownload.cse.ucsc.edu/goldenPath/$org1/liftOver/$chain_name.over.chain.gz --directory-prefix=$path/$org1/liftOver\n"); 224 | system("wget http://hgdownload.cse.ucsc.edu/goldenPath/$org1/liftOver/$chain_name.over.chain.gz --directory-prefix=$path/$org1/liftOver 2> /dev/null"); 225 | print ("LiftOver chain saved to $path/$org1/liftOver/$chain_name.over.chain.gz\n"); 226 | } else { 227 | print ("$path/$org1 folder already exists, skipping download\n"); 228 | } 229 | } 230 | 231 | sub downloadensmblfiles { 232 | 233 | my $path = $_[0]; 234 | my $releaseNo = $_[1]; 235 | my $org1EnsemblName = $_[2]; 236 | my $org1EnsemblMartName = $_[3]; 237 | my $org2EnsemblName = $_[4]; 238 | my $org2EnsemblMartName = $_[5]; 239 | 240 | print ("Downloading GTF files\n"); 241 | if (!-e "$path/org1.gtf.gz") { 242 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org1EnsemblName/*.$releaseNo.gtf.gz -O $path/org1.gtf.gz\n"); 243 | system("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org1EnsemblName/*.$releaseNo.gtf.gz -O $path/org1.gtf.gz 2> /dev/null"); 244 | print ("GTF files downloaded in $path\n"); 245 | } else { 246 | print ("$path/org1.gtf.gz file exists, skipping download\n"); 247 | } 248 | if (!-e "$path/org2.gtf.gz") { 249 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org2EnsemblName/*.$releaseNo.gtf.gz -O $path/org2.gtf.gz\n"); 250 | system("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/gtf/$org2EnsemblName/*.$releaseNo.gtf.gz -O $path/org2.gtf.gz 2> /dev/null"); 251 | print ("GTF files downloaded in $path\n"); 252 | } else { 253 | print ("$path/org2.gtf.gz file exists, skipping download\n"); 254 | } 255 | 256 | print ("Downloading ENSEMBL homologs\n"); 257 | if (!-e "$path/org1_homolog_org2.txt.gz") { 258 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org1EnsemblMartName\_gene_ensembl__homolog_$org2EnsemblMartName\__dm.txt.gz -O $path/org1_homolog_org2.txt.gz\n"); 259 | system("wget 
ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org1EnsemblMartName\_gene_ensembl__homolog_$org2EnsemblMartName\__dm.txt.gz -O $path/org1_homolog_org2.txt.gz 2> /dev/null"); 260 | print ("ENSEMBL homolog downloaded in $path\n"); 261 | } else { 262 | print ("$path/org1_homolog_org2.txt.gz file exists, skipping download\n"); 263 | } 264 | 265 | if (!-e "$path/org2_homolog_org1.txt.gz") { 266 | print ("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org2EnsemblMartName\_gene_ensembl__homolog_$org1EnsemblMartName\__dm.txt.gz -O $path/org2_homolog_org1.txt.gz\n"); 267 | system("wget ftp://ftp.ensembl.org/pub/release-$releaseNo/mysql/ensembl_mart_$releaseNo/$org2EnsemblMartName\_gene_ensembl__homolog_$org1EnsemblMartName\__dm.txt.gz -O $path/org2_homolog_org1.txt.gz 2> /dev/null"); 268 | print ("ENSEMBL homolog downloaded in $path\n"); 269 | } else { 270 | print ("$path/org2_homolog_org1.txt.gz file exists, skipping download\n"); 271 | } 272 | 273 | } 274 | 275 | sub ltime { 276 | 277 | my $time = localtime; 278 | return($time); 279 | } 280 | 281 | sub genomedataarchive { 282 | 283 | my $path = $_[0]; 284 | my $org = $_[1]; 285 | my $ref = $_[2]; 286 | my $referenceGenomesDir = $_[3]; 287 | my $old_path = $ENV{'PWD'}; 288 | chdir $path; 289 | if (-e "$ref.fa") { 290 | print ("Deleting the existing $ref.fa\n"); 291 | system("rm -rf $ref.fa"); 292 | } 293 | if (!-d $org) { 294 | print ("Running : zcat $referenceGenomesDir/$ref/*.fa.gz > $ref.fa\n"); 295 | print ("Started at ",ltime(),"\n"); 296 | system("zcat $referenceGenomesDir/$ref/*.fa.gz > $ref.fa"); 297 | print ("Ended at ",ltime(),"\n"); 298 | print ("Running : genomedata-load-seq -d $org $ref.fa\n"); 299 | print ("Started at ",ltime(),"\n"); 300 | system("genomedata-load-seq -d $org $ref.fa"); 301 | system("genomedata-close-data $org"); 302 | print ("Ended at ",ltime(),"\n"); 303 | system("rm -rf $ref.fa"); 304 | } else { 305 | print ("$org genomedata exists, skipping the step\n"); 306 | } 307 | chdir $old_path; 308 | } 309 | 310 | sub parseAndPicklePerPair { 311 | 312 | my $extmapper_path = $_[0]; 313 | my $ensemblDir = $_[1]; 314 | my $dataDirPerPair = $_[2]; 315 | my $GTFsummaryDir = $_[3]; 316 | my $perGenePairPickleDir = $_[4]; 317 | 318 | if (!-e "$ensemblDir/org1.gtf") { 319 | print ("Running : gunzip -k $ensemblDir/org1.gtf.gz\n"); 320 | system("gunzip -k $ensemblDir/org1.gtf.gz"); 321 | } else { 322 | print ("$ensemblDir/org1.gtf file present, skipping gunzip action\n"); 323 | } 324 | if (!-e "$ensemblDir/org2.gtf") { 325 | print ("Running : gunzip -k $ensemblDir/org2.gtf.gz\n"); 326 | system("gunzip -k $ensemblDir/org2.gtf.gz"); 327 | } else { 328 | print ("$ensemblDir/org2.gtf file present, skipping gunzip action\n"); 329 | } 330 | if (!-e "$ensemblDir/org1_homolog_org2.txt") { 331 | print ("Running : gunzip -k $ensemblDir/org1_homolog_org2.txt.gz\n"); 332 | system("gunzip -k $ensemblDir/org1_homolog_org2.txt.gz"); 333 | } else { 334 | print ("$ensemblDir/org1_homolog_org2.txt file present, skipping gunzip action\n"); 335 | } 336 | if (!-e "$ensemblDir/org2_homolog_org1.txt") { 337 | print ("Running : gunzip -k $ensemblDir/org2_homolog_org1.txt.gz\n"); 338 | system("gunzip -k $ensemblDir/org2_homolog_org1.txt.gz"); 339 | } else { 340 | print ("$ensemblDir/org2_homolog_org1.txt file present, skipping gunzip action\n"); 341 | } 342 | 343 | if (!-d $perGenePairPickleDir) { 344 | print ("Running : python $extmapper_path/scripts/parseAndPicklePerPair.py $dataDirPerPair 
$GTFsummaryDir $perGenePairPickleDir\n"); 345 | print ("Started at ",ltime(),"\n"); 346 | system("python $extmapper_path/scripts/parseAndPicklePerPair.py $dataDirPerPair $GTFsummaryDir $perGenePairPickleDir"); 347 | print ("Ended at ",ltime(),"\n"); 348 | system("mv $perGenePairPickleDir/genePairsSummary-one2one.txt $dataDirPerPair/genePairsSummary-one2one.txt"); 349 | } else { 350 | print ("perGenePairPickleDir found, skipping\n"); 351 | } 352 | } 353 | 354 | sub liftoverexonmultiplemapping { 355 | 356 | my $GTFsummaryDir = $_[0]; 357 | my $liftOverFilesDir = $_[1]; 358 | my $chainsDir = $_[2]; 359 | my $ref1 = $_[3]; 360 | my $ref2 = $_[4]; 361 | my $extmapper_path = $_[5]; 362 | 363 | my $indir = "$GTFsummaryDir/onlyOrthologAndCodingGenes"; 364 | 365 | print ("Running : cat $indir/org1-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_allExonsList.bed\n"); 366 | print ("Started at ",ltime(),"\n"); 367 | system("cat $indir/org1-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_allExonsList.bed"); 368 | print ("Ended at ",ltime(),"\n"); 369 | 370 | print ("Running : cat $indir/org2-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allExonsList.bed\n"); 371 | print ("Started at ",ltime(),"\n"); 372 | system("cat $indir/org2-allExons-GTFparsed.txt | awk -v OFS='\\t' 'NR>1{print \$1,\$2,\$3,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allExonsList.bed"); 373 | print ("Ended at ",ltime(),"\n"); 374 | 375 | print ("Running : cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_partCodingExonsList.bed\n"); 376 | print ("Started at ",ltime(),"\n"); 377 | system("cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_partCodingExonsList.bed"); 378 | print ("Ended at ",ltime(),"\n"); 379 | 380 | print ("Running : cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_partCodingExonsList.bed\n"); 381 | print ("Started at ",ltime(),"\n"); 382 | system("cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"partCoding\" {print \$1,\$7,\$8,\$4,\$5}' | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_partCodingExonsList.bed"); 383 | print ("Ended at ",ltime(),"\n"); 384 | 385 | print ("Running : cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org1_f.temp\n"); 386 | print ("Started at ",ltime(),"\n"); 387 | system("cat $indir/org1-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org1_f.temp"); 388 | print ("Ended at ",ltime(),"\n"); 389 | 390 | print ("Running : cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org2_f.temp\n"); 391 | print ("Started at ",ltime(),"\n"); 392 | system("cat $indir/org2-allExons-GTFparsed.txt |awk -v OFS='\\t' '\$6==\"fullCoding\" {print \$1,\$2,\$3,\$4,\$5}' > $liftOverFilesDir/org2_f.temp"); 393 | print ("Ended at ",ltime(),"\n"); 394 | 395 | print ("Running : cat $liftOverFilesDir/org1_partCodingExonsList.bed $liftOverFilesDir/org1_f.temp | sort 
-k1,1 -k2,2n > $liftOverFilesDir/org1_allCodingExonsList.bed\n"); 396 | print ("Started at ",ltime(),"\n"); 397 | system("cat $liftOverFilesDir/org1_partCodingExonsList.bed $liftOverFilesDir/org1_f.temp | sort -k1,1 -k2,2n > $liftOverFilesDir/org1_allCodingExonsList.bed"); 398 | print ("Ended at ",ltime(),"\n"); 399 | 400 | print ("Running : cat $liftOverFilesDir/org2_partCodingExonsList.bed $liftOverFilesDir/org2_f.temp | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allCodingExonsList.bed\n"); 401 | print ("Started at ",ltime(),"\n"); 402 | system("cat $liftOverFilesDir/org2_partCodingExonsList.bed $liftOverFilesDir/org2_f.temp | sort -k1,1 -k2,2n > $liftOverFilesDir/org2_allCodingExonsList.bed"); 403 | print ("Ended at ",ltime(),"\n"); 404 | 405 | print ("Running : cat $liftOverFilesDir/org1_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allCodingExonsList.sorted.temp\n"); 406 | print ("Started at ",ltime(),"\n"); 407 | system("cat $liftOverFilesDir/org1_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allCodingExonsList.sorted.temp"); 408 | print ("Ended at ",ltime(),"\n"); 409 | 410 | print ("Running : cat $liftOverFilesDir/org2_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allCodingExonsList.sorted.temp\n"); 411 | print ("Started at ",ltime(),"\n"); 412 | system("cat $liftOverFilesDir/org2_allCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allCodingExonsList.sorted.temp"); 413 | print ("Ended at ",ltime(),"\n"); 414 | 415 | print ("Running : cat $liftOverFilesDir/org1_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allExonsList.sorted.temp\n"); 416 | print ("Started at ",ltime(),"\n"); 417 | system("cat $liftOverFilesDir/org1_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_allExonsList.sorted.temp"); 418 | print ("Ended at ",ltime(),"\n"); 419 | 420 | print ("Running : cat $liftOverFilesDir/org2_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allExonsList.sorted.temp\n"); 421 | print ("Started at ",ltime(),"\n"); 422 | system("cat $liftOverFilesDir/org2_allExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_allExonsList.sorted.temp"); 423 | print ("Ended at ",ltime(),"\n"); 424 | 425 | print ("Running : cat $liftOverFilesDir/org1_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_partCodingExonsList.sorted.temp\n"); 426 | print ("Started at ",ltime(),"\n"); 427 | system("cat $liftOverFilesDir/org1_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org1_partCodingExonsList.sorted.temp"); 428 | print ("Ended at ",ltime(),"\n"); 429 | 430 | print ("Running : cat $liftOverFilesDir/org2_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_partCodingExonsList.sorted.temp\n"); 431 | print ("Started at ",ltime(),"\n"); 432 | system("cat $liftOverFilesDir/org2_partCodingExonsList.bed |awk '{print \$5,\$0}' | sort -k1,1 > $liftOverFilesDir/org2_partCodingExonsList.sorted.temp"); 433 | print ("Ended at ",ltime(),"\n"); 434 | 435 | my $chain1to2=`ls $chainsDir/$ref1/liftOver/*.over.chain.gz`; 436 | my $chain2to1=`ls $chainsDir/$ref2/liftOver/*.over.chain.gz`; 437 | chomp ($chain1to2, $chain2to1); 438 | 439 | foreach my $minMatch (qw{1 0.95 0.9}) { 440 | print ("Running : $extmapper_path/scripts/liftover-withMultiples 0 $minMatch $chain1to2 
$chain2to1\n"); 441 | print ("Started at ",ltime(),"\n"); 442 | system("$extmapper_path/scripts/liftover-withMultiples 0 $minMatch $chain1to2 $chain2to1"); 443 | print ("Ended at ",ltime(),"\n"); 444 | } 445 | system("rm -rf $liftOverFilesDir/org2_allExonsList.sorted.temp"); 446 | system("rm -rf $liftOverFilesDir/org1_allExonsList.sorted.temp"); 447 | system("rm -rf $liftOverFilesDir/org2_partCodingExonsList.sorted.temp"); 448 | system("rm -rf $liftOverFilesDir/org1_partCodingExonsList.sorted.temp"); 449 | system("rm -rf $liftOverFilesDir/org2_allCodingExonsList.sorted.temp"); 450 | system("rm -rf $liftOverFilesDir/org1_allCodingExonsList.sorted.temp"); 451 | } 452 | 453 | sub liftoverfilesprocess { 454 | 455 | my $indir = $_[0]; 456 | my $outdir = $_[1]; 457 | my $flank = $_[2]; 458 | my $extmapper_path = $_[3]; 459 | 460 | if (-e "oneHugeFile-2to1-partCoding.txt") { 461 | system("rm -rf oneHugeFile-2to1-partCoding.txt"); 462 | } 463 | if (-e "oneHugeFile-1to2-partCoding.txt") { 464 | system("rm -rf oneHugeFile-1to2-partCoding.txt"); 465 | } 466 | 467 | foreach my $minMatch (qw{1 0.95 0.9}) { 468 | $suffix="flank$flank-minMatch$minMatch-multiples-partCoding"; 469 | print ("Running : zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt\n"); 470 | print ("Started at ",ltime(),"\n"); 471 | system("zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt"); 472 | print ("Ended at ",ltime(),"\n"); 473 | 474 | print ("Running : zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt\n"); 475 | print ("Started at ",ltime(),"\n"); 476 | system("zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\".\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt"); 477 | print ("Started at ",ltime(),"\n"); 478 | } 479 | 480 | if (-e "oneHugeFile-2to1-others.txt") { 481 | system("rm -rf oneHugeFile-2to1-others.txt"); 482 | } 483 | if (-e "oneHugeFile-1to2-others.txt") { 484 | system("rm -rf oneHugeFile-1to2-others.txt"); 485 | } 486 | 487 | foreach my $minMatch (qw{1 0.95 0.9}) { 488 | $suffix="flank$flank-minMatch$minMatch-multiples"; 489 | print ("Running : zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-others.txt\n"); 490 | print ("Started at ",ltime(),"\n"); 491 | system("zcat $indir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-2to1-others.txt"); 492 | print ("Ended at ",ltime(),"\n"); 493 | 494 | print ("Running : zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' s=$suffix >> oneHugeFile-1to2-others.txt\n"); 495 | print ("Started at ",ltime(),"\n"); 496 | system("zcat $indir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed.gz |awk -v OFS='\\t' '\$6!=\"\.\"{print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$11,\$9,\$12,s}' 
s=$suffix >> oneHugeFile-1to2-others.txt"); 497 | print ("Ended at ",ltime(),"\n"); 498 | } 499 | 500 | print ("Running : cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k10,10 >oneHugeFile-1to2.txt.sorted\n"); 501 | print ("Started at ",ltime(),"\n"); 502 | system("cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k10,10 >oneHugeFile-1to2.txt.sorted"); 503 | print ("Ended at ",ltime(),"\n"); 504 | 505 | print ("Running : cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k10,10 >oneHugeFile-2to1.txt.sorted\n"); 506 | print ("Started at ",ltime(),"\n"); 507 | system("cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k10,10 >oneHugeFile-2to1.txt.sorted"); 508 | print ("Ended at ",ltime(),"\n"); 509 | 510 | system("mkdir -p $outdir/org1 $outdir/org2"); 511 | $whichCol=10; 512 | $fileSuffix="_mapped.txt"; 513 | 514 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix\n"); 515 | print ("Started at ",ltime(),"\n"); 516 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix"); 517 | print ("Ended at ",ltime(),"\n"); 518 | 519 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix\n"); 520 | print ("Started at ",ltime(),"\n"); 521 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix"); 522 | print ("Ended at ",ltime(),"\n"); 523 | 524 | print ("Removing temporary files\n"); 525 | system("rm -rf oneHugeFile*.txt"); 526 | 527 | } 528 | 529 | sub liftoverfilesprocessunmappedexons { 530 | 531 | my $indir = $_[0]; 532 | my $outdir = $_[1]; 533 | my $flank = $_[2]; 534 | my $extmapper_path = $_[3]; 535 | 536 | if (-e "oneHugeFile-2to1-partCoding.txt") { 537 | system("rm -rf oneHugeFile-2to1-partCoding.txt"); 538 | } 539 | if (-e "oneHugeFile-1to2-partCoding.txt") { 540 | system("rm -rf oneHugeFile-1to2-partCoding.txt"); 541 | } 542 | 543 | foreach my $minMatch (qw{1 0.95 0.9}) { 544 | $suffix="flank$flank-minMatch$minMatch-multiples-partCoding"; 545 | print ("Running : zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt\n"); 546 | print ("Started at ",ltime(),"\n"); 547 | system("zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-partCoding.txt"); 548 | print ("Ended at ",ltime(),"\n"); 549 | 550 | print ("Running : zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt\n"); 551 | print ("Started at ",ltime(),"\n"); 552 | system("zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk -v OFS='\\t' '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-partCoding.txt"); 553 | print ("Ended at ",ltime(),"\n"); 554 | } 555 | 556 | if (-e "oneHugeFile-2to1-others.txt") { 557 | system("rm -rf oneHugeFile-2to1-others.txt"); 558 | } 559 | if (-e "oneHugeFile-1to2-others.txt") { 560 | system("rm -rf oneHugeFile-1to2-others.txt"); 561 | } 562 | 563 | foreach my $minMatch (qw{1 0.95 0.9}) { 564 | 
$suffix="flank$flank-minMatch$minMatch-multiples"; 565 | print ("Running : zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt\n"); 566 | print ("Started at ",ltime(),"\n"); 567 | system("zcat $indir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt"); 568 | print ("Ended at ",ltime(),"\n"); 569 | 570 | print ("Running : zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt\n"); 571 | print ("Started at ",ltime(),"\n"); 572 | system("zcat $indir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt"); 573 | print ("Ended at ",ltime(),"\n"); 574 | } 575 | 576 | print ("Running : cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted\n"); 577 | print ("Started at ",ltime(),"\n"); 578 | system("cat oneHugeFile-1to2-partCoding.txt oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted"); 579 | print ("Ended at ",ltime(),"\n"); 580 | 581 | print ("Running : cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted\n"); 582 | print ("Started at ",ltime(),"\n"); 583 | system("cat oneHugeFile-2to1-partCoding.txt oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted"); 584 | print ("Ended at ",ltime(),"\n"); 585 | 586 | system("mkdir -p $outdir/org1 $outdir/org2"); 587 | $whichCol=5; 588 | $fileSuffix="_unmapped.txt"; 589 | 590 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix\n"); 591 | print ("Started at ",ltime(),"\n"); 592 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix"); 593 | print ("Ended at ",ltime(),"\n"); 594 | 595 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix\n"); 596 | print ("Started at ",ltime(),"\n"); 597 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix"); 598 | print ("Ended at ",ltime(),"\n"); 599 | } 600 | 601 | sub liftoverfilesprocessmappedexons { 602 | 603 | my $indir = $_[0]; 604 | my $outdir = $_[1]; 605 | my $flank = $_[2]; 606 | my $extmapper_path = $_[3]; 607 | 608 | if (-e "oneHugeFile-2to1-others.txt") { 609 | system("rm -rf oneHugeFile-2to1-others.txt"); 610 | } 611 | if (-e "oneHugeFile-1to2-others.txt") { 612 | system("rm -rf oneHugeFile-1to2-others.txt"); 613 | } 614 | 615 | foreach my $minMatch (qw{1 0.95 0.9}) { 616 | $suffix="flank$flank-minMatch$minMatch-multiples"; 617 | print ("Running : zcat $indir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt\n"); 618 | print ("Started at ",ltime(),"\n"); 619 | system("zcat $indir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-2to1-others.txt"); 620 | print ("Ended at ",ltime(),"\n"); 621 | 622 | print ("Running : zcat $indir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed.gz |awk '{print 
\$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt\n"); 623 | print ("Started at ",ltime(),"\n"); 624 | system("zcat $indir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed.gz |awk '{print \$1,\$2,\$3,\$6,\$4,\$5,s}' s=$suffix >> oneHugeFile-1to2-others.txt"); 625 | print ("Ended at ",ltime(),"\n"); 626 | } 627 | 628 | print ("Running : cat oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted\n"); 629 | print ("Started at ",ltime(),"\n"); 630 | system("cat oneHugeFile-1to2-others.txt | sort -k5,5 >oneHugeFile-1to2.txt.sorted"); 631 | print ("Ended at ",ltime(),"\n"); 632 | 633 | print ("Running : cat oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted\n"); 634 | print ("Started at ",ltime(),"\n"); 635 | system("cat oneHugeFile-2to1-others.txt | sort -k5,5 >oneHugeFile-2to1.txt.sorted"); 636 | print ("Ended at ",ltime(),"\n"); 637 | 638 | system("mkdir -p $outdir/org1 $outdir/org2"); 639 | $whichCol=5; 640 | $fileSuffix="_nonintersecting.txt"; 641 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix\n"); 642 | print ("Started at ",ltime(),"\n"); 643 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-1to2.txt.sorted $outdir/org1 $whichCol $fileSuffix"); 644 | print ("Ended at ",ltime(),"\n"); 645 | 646 | print ("Running : python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix\n"); 647 | print ("Started at ",ltime(),"\n"); 648 | system("python $extmapper_path/scripts/splitExonsIntoIndividualFiles.py oneHugeFile-2to1.txt.sorted $outdir/org2 $whichCol $fileSuffix"); 649 | print ("Ended at ",ltime(),"\n"); 650 | 651 | print ("Removing temporary files\n"); 652 | system("rm -rf oneHugeFile* dummy.txt"); 653 | } 654 | 655 | sub step { 656 | 657 | my $step = $_[0]; 658 | 659 | if ($step == 1 || $step eq "all" || $step eq "All" || $step eq "ALL") { 660 | 661 | print ("Running step 1:\n"); 662 | print ("Downloading per organism specific files and keep the original organism names for future reuse\n"); 663 | print ("Downloading the two reference genomes from UCSC and get rid of unknown, random and alt contigs\n"); 664 | 665 | system("mkdir -p $referenceGenomesDir"); 666 | downloadrefgenome($referenceGenomesDir, $ENV{'ref1'}); 667 | downloadrefgenome($referenceGenomesDir, $ENV{'ref2'}); 668 | 669 | system("mkdir -p $chainsDir"); 670 | downloadliftoverfiles($chainsDir, $ENV{'ref1'}, $ENV{'ref2'}); 671 | downloadliftoverfiles($chainsDir, $ENV{'ref2'}, $ENV{'ref1'}); 672 | 673 | system("mkdir -p $ensemblDir"); 674 | downloadensmblfiles($ensemblDir, $ENV{'releaseNo'}, $ENV{'org1EnsemblName'}, $ENV{'org1EnsemblMartName'}, $ENV{'org2EnsemblName'}, $ENV{'org2EnsemblMartName'}); 675 | print ("---------------------- Step 1 Finished ----------------------\n"); 676 | } 677 | 678 | if ($step == 2 || $step eq "all" || $step eq "All" || $step eq "ALL") { 679 | 680 | print ("Running step 2:\n"); 681 | print ("Initialize the genomedata archives with the genomes of org1 and org2\n"); 682 | print ("Make sure genomedata is installed first\n"); 683 | print ("Installation: pip install genomedata --user\n"); 684 | system("mkdir -p $genomedataDir"); 685 | 686 | genomedataarchive($genomedataDir, "org1", $ENV{'ref1'}, $referenceGenomesDir); 687 | genomedataarchive($genomedataDir, "org2", $ENV{'ref2'}, $referenceGenomesDir); 688 | print ("---------------------- Step 2 
Finished ----------------------\n"); 689 | } 690 | 691 | if ($step == 3 || $step eq "all" || $step eq "All" || $step eq "ALL") { 692 | print ("Running step 3:\n"); 693 | print ("Creating pickle files\n"); 694 | parseAndPicklePerPair($ENV{'EXTRAMAPPER_DIR'}, $ensemblDir, $dataDirPerPair, $GTFsummaryDir, $perGenePairPickleDir); 695 | print ("---------------------- Step 3 Finished ----------------------\n"); 696 | } 697 | 698 | if ($step == 4 || $step eq "all" || $step eq "All" || $step eq "ALL") { 699 | print ("Running step 4:\n"); 700 | print ("liftOver the exon lists but this time allow multiple mappings and also compute intersections with the other set of exons\n"); 701 | system("mkdir -p $liftOverFilesDir"); 702 | system("mkdir -p preprocess/bin"); 703 | if (!-e "./preprocess/bin/liftOver") { 704 | system("ln -s \$(readlink $ENV{liftOver}) ./preprocess/bin"); 705 | } 706 | liftoverexonmultiplemapping($GTFsummaryDir, $liftOverFilesDir, $chainsDir, $ENV{'ref1'}, $ENV{'ref2'}, $ENV{'EXTRAMAPPER_DIR'}); 707 | print ("---------------------- Step 4 Finished ----------------------\n"); 708 | } 709 | 710 | if ($step == 5 || $step eq "all" || $step eq "All" || $step eq "ALL") { 711 | print ("Running step 5:\n"); 712 | print ("Putting together, sorting, making them uniq and then splitting into one file per exon for all the liftover files created so far\n"); 713 | liftoverfilesprocess($liftOverFilesDir, $perExonLiftoverDir, 0, $ENV{'EXTRAMAPPER_DIR'}); 714 | print ("---------------------- Step 5 Finished ----------------------\n"); 715 | } 716 | 717 | if ($step == 6 || $step eq "all" || $step eq "All" || $step eq "ALL") { 718 | print ("Running step 6:\n"); 719 | print ("Putting together, sorting, making them uniq and then splitting into one file per exon for all the liftover files created for UNMAPPED EXONS so far\n"); 720 | liftoverfilesprocessunmappedexons($liftOverFilesDir, $perExonLiftoverDir, 0, $ENV{'EXTRAMAPPER_DIR'}); 721 | print ("---------------------- Step 6 Finished ----------------------\n"); 722 | } 723 | 724 | if ($step == 7 || $step eq "all" || $step eq "All" || $step eq "ALL") { 725 | print ("Runing step 7:\n"); 726 | print ("Putting together, sorting, making them uniq and then splitting into one file per exon for all the liftover files for MAPPED EXONS that DO NOT INTERSECT WITH AN EXON so far\n"); 727 | liftoverfilesprocessmappedexons($liftOverFilesDir, $perExonLiftoverDir, 0, $ENV{'EXTRAMAPPER_DIR'}); 728 | print ("---------------------- Step 7 Finished ----------------------\n"); 729 | print ("Preporcessing steps finished!\n"); 730 | } 731 | } 732 | 733 | step($step); 734 | -------------------------------------------------------------------------------- /Human-Mouse-Preprocess-Data/scripts/liftOver: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ay-lab/ExTraMapper/ff8bf6399e457c041e10ab8d94c83ae54414b273/Human-Mouse-Preprocess-Data/scripts/liftOver -------------------------------------------------------------------------------- /Human-Mouse-Preprocess-Data/scripts/liftover-withMultiples: -------------------------------------------------------------------------------- 1 | #!/bin/bash -ex 2 | set -o pipefail 3 | set -o errexit 4 | 5 | source config.conf 6 | 7 | 8 | dataDir=${EXTRAMAPPER_DIR}/preprocess/data 9 | dataDirPerPair=${EXTRAMAPPER_DIR}/preprocess/data/$org1-$org2 10 | 11 | chainsDir=$dataDir/liftover_chains 12 | ensemblDir=$dataDirPerPair/ensemblDownloads 13 | 14 | 
liftOverFilesDir=$dataDirPerPair/liftoverRelatedFiles 15 | perExonLiftoverDir=$dataDirPerPair/perExonLiftoverCoords 16 | 17 | outdir=$liftOverFilesDir 18 | flank=$1 19 | minMatch=$2 20 | 21 | chain1to2=$3 22 | chain2to1=$4 23 | 24 | mkdir -p $ensemblDir 25 | 26 | GTFfile1=$ensemblDir/org1.gtf.gz 27 | GTFfile2=$ensemblDir/org2.gtf.gz 28 | org1to2homologFile=$ensemblDir/org1_homolog_org2.txt.gz 29 | org2to1homologFile=$ensemblDir/org2_homolog_org1.txt.gz 30 | refGDdir1=$ensemblDir/org1 # genomedata archive for org1 31 | refGDdir2=$ensemblDir/org2 # genomedata archive for org2 32 | 33 | ########################## need to add 1 to liftedOver coordinates to match UCSC coordinates ################### 34 | ############## HOWEVER, this is only correct if original/lifted strands are same -/- or +/+ #################### 35 | ############## THEREFORE, I account manually for this by checking the strand pairs #################### 36 | 37 | ############## ALSO, liftOver does not CHANGE the strand of original coordinates when used ###################### 38 | ############# without the -multiple option and it DOES with -multiple. ######################### 39 | ############ HENCE, I handle these two cases differently. ######################## 40 | 41 | ## OLDER AND INCORRECT WAY #1 ########################################################################################################### 42 | # zcat $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed.gz | awk '{print $1"\t"$2+1"\t"$3+1"\t"$4"\t"$5}' \ 43 | # > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp 44 | # 45 | # zcat $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed.gz | awk '{print $1"\t"$2+1"\t"$3+1"\t"$4"\t"$5}' \ 46 | # > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 47 | # 48 | #rm -rf $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 49 | ############################################################################################################################################### 50 | 51 | 52 | # 53 | # first work on the partCoding exons 54 | suffix=flank$flank-minMatch$minMatch-multiples-partCoding 55 | 56 | # fourth fields stays the same, fifth is replaced by multiplicity, sixth will be the new strand after liftover 57 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org1_partCodingExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 58 | $chain1to2 org2_mapped-$suffix.bed org2_unmapped-$suffix.bed -minMatch=$minMatch -multiple 59 | 60 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org2_partCodingExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 61 | $chain2to1 org1_mapped-$suffix.bed org1_unmapped-$suffix.bed -minMatch=$minMatch -multiple 62 | 63 | # chr, start, end, exonId, Multiplicity, strand (after conversion) 64 | cat org1_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed 65 | cat org2_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 66 | 67 | # chr, start, end, exonId, Why unmapped, strand (before conversion) 68 | cat org1_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 69 | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed 70 | cat org2_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 71 | sort -k1,1 -k2,2n > 
$outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 72 | 73 | rm -rf org1_mapped-$suffix.bed org2_mapped-$suffix.bed org1_unmapped-$suffix.bed org2_unmapped-$suffix.bed 74 | 75 | # take the intersections 76 | ## NEW AND CORRECT WAY - FOR ONLY liftOver with -multiple OPTION ########################################################################### 77 | cat $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp 78 | join $outdir/org2_partCodingExonsList.sorted.temp mapped.temp | \ 79 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \ 80 | | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp 81 | bedtools intersect -a $outdir/org1_allCodingExonsList.bed \ 82 | -b $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -wao \ 83 | > $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed 84 | 85 | bedtools intersect -b $outdir/org1_allCodingExonsList.bed \ 86 | -a $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -v \ 87 | > $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed 88 | 89 | cat $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp 90 | join $outdir/org1_partCodingExonsList.sorted.temp mapped.temp | \ 91 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \ 92 | | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 93 | 94 | bedtools intersect -a $outdir/org2_allCodingExonsList.bed \ 95 | -b $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -wao \ 96 | > $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed 97 | 98 | bedtools intersect -b $outdir/org2_allCodingExonsList.bed \ 99 | -a $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -v \ 100 | > $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed 101 | ############################################################################################################################################### 102 | 103 | rm -rf $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp mapped.temp 104 | 105 | gzip $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 106 | gzip $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 107 | gzip $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed 108 | gzip $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed 109 | 110 | # 111 | # now work on all exons including the partCoding, nonCoding and fullCoding ones 112 | 113 | suffix=flank$flank-minMatch$minMatch-multiples 114 | 115 | # fourth fields stays the same, fifth is replaced by multiplicity, sixth will be the new strand after liftover 116 | #liftOver <(cat $outdir/org1_allExonsList.bed | awk '{if ($4=="+") print $1,$2-s,$3+s,$5,$4,$4; else print $1,$2-s,$3+s,$5,$4,$4;}' s=$flank) \ 117 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver <(cat $outdir/org1_allExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 118 | $chain1to2 org2_mapped-$suffix.bed org2_unmapped-$suffix.bed -minMatch=$minMatch -multiple 119 | 120 | ${EXTRAMAPPER_DIR}/preprocess/bin/liftOver 
<(cat $outdir/org2_allExonsList.bed | awk '{print $1,$2-s,$3+s,$5,$4,$4}' s=$flank) \ 121 | $chain2to1 org1_mapped-$suffix.bed org1_unmapped-$suffix.bed -minMatch=$minMatch -multiple 122 | 123 | cat org1_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed 124 | cat org2_mapped-$suffix.bed | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 125 | #cat org1_unmapped-$suffix.bed | awk 'NR%2==1' | sort | uniq - 126 | #cat org2_unmapped-$suffix.bed | awk 'NR%2==1' | sort | uniq - 127 | 128 | cat org1_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 129 | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed 130 | cat org2_unmapped-$suffix.bed | awk '{l1=$1; getline; printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,l1,$5)}' |\ 131 | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 132 | 133 | rm -rf org1_mapped-$suffix.bed org2_mapped-$suffix.bed org1_unmapped-$suffix.bed org2_unmapped-$suffix.bed 134 | 135 | # take the intersections 136 | ## NEW AND CORRECT WAY - FOR ONLY liftOver with -multiple OPTION ########################################################################### 137 | # This correction in coordinates leads to some exons with that doesn't have any file mapped, unmapped, nonintersecting. ## 138 | # There is only 2 such exons and they be deemed unmapped (i.e., deleted from the second organism) # 139 | cat $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp 140 | join $outdir/org2_allExonsList.sorted.temp mapped.temp | \ 141 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \ 142 | | sort -k1,1 -k2,2n > $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp 143 | bedtools intersect -a $outdir/org1_allExonsList.bed \ 144 | -b $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -wao \ 145 | > $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed 146 | bedtools intersect -b $outdir/org1_allExonsList.bed \ 147 | -a $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp -sorted -v \ 148 | > $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed 149 | 150 | cat $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed | awk '{print $4,$0}' | sort -k1,1 > mapped.temp 151 | join $outdir/org1_allExonsList.sorted.temp mapped.temp | \ 152 | awk '{s=$8; e=$9; if ($5!=$12) {s=s+1; e=e+1;}; print $7"\t"s"\t"e"\t"$10"\t"$11"\t"$12}' \ 153 | | sort -k1,1 -k2,2n > $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp 154 | 155 | bedtools intersect -a $outdir/org2_allExonsList.bed \ 156 | -b $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -wao \ 157 | > $outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed 158 | 159 | bedtools intersect -b $outdir/org2_allExonsList.bed \ 160 | -a $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp -sorted -v \ 161 | > $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed 162 | 163 | rm -rf $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.temp $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.temp mapped.temp 164 | 165 | gzip $outdir/org2_to_org1_liftOver_mappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_mappedExonsList-$suffix.bed 166 | gzip $outdir/org2_to_org1_liftOver_unmappedExonsList-$suffix.bed $outdir/org1_to_org2_liftOver_unmappedExonsList-$suffix.bed 167 | gzip 
$outdir/org2_VS_org1_to_org2_intersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_intersectingExonsList-$suffix.bed 168 | gzip $outdir/org2_VS_org1_to_org2_nonintersectingExonsList-$suffix.bed $outdir/org1_VS_org2_to_org1_nonintersectingExonsList-$suffix.bed 169 | 170 | ############################################################################################################################################### 171 | 172 | -------------------------------------------------------------------------------- /Human-Mouse-Preprocess-Data/scripts/parseAndPicklePerPair.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ############################################################################## 3 | ### To use the functions in this lib simply import this python module using 4 | ### import ensemblUtils 5 | ### Then you'll able able to call functions with the proper arguments using 6 | ### returnVal=ensemblUtils.func1(arg1,arg2) 7 | ############################################################################## 8 | ############################################################################## 9 | import sys 10 | import os 11 | import string 12 | import math 13 | import gzip 14 | import _pickle as pickle 15 | 16 | # reads from exported environment variable 17 | ExTraMapperPath=os.environ['EXTRAMAPPER_DIR'] 18 | sys.path.append(ExTraMapperPath+"/scripts") 19 | from ensemblUtils import * 20 | 21 | # Testing functionalities 22 | def main(argv): 23 | indir=argv[1] 24 | orgId1="org1"; orgId2="org2"; 25 | refGD1=indir+"/genomedataArchives/org1" 26 | refGD2=indir+"/genomedataArchives/org2" 27 | 28 | 29 | # outdir="GTFsummaries"; 30 | if len(argv)==2: 31 | return 32 | 33 | outdir=argv[2] 34 | os.system("mkdir -p "+outdir) 35 | 36 | #infilename=indir+"/ensemblDownloads/org1.gtf.gz" 37 | infilename=indir+"/ensemblDownloads/org1.gtf" ## Abhijit 38 | geneDic1,transcriptDic1,exonDic1,infoDic1=parse_organism_GTF(orgId1, infilename, outdir) 39 | 40 | #infilename=indir+"/ensemblDownloads/org2.gtf.gz" 41 | infilename=indir+"/ensemblDownloads/org2.gtf" ## Abhijit 42 | geneDic2,transcriptDic2,exonDic2,infoDic2=parse_organism_GTF(orgId2, infilename, outdir) 43 | 44 | ## these two files were downloaded by hand selecting columns from Ensembl's Biomart 45 | ## I weren't able to redo the same column selections recently so I decided to switch to 46 | ## parsing the orthology information from readily available Ensembl files like below ones: 47 | ## ftp://ftp.ensembl.org/pub/release-80/mysql/ensembl_mart_80/ 48 | ## hsapiens_gene_ensembl__homolog_mmus__dm.txt.gz 49 | #infilename="/projects/b1017/shared/Ensembl-files/Ensembl-human-GRCh38-to-mouse-GRCm38.p3.txt.gz" 50 | #genePairsHumanToMouse=parse_ensembl_gene_pairings(infilename) 51 | #infilename="/projects/b1017/shared/Ensembl-files/Ensembl-mouse-GRCm38.p3-to-human-GRCh38.txt.gz" 52 | #genePairsMouseToHuman=parse_ensembl_gene_pairings(infilename) 53 | #consistency_check(genePairsHumanToMouse,genePairsMouseToHuman) 54 | ## if consistency check is ok then just use one side. This is OK for one2one mappings. 
55 | #genePairsDic=genePairsHumanToMouse 56 | #pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir) 57 | 58 | #infilename=indir+"/ensemblDownloads/org1_homolog_org2.txt.gz" 59 | infilename=indir+"/ensemblDownloads/org1_homolog_org2.txt" ## Abhijit 60 | proteinToGeneDic,genePairsDic,proteinPairsDic=parse_ensembl_geneAndProtein_pairings(infilename,{},{}) 61 | print (["1",len(proteinToGeneDic),len(genePairsDic),len(proteinPairsDic)]) 62 | 63 | #infilename=indir+"/ensemblDownloads/org2_homolog_org1.txt.gz" 64 | infilename=indir+"/ensemblDownloads/org2_homolog_org1.txt" ## Abhijit 65 | proteinToGeneDic,genePairsDic,proteinPairsDic=parse_ensembl_geneAndProtein_pairings(infilename,proteinToGeneDic,proteinPairsDic) 66 | print (["2",len(proteinToGeneDic),len(genePairsDic),len(proteinPairsDic)]) 67 | 68 | 69 | exonDic1=assign_firstMidLast_exon_counts(exonDic1,transcriptDic1) 70 | exonDic2=assign_firstMidLast_exon_counts(exonDic2,transcriptDic2) 71 | 72 | typ="allExonPlusMinus" 73 | outfilename="None" 74 | fivePrimeFlank=12; threePrimeFlank=12 75 | 76 | ###### Not required ###### 77 | #exonDic1=extract_conservation_stats_for_exons(refGD1,exonDic1,typ,fivePrimeFlank,threePrimeFlank,outfilename) 78 | #exonDic2=extract_conservation_stats_for_exons(refGD2,exonDic2,typ,fivePrimeFlank,threePrimeFlank,outfilename) 79 | ###### 80 | 81 | outdir=argv[2] # overwrite previous summaries 82 | os.system("mkdir -p "+outdir) 83 | print_some_summary(orgId1, geneDic1,transcriptDic1,exonDic1,{}, outdir) 84 | print_some_summary(orgId2, geneDic2,transcriptDic2,exonDic2,{}, outdir) 85 | 86 | # outdir="perGenePairExonLists" 87 | if len(argv)==3: 88 | return 89 | 90 | outdir=argv[3] 91 | os.system("mkdir -p "+outdir) 92 | 93 | outfilename=outdir+"/genePairsSummary-one2one.txt" 94 | print_one2one_genePairs(genePairsDic,geneDic1,geneDic2,outfilename) # either way is ok since one2one 95 | 96 | geneOnlyOrthoDic1,transcriptOnlyOrthoDic1,exonOnlyOrthoDic1, geneOnlyOrthoDic2,transcriptOnlyOrthoDic2,exonOnlyOrthoDic2=pickle_one2one_genePairs_allInfo(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,transcriptDic1,transcriptDic2,outdir) 97 | 98 | print ([len(geneDic1), len(geneDic2)]) 99 | print (len(geneOnlyOrthoDic1)) 100 | print (len(geneOnlyOrthoDic2)) 101 | 102 | outdir=argv[2]+"/onlyOrthologAndCodingGenes" 103 | os.system("mkdir -p "+outdir) 104 | print (outdir) 105 | print_some_summary(orgId1, geneOnlyOrthoDic1,transcriptOnlyOrthoDic1,exonOnlyOrthoDic1,{}, outdir) 106 | print_some_summary(orgId2, geneOnlyOrthoDic2,transcriptOnlyOrthoDic2,exonOnlyOrthoDic2,{}, outdir) 107 | 108 | # 109 | # print_one2one_exonListPairs(genePairsDic,geneDic1,geneDic2,exonDic1,exonDic2,orgId1,orgId2,outdir) 110 | # print_one2one_transcriptListPairs(genePairsDic,geneDic1,geneDic2,transcriptDic1,transcriptDic2,orgId1,orgId2,outdir) 111 | 112 | return 113 | 114 | if __name__ == "__main__": 115 | main(sys.argv) 116 | 117 | -------------------------------------------------------------------------------- /Human-Mouse-Preprocess-Data/scripts/splitExonsIntoIndividualFiles.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | 4 | def main(argv): 5 | infilename=argv[1] 6 | outdir=argv[2] 7 | whichCol=int(argv[3])-1 8 | fileSuffix=argv[4] 9 | infile=open(infilename,'r') 10 | lastExon="dummy" 11 | outfile=open("dummy.txt",'w') 12 | for line in infile: 13 | newExon=line.rstrip().split()[whichCol] # where exon name is 14 | 
if newExon!=lastExon: 15 | outfile.close() 16 | outfile=open(outdir+"/"+newExon+fileSuffix,'w') 17 | # 18 | outfile.write(line) 19 | lastExon=newExon 20 | # 21 | outfile.close() 22 | return 23 | 24 | if __name__ == "__main__": 25 | main(sys.argv) 26 | # 27 | 28 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 ay-lab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ExTraMapper 2 | ExTraMapper is a tool to find Exon and Transcript-level Mappings of a given pair of orthologous genes between two organisms using sequence conservation. The figure below shows the overall schematic description of ExTraMapper mapping the homologous transcript and exon pairs between the human and mouse genomes. 3 | 4 | 5 | ![ExTraMapper_Figure](https://user-images.githubusercontent.com/18036388/90572310-8b693e00-e168-11ea-9fbc-8188c2834de9.jpg) 6 | 7 | # Steps to run ExTraMapper (for Python 3 or later) 8 | 9 | ### Step 1: Prepare the input files 10 | ExTraMapper requires a set of preprocessed files to find the conservation scores. Examples to create these files are provided within the following folders: 11 | 12 | 1. [__Human-Mouse-Preprocessed-Data__](https://github.com/ay-lab/ExTraMapper/tree/master/Human-Mouse-Preprocess-Data) 13 | 14 | and 15 | 16 | 2. [__Human-Rhesus_macaque-Preprocessed-Data__](https://github.com/ay-lab/ExTraMapper/tree/master/Human-Monkey-Processed-Data) 17 | 18 | ### Steps to generate the input files 19 | Users should run the _extMpreprocess_ script within the above Preprocessed-Data folders to generate the input files. All the input files will be generated under the _preprocess/data_ folder. All the required executables and scripts are provided there. The _extMpreprocess_ script has 7 individual steps and should be run in the following manner: 20 | 21 | - ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) For help, type<br>
22 | 23 | ```bash 24 | ./extMpreprocess help 25 | 26 | This script will download and preprocess the dataset required for exon-pair and transcript pair finding by ExTraMapper. 27 | Type ./extMpreprocess to execute the script. 28 | Type ./extMpreprocess example to print an example config.conf file. 29 | 30 | This script will run seven (7) sequential steps to create the inputs for ExTraMapper program. 31 | Users can provide step numbers (1-7) or all in the argument of this script. 32 | Short description of the individual scripts: 33 | Step 1: Download per organism specific files e.g. reference genomes, gene annotation files. 34 | Step 2: Will create genomedata archives with the genomes of org1 and org2 (Make sure to install genomedata package). 35 | Step 3: Pickle files for each homologous gene pair will be created. 36 | Step 4: Perform coordinate liftOver of exons with multiple mappings (This step requires bedtools and liftOver executables). 37 | Step 5-7: postprocessing the liftOver files. 38 | 39 | example: 40 | 41 | ./extMpreprocess config.human-mouse.conf all 42 | ``` 43 | <br>
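For example, if you prefer to run the preprocessing in stages (say, on a cluster where the liftOver step is scheduled separately), the help text above says each step can be requested by its number; a minimal sketch using the same config file:

```bash
# Run the seven preprocessing steps one at a time with the same config file;
# passing "all" instead of a number runs the whole pipeline in one call.
for step in 1 2 3 4 5 6 7; do
    ./extMpreprocess config.human-mouse.conf $step
done
```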
44 |
45 | 46 | ### Step 2: Set the following path 47 | ```bash export EXTRAMAPPER_DIR=/path/to/this/folder``` 48 | 49 |
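Every downstream script reads this variable (for example `parseAndPicklePerPair.py` calls `os.environ['EXTRAMAPPER_DIR']`), so it is worth verifying it before running anything. A quick sanity check, assuming the variable should point at the root of your local clone of this repository (the path below is a placeholder):

```bash
# EXTRAMAPPER_DIR must be set in the shell that launches the scripts
export EXTRAMAPPER_DIR=/path/to/ExTraMapper
echo "$EXTRAMAPPER_DIR"
ls "$EXTRAMAPPER_DIR/ExTraMapper.py"   # should list the main script if the path is correct
```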
50 | 51 | ### Step 3: Run ExTraMapper individually 52 | ```bash 53 | $ python ExTraMapper.py -h 54 | usage: ExTraMapper.py [-h] -m MAPPING -o1 ORG1 -o2 ORG2 -p ORTHOLOG 55 | 56 | Check the help flag 57 | 58 | optional arguments: 59 | -h, --help show this help message and exit 60 | -m MAPPING ExTraMapper Exon threshold value [e.g. 1] 61 | -o1 ORG1 First organism name [e.g. human] 62 | -o2 ORG2 Second organism name [e.g. mouse] 63 | -p ORTHOLOG Orthologous gene pair [e.g. ENSG00000141510-ENSMUSG00000059552 OR all] 64 | ``` 65 | 66 | #### Example run of ExTraMapper.py using orthologous gene pair ENSG00000141510-ENSMUSG00000059552 67 | ```bash 68 | $ python ExTraMapper.py -m 1 -o1 human -o2 mouse -p ENSG00000141510-ENSMUSG00000059552 69 | 70 | Finding exon mappings for gene pair number 0 ENSG00000141510-ENSMUSG00000059552 71 | ***************************************************************** 72 | Gene pair ID: ENSG00000141510-ENSMUSG00000059552 73 | 74 | Information about each gene. Last two numbers are no of transcripts and exons 75 | ENSG00000141510 chr17 7661779 7687538 - ENSG00000141510 TP53 protein_coding 27 49 gene 76 | ENSMUSG00000059552 chr11 69580359 69591873 + ENSMUSG00000059552 Trp53 protein_coding 6 24 gene 77 | 78 | Number of exons before and after duplicate removal according to coordinates 79 | Org1 49 40 80 | Org2 24 20 81 | 82 | ***************************************************************** 83 | 84 | ***************************************************************** 85 | GCGCTGGGGACCTGTCCCTAGGGGGCAGATGAGACACTGATGGGCGTACTTAGAGATTTGCCATGAAGTGGGTTTGAAGAATGGAGCTGTGTGTGAAAT 86 | Exon file type summaries for the first gene from: ENSG00000141510-ENSMUSG00000059552 87 | 0 exons with: No file exists 88 | 22 exons with: Only Mapped 89 | 0 exons with: Only nonintersecting 90 | 11 exons with: Only unmapped 91 | 15 exons with: Mapped and unmapped 92 | 0 exons with: Mapped and nonintersecting 93 | 1 exons with: Nonintersecting and unmapped 94 | 0 exons with: All three files 95 | Exon file type summaries for the second gene from: ENSG00000141510-ENSMUSG00000059552 96 | 0 exons with: No file exists 97 | 14 exons with: Only Mapped 98 | 0 exons with: Only nonintersecting 99 | 3 exons with: Only unmapped 100 | 7 exons with: Mapped and unmapped 101 | 0 exons with: Mapped and nonintersecting 102 | 0 exons with: Nonintersecting and unmapped 103 | 0 exons with: All three files 104 | Writing exon-level similarity scores into file: 105 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/exonLevelSimilarities-1.0.txt 106 | 107 | Writing exon classes into file: 108 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/exonClasses-1.0.txt 109 | For org1: Mapped exons= 17, Unmapped exons= 21, Nonintersecting exons= 1, OTHER= 10 110 | For org2: Mapped exons= 13, Unmapped exons= 7, Nonintersecting exons= 0, OTHER= 4 111 | ***************************************************************** 112 | 113 | ***************************************************************** 114 | Writing exon-level mappings into file: 115 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/exonLevelMappings-1.0.txt 116 | Writing trascript-level similarity scores into file: 117 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/transcriptLevelSimilarities-1.0.txt 118 | Writing transcript-level mappings into file: 119 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/transcriptLevelMappings-1.0.txt 120 | 121 | Condition counter from the greedy transcript mapping stage: 122 | 5 pairs 
with Condition1: Unique winner pair 123 | 0 pairs with Condition2: Tie in one score, not in the other 124 | 0 pairs with Condition3: Tie in both scores but coding exon length diff breaks the tie 125 | 0 pairs with Condition4: Tie in both scores and coding exon length diff but overall exon length breaks the tie 126 | 1 pairs with Condition5: Tie in all the above but coding length (bp) diff breaks the tie 127 | 0 pairs with Condition6: Tie in all the above, just give up and report all 128 | 129 | Writing UCSC browser bed output for org1 into file: 130 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/org1-ucsc-1.0.bed 131 | Writing UCSC browser bed output for org2 into file: 132 | /path/output/human-mouse/ENSG00000141510-ENSMUSG00000059552/org2-ucsc-1.0.bed 133 | 134 | ........ 135 | ExTraMapper ran successfully for 1 gene pairs between: human and mouse 136 | 137 | 138 | ***************************************************************** 139 | $ tree ./output 140 | 141 | ./output 142 | `-- human-mouse 143 | `-- ENSG00000141510-ENSMUSG00000059552 144 | |-- exonClasses-1.0.txt 145 | |-- exonLevelMappings-1.0.txt 146 | |-- exonLevelSimilarities-1.0.txt 147 | |-- org1-ucsc-1.0.bed 148 | |-- org2-ucsc-1.0.bed 149 | |-- transcriptLevelMappings-1.0.txt 150 | `-- transcriptLevelSimilarities-1.0.txt 151 | ``` 152 | 153 | Note: The __exonLevelMappings-1.0.txt__ & __transcriptLevelMappings-1.0.txt__ file contains the mapped exon and transcript pairs from __ENSG00000141510-ENSMUSG00000059552__ orthologous gene-pair. 154 | 155 |
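If you only want a quick look at one pair's transcript mappings without running the summariser described below, the mapping files are whitespace-delimited; judging from the fields that _extMsummarise_ extracts (transcript IDs in columns 5 and 11, the two similarity scores in columns 19 and 20, 1-based), a sketch would be:

```bash
# Print transcript ID pairs with their overall and coding similarity scores
# (column positions inferred from how extMsummarise parses this file; NR>1 skips the header)
awk 'NR>1 {print $5, $11, $19, $20}' \
    ./output/human-mouse/ENSG00000141510-ENSMUSG00000059552/transcriptLevelMappings-1.0.txt
```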
156 | 157 | # OR 158 | 159 | ### Step 3: Run ExTraMapper for all the gene pairs 160 | ```bash 161 | $ python ExTraMapper.py -h 162 | usage: ExTraMapper.py [-h] -m MAPPING -o1 ORG1 -o2 ORG2 -p all 163 | ``` 164 | 165 |
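The same command maps every orthologous gene pair when `-p all` is given; for the human-mouse setup used above this is simply:

```bash
# Map all orthologous gene pairs at once; per-pair results are written
# under ./output/human-mouse/ exactly as in the single-pair example above
python ExTraMapper.py -m 1 -o1 human -o2 mouse -p all
```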
166 | 167 | ### Summarise the ExTraMapper results ### 168 | Run the _extMsummarise_ script to generate a concatenated file with all the results. Run the script in the following manner: 169 | ```bash 170 | $ ./extMsummarise help 171 | Type ./extMsummarise <preprocess_folder> <extramapper_folder> <orthologous_genepair_list> <org1name> <org2name> <outputprefix> 172 | preprocess_folder : Path to the preprocess folder generated by the extMpreprocess script 173 | extramapper_folder : Path to the output folder generated by the ExTraMapper program 174 | orthologous_genepair_list : A list of orthologous gene-pairs 175 | org1name : org1 name e.g. human 176 | org2name : org2 name e.g. mouse 177 | outputprefix : output file prefix 178 | 179 | example : 180 | ./extMsummarise ./preprocess ./output gene-pair.list human mouse extramapper-result 181 | ``` 182 | <br>
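The `orthologous_genepair_list` is expected to contain one gene-pair identifier per line (e.g. `ENSG00000141510-ENSMUSG00000059552`), matching the per-pair folder names that ExTraMapper writes under its output directory. One simple way to build it and run the summariser, assuming the default `./output` location used above:

```bash
# Build the pair list from the per-pair output folders, then summarise everything
ls ./output/human-mouse > gene-pair.list
./extMsummarise ./preprocess ./output gene-pair.list human mouse extramapper-result
```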
183 | 184 | 185 | # Precomputed Results 186 | 187 | Check the [Result/Exon-Pairs](https://github.com/ay-lab/ExTraMapper/tree/master/Result/Exon-Pairs) and [Result/Transcript-Pairs](https://github.com/ay-lab/ExTraMapper/tree/master/Result/Transcript-Pairs) folders to download the precomputed ExTraMapper results for human-mouse and human-rhesus orthologous exon and transcript pairs. 188 | 189 | <br>
190 | 191 | ### Cite this work 192 | [_ExTraMapper: Exon- and Transcript-level mappings for orthologous gene pairs._](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab393/6278896?redirectedFrom=fulltext) 193 | 194 | __Chakraborty A, Ay F, Davuluri RV. ExTraMapper: Exon- and Transcript-level mappings for orthologous gene pairs. Bioinformatics. 2021 May 20:btab393. doi: 10.1093/bioinformatics/btab393. Epub ahead of print. PMID: 34014317.__ 195 | 196 | The analyses shown in the above paper were performed using the Human & Mouse ENSEMBL release 81 annotations with Python 2.7 code. 197 | The current update uses ENSEMBL release 102 and Python 3 or later. To see the older code and data, please 198 | switch the __Branch__ from __master__ to [__ExTraMapper-python2v__](https://github.com/ay-lab/ExTraMapper/tree/ExTraMapper-python2v). 199 | 200 | ### Check the webserver for a nice visualization 201 | https://ay-lab-tools.lji.org/extramapper/index.html 202 | -------------------------------------------------------------------------------- /Result/Exon-Pairs/README.md: -------------------------------------------------------------------------------- 1 | ## Download the exon pair files 2 | 3 | [ExTraMapper_Exon_Mapping_ENSMBL102_Human_vs_Monkey](https://drive.google.com/file/d/1L9Ef7vYr9R66xW-zz4wfVU4moVDCVddX/view?usp=sharing) 4 | 5 | (This file contains Human and monkey (Rhesus macaque) orthologous exon pairs from ExTraMapper. The results were generated using ENSEMBL version 102) 6 | 7 | [ExTraMapper_Exon_Mapping_ENSMBL102_Genome_Build_Human_vs_Mouse](https://drive.google.com/file/d/1vpJCW5hmNDmWdmGn6cxDWHvyC7oiCRFy/view?usp=sharing) 8 | 9 | (This file contains Human and mouse orthologous exon pairs from ExTraMapper. The results were generated using ENSEMBL version 102) 10 | 11 | [ExTraMapper_Exon_Mapping_ENSMBL81_Genome_Build_Human_vs_Mouse](https://drive.google.com/file/d/1eeJ9_ck6-WKMox2Kw1A4VT43z3IDEJYU/view?usp=sharing) 12 | 13 | (This file contains Human and mouse orthologous exon pairs from ExTraMapper. 
The results were generated using ENSEMBL version 81) 14 | -------------------------------------------------------------------------------- /Result/Transcript-Pairs/ExTraMapper_Transcript_Mapping_ENSMBL102_Genome_Build_Human_vs_Mouse.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ay-lab/ExTraMapper/ff8bf6399e457c041e10ab8d94c83ae54414b273/Result/Transcript-Pairs/ExTraMapper_Transcript_Mapping_ENSMBL102_Genome_Build_Human_vs_Mouse.xlsx -------------------------------------------------------------------------------- /Result/Transcript-Pairs/ExTraMapper_Transcript_Mapping_ENSMBL102_Human_vs_Monkey.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ay-lab/ExTraMapper/ff8bf6399e457c041e10ab8d94c83ae54414b273/Result/Transcript-Pairs/ExTraMapper_Transcript_Mapping_ENSMBL102_Human_vs_Monkey.xlsx -------------------------------------------------------------------------------- /Result/Transcript-Pairs/ExTraMapper_Transcript_Mapping_ENSMBL81_Genome_Build_Human_vs_Mouse.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ay-lab/ExTraMapper/ff8bf6399e457c041e10ab8d94c83ae54414b273/Result/Transcript-Pairs/ExTraMapper_Transcript_Mapping_ENSMBL81_Genome_Build_Human_vs_Mouse.xlsx -------------------------------------------------------------------------------- /Result/Transcript-Pairs/README.md: -------------------------------------------------------------------------------- 1 | ## File Description 2 | 3 | #### 1. ExTraMapper_Transcript_Mapping_ENSMBL102_Genome_Build_Human_vs_Mouse.xlsx : 4 | 5 | This file contains Human and mouse orthologous transcript pairs from ExTraMapper. The results were generated using ENSEMBL version 102. 6 | 7 | #### 2. ExTraMapper_Transcript_Mapping_ENSMBL81_Genome_Build_Human_vs_Mouse.xlsx : 8 | 9 | This file contains Human and mouse orthologous transcript pairs from ExTraMapper. The results were generated using ENSEMBL version 81. 10 | 11 | #### 3. ExTraMapper_Transcript_Mapping_ENSMBL102_Human_vs_Monkey.xlsx : 12 | 13 | This file contains Human and monkey (Rhesus macaque) orthologous transcript pairs from ExTraMapper. The results were generated using ENSEMBL version 102. 14 | -------------------------------------------------------------------------------- /extMsummarise: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | sub generatefiles { 4 | 5 | my $htgtf = $_[0]; 6 | my $mtgtf = $_[1]; 7 | my $ogene = $_[2]; 8 | my $etmfl = $_[3]; 9 | my $mping = $_[4]; 10 | my $outpt = $_[5]; 11 | my %geneid; 12 | 13 | open(hgtf_in, $htgtf); 14 | while (my $line = <hgtf_in>) { 15 | chomp $line; 16 | if ($. > 1) { 17 | my $tname = (split(/\s+/,$line))[4]; 18 | my $tgene = (split(/\s+/,$line))[5]; 19 | my @exons = split(/,/,(split(/\s+/,$line))[9]); 20 | $geneid{'h'}{'t'}{$tname} = $tgene; 21 | foreach my $e (@exons) { 22 | chomp $e; 23 | if ($geneid{'h'}{'e'}{$e} eq "") { 24 | $geneid{'h'}{'e'}{$e} = $tgene; 25 | } else { 26 | $geneid{'h'}{'e'}{$e} = "$geneid{'h'}{'e'}{$e},$tgene"; 27 | } 28 | } 29 | undef @exons; 30 | } 31 | } 32 | close(hgtf_in); 33 | 34 | open(mgtf_in, $mtgtf); 35 | while (my $line = <mgtf_in>) { 36 | chomp $line; 37 | if ($. 
> 1) { 38 | my $tname = (split(/\s+/,$line))[4]; 39 | my $tgene = (split(/\s+/,$line))[5]; 40 | my @exons = split(/,/,(split(/\s+/,$line))[9]); 41 | $geneid{'m'}{'t'}{$tname} = $tgene; 42 | foreach my $e (@exons) { 43 | chomp $e; 44 | if ($geneid{'m'}{'e'}{$e} eq "") { 45 | $geneid{'m'}{'e'}{$e} = $tgene; 46 | } else { 47 | $geneid{'m'}{'e'}{$e} = "$geneid{'m'}{'e'}{$e},$tgene"; 48 | } 49 | } 50 | undef @exons; 51 | } 52 | } 53 | close(mgtf_in); 54 | 55 | open(out_tpair,">$outpt.transcriptLevelMappings-$mping.txt"); 56 | open(out_epair,">$outpt.exonLevelMappings-$mping.txt"); 57 | print out_tpair ("chrName1\tstartCoord1\tendCoord1\tstrand1\tchrName2\tstartCoord2\tendCoord2\tstrand2\ttranscriptID1\ttranscriptID2\ttranscriptName1\ttranscriptName2\ttranscriptType1\ttranscriptType2\toverallSimScore\tcodingSimScore\tortholog\n"); 58 | print out_epair ("chrName1\tstartCoord1\tendCoord1\tstrand1\tchrName2\tstartCoord2\tendCoord2\tstrand2\texonID1\texonID2\texonName1\texonName2\texonType1\texonType2\toverlapScoreFromFullLength\toverlapScoreFromPartialCodingPart\tortholog\n"); 59 | open(ogene_in, $ogene); 60 | while (my $line = <ogene_in>) { 61 | chomp $line; 62 | open (extresult_trans_in, "$etmfl/$line/transcriptLevelMappings-$mping.txt"); 63 | while (my $r = <extresult_trans_in>) { 64 | if ($. > 1) { 65 | my $chrName1 = (split(/\s+/, $r))[0]; 66 | my $startCoord1 = (split(/\s+/, $r))[1]; 67 | my $endCoord1 = (split(/\s+/, $r))[2]; 68 | my $strand1 = (split(/\s+/, $r))[3]; 69 | my $chrName2 = (split(/\s+/, $r))[6]; 70 | my $startCoord2 = (split(/\s+/, $r))[7]; 71 | my $endCoord2 = (split(/\s+/, $r))[8]; 72 | my $strand2 = (split(/\s+/, $r))[9]; 73 | my $transcriptID1 = (split(/\s+/, $r))[4]; 74 | my $transcriptType1 = (split(/\s+/, $r))[5]; 75 | my $transcriptID2 = (split(/\s+/, $r))[10]; 76 | my $transcriptType2 = (split(/\s+/, $r))[11]; 77 | my $overallSimScore = (split(/\s+/, $r))[18]; 78 | my $codingSimScore = (split(/\s+/, $r))[19]; 79 | my $transcriptName1 = $geneid{'h'}{'t'}{$transcriptID1}; 80 | my $transcriptName2 = $geneid{'m'}{'t'}{$transcriptID2}; 81 | print out_tpair ("$chrName1\t$startCoord1\t$endCoord1\t$strand1\t$chrName2\t$startCoord2\t$endCoord2\t$strand2\t$transcriptID1\t$transcriptID2\t$transcriptName1\t$transcriptName2\t$transcriptType1\t$transcriptType2\t$overallSimScore\t$codingSimScore\t$line\n"); 82 | } 83 | } 84 | close(extresult_trans_in); 85 | 86 | open (extresult_exons_in, "$etmfl/$line/exonLevelMappings-$mping.txt"); 87 | while (my $r = <extresult_exons_in>) { 88 | if ($. 
> 1) { 89 | my $chrName1 = (split(/\s+/, $r))[0]; 90 | my $startCoord1 = (split(/\s+/, $r))[1]; 91 | my $endCoord1 = (split(/\s+/, $r))[2]; 92 | my $strand1 = (split(/\s+/, $r))[3]; 93 | my $chrName2 = (split(/\s+/, $r))[6]; 94 | my $startCoord2 = (split(/\s+/, $r))[7]; 95 | my $endCoord2 = (split(/\s+/, $r))[8]; 96 | my $strand2 = (split(/\s+/, $r))[9]; 97 | my $exonID1 = (split(/\s+/, $r))[4]; 98 | my $exonType1 = (split(/\s+/, $r))[5]; 99 | my $exonID2 = (split(/\s+/, $r))[10]; 100 | my $exonType2 = (split(/\s+/, $r))[11]; 101 | my $overlapScoreFromFullLength = (split(/\s+/, $r))[12]; 102 | my $overlapScoreFromPartialCodingPart = (split(/\s+/, $r))[13]; 103 | my $exonName1 = $geneid{'h'}{'e'}{$exonID1}; 104 | my $exonName2 = $geneid{'m'}{'e'}{$exonID2}; 105 | print out_epair ("$chrName1\t$startCoord1\t$endCoord1\t$strand1\t$chrName2\t$startCoord2\t$endCoord2\t$strand2\t$exonID1\t$exonID2\t$exonName1\t$exonName2\t$exonType1\t$exonType2\t$overlapScoreFromFullLength\t$overlapScoreFromPartialCodingPart\t$line\n"); 106 | } 107 | } 108 | close(extresult_exons_in); 109 | } 110 | close(out_tpair); 111 | close(out_epair); 112 | close(ogene_in); 113 | } 114 | 115 | if ($#ARGV == -1 || $ARGV[0] eq "help" || $#ARGV < 5) { 116 | print ("Type ./extMsummarise <preprocess_folder> <extramapper_folder> <orthologous_genepair_list> <org1name> <org2name> <outputprefix>\n"); 117 | print ("preprocess_folder : Path to the preprocess folder generated by the extMpreprocess script\n"); 118 | print ("extramapper_folder : Path to the output folder generated by the ExTraMapper program\n"); 119 | print ("orthologous_genepair_list : A list of orthologous gene-pairs\n"); 120 | print ("org1name : org1 name e.g. human\n"); 121 | print ("org2name : org2 name e.g. mouse\n"); 122 | print ("outputprefix : output file prefix\n\n"); 123 | exit; 124 | } 125 | else { 126 | my ($preprocess_folder, $extmapper_result, $pair_list, $org1, $org2, $output) = @ARGV; 127 | chomp ($preprocess_folder, $extmapper_result, $pair_list, $org1, $org2, $output); 128 | 129 | my $org1_transcript_gtf = "$preprocess_folder/data/$org1-$org2/GTFsummaries/org1-allTranscripts-GTFparsed.txt"; 130 | my $org2_transcript_gtf = "$preprocess_folder/data/$org1-$org2/GTFsummaries/org2-allTranscripts-GTFparsed.txt"; 131 | my $ogene = $pair_list; 132 | my $etmfl = "$extmapper_result/$org1-$org2"; 133 | generatefiles($org1_transcript_gtf,$org2_transcript_gtf,$ogene,$etmfl,"0.8",$output); 134 | } 135 | --------------------------------------------------------------------------------
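For reference, with the mapping threshold of 0.8 that the wrapper passes to `generatefiles`, a run like the example above would leave two tab-separated summary files named after the output prefix; a quick way to inspect them (sketch):

```bash
# Peek at the headers of the two concatenated summary files written by extMsummarise
head -2 extramapper-result.transcriptLevelMappings-0.8.txt
head -2 extramapper-result.exonLevelMappings-0.8.txt
```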