├── LICENSE.txt
├── README.md
├── ROSE-local.sh
├── annotation
│   ├── hg18_refseq.ucsc
│   ├── hg19_refseq.ucsc
│   ├── hg38_refseq.ucsc
│   ├── mm10_refseq.ucsc
│   ├── mm8_refseq.ucsc
│   └── mm9_refseq.ucsc
├── bin
│   ├── ROSE_bamToGFF.py
│   ├── ROSE_callSuper.R
│   ├── ROSE_geneMapper.py
│   └── ROSE_main.py
└── lib
    └── ROSE_utils.py

/LICENSE.txt:
--------------------------------------------------------------------------------
1 | =====ROSE: RANK ORDERING OF SUPER-ENHANCERS=====
2 | 
3 | ROSE IS RELEASED UNDER THE MIT X11 LICENSE
4 | 
5 | EXAMPLE DATA AND ADDITIONAL INFORMATION CAN BE FOUND HERE:
6 | http://younglab.wi.mit.edu/super_enhancer_code.html
7 | 
8 | N.B. LICENSE.txt
9 | 
10 | For details of this analysis see:
11 | 
12 | Master Transcription Factors and Mediator Establish Super-Enhancers at Key Cell Identity Genes
13 | Warren A. Whyte, David A. Orlando, Denes Hnisz, Brian J. Abraham, Charles Y. Lin, Michael H. Kagey, Peter B. Rahl, Tong Ihn Lee and Richard A. Young
14 | Cell 153, 307-319, April 11, 2013
15 | 
16 | and
17 | 
18 | Selective Inhibition of Tumor Oncogenes by Disruption of Super-enhancers
19 | Jakob Lovén, Heather A. Hoke, Charles Y. Lin, Ashley Lau, David A. Orlando, Christopher R. Vakoc, James E. Bradner, Tong Ihn Lee, and Richard A. Young
20 | Cell 153, 320-334, April 11, 2013
21 | 
22 | Please cite these papers when using this code.
23 | 
24 | SOFTWARE AUTHORS: Charles Y. Lin, David A. Orlando, Brian J. Abraham
25 | CONTACT: young_computation@wi.mit.edu
26 | ACKNOWLEDGEMENTS: Graham Ruby
27 | Developed using Python 2.7.3, R 2.15.3, and SAMtools 0.1.18
28 | 
29 | PURPOSE: To create stitched enhancers, and to separate super-enhancers from typical enhancers using sequencing data (.bam) given a file of previously identified constituent enhancers (.gff)
30 | 
31 | 1) PREPARATION/REQUIREMENTS
32 | 
33 | .bam files of sequencing reads for factor of interest and control (WCE/IgG recommended).
34 | .bam files must have chromosome IDs starting with "chr"
35 | .bam files must be sorted and indexed using SAMtools in order for bamToGFF.py to work. (http://samtools.sourceforge.net/samtools.shtml)
36 | Code must be run from the directory in which it is stored.
37 | .gff file of constituent enhancers previously identified (gff format ref: https://genome.ucsc.edu/FAQ/FAQformat.html#format3).
38 | .gff must have the following columns:
39 | 1: chromosome (chr#)
40 | 2: unique ID for each constituent enhancer region
41 | 4: start of constituent
42 | 5: end of constituent
43 | 7: strand (+,-,.)
44 | 9: unique ID for each constituent enhancer region
45 | NOTE: if the values in columns 2 and 9 differ, the value in column 2 will be used
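For example, a minimal constituent line (tab-delimited; the peak ID is hypothetical, and columns 3, 6, and 8 are left blank) could read:

chr1	MACS_peak_1		1000	2500		+		MACS_peak_1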
46 | 
47 | 2) CONTENTS
48 | 
49 | ROSE_main.py: main program
50 | ROSE_utils.py: utility methods
51 | ROSE_bamToGFF.py: calculates density of .bam sequencing reads in .gff regions
52 | ROSE_callSuper.R: ranks regions by their densities, creates a cutoff to separate super-enhancers from typical enhancers
53 | ROSE_geneMapper.py: assigns stitched enhancers to genes
54 | annotation/: Refseq gene tables for genomes MM8,MM9,MM10,HG18,HG19,HG38
55 | In ROSE_DATA:
56 | 
57 | example.sh: sample call of ROSE_main.py
58 | data/: folder containing an example .gff input file and two example .bam files
59 | example/: folder containing example output generated by example.sh
60 | 3) USAGE
61 | 
62 | Program is run by calling ROSE_main.py
63 | 
64 | From within root directory:
65 | python ROSE_main.py -g GENOME_BUILD -i INPUT_CONSTITUENT_GFF -r RANKING_BAM -o OUTPUT_DIRECTORY [optional: -s STITCHING_DISTANCE -t TSS_EXCLUSION_ZONE_SIZE -c CONTROL_BAM]
66 | 
67 | Required parameters:
68 | 
69 | GENOME_BUILD: one of hg18, hg19, hg38, mm8, mm9, or mm10 referring to the UCSC genome build used for read mapping
70 | INPUT_CONSTITUENT_GFF: .gff file (described above) of regions that were previously calculated to be enhancers, e.g. Med1-enriched regions identified using MACS.
71 | RANKING_BAM: .bam file to be used for ranking enhancers by density of this factor, e.g. Med1 ChIP-Seq reads.
72 | OUTPUT_DIRECTORY: directory to be used for storing output.
73 | Optional parameters:
74 | 
75 | STITCHING_DISTANCE: maximum distance between two regions that will be stitched together (Default: 12.5kb)
76 | TSS_EXCLUSION_ZONE_SIZE: exclude regions contained within +/- this distance from TSS in order to account for promoter biases (Default: 0; recommended if used: 2500). If this value is 0, the program will not look for a gene file.
77 | CONTROL_BAM: .bam file to be used as a control, subtracted from the density of the RANKING_BAM, e.g. whole-cell extract reads.
78 | 4) CODE PROCEDURE:
79 | 
80 | ROSE_main.py will:
81 | 
82 | format output directory hierarchy
83 | Root name of input .gff ([input_enhancer_list].gff) used as naming root for output files.
84 | stitch enhancer constituents in INPUT_CONSTITUENT_GFF based on STITCHING_DISTANCE and make .gff and .bed of stitched collection
85 | TSS exclusion, if not zero, is attempted before stitching
86 | Names of stitched regions start with number of regions stitched followed by leftmost constituent ID
87 | call bamToGFF.py to get density of RANKING_BAM and CONTROL_BAM in stitched regions and constituents
88 | Maximum time to wait for bamToGFF.py is 12h but can be changed -- quits if running too long
89 | call callSuper.R to sort stitched enhancers by their background-subtracted density of RANKING_BAM and separate into two groups
90 | 5) OUTPUT:
91 | 
92 | All file names begin with the root of INPUT_CONSTITUENT_GFF
93 | 
94 | **OUTPUT_DIRECTORY/gff/
95 | .gff: copied .gff file of INPUT_CONSTITUENT_GFF
96 | (chrom, name, [blank], start, end, [blank], strand, [blank], name)
97 | STITCHED.gff: regions created by stitching together INPUT_CONSTITUENT_GFF at STITCHING_DISTANCE
98 | (chrom, name, [blank], start, end, [blank], strand, [blank], name) Name is number of constituents stitched together followed by ID of leftmost constituent.
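For example, a stitched region named 3_MACS_peak_7 (hypothetical ID) would span 3 constituents, the leftmost of which was MACS_peak_7.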
99 | **OUTPUT_DIRECTORY/mappedGFF/
100 | *_MAPPED.gff: output of bamToGFF using each bam file containing densities of factor in each constituent
101 | (constituent ID, region tested, average read density in units of reads-per-million-mapped per bp of constituent)
102 | *_STITCHED*_MAPPED.gff: output of bamToGFF using each bam file containing densities of factor in each stitched enhancer
103 | (stitched enhancer ID, region tested, average read density in units of reads-per-million-mapped per bp of stitched enhancer)
104 | **OUTPUT_DIRECTORY/
105 | STITCHED_ENHANCER_REGION_MAP.txt: all densities from bamToGFF calculated in stitched enhancers
106 | (stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, rank of RANKING_BAM signal, signal of RANKING_BAM)
107 | Signal of RANKING_BAM is density times length.
108 | *_AllEnhancers.table.txt: Rankings and super status for each stitched enhancer
109 | (stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, size of constituents that were stitched together, signal of RANKING_BAM, rank of RANKING_BAM, binary of super-enhancer (1) vs. typical (0))
110 | Signal of RANKING_BAM is density times length.
111 | *_SuperEnhancers.table.txt: Rankings and super status for super-enhancers
112 | (stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, size of constituents that were stitched together, signal of RANKING_BAM, rank of RANKING_BAM, binary of super-enhancer (1) vs. typical (0))
113 | Signal of RANKING_BAM is density times length.
114 | *_Enhancers_withSuper.bed: .bed file to be loaded into the UCSC browser to visualize super-enhancers and typical enhancers.
115 | (chromosome, stitched enhancer start, stitched enhancer end, stitched enhancer ID, rank by RANKING_BAM signal)
116 | *_Plot_points.png: visualization of the ranks of super-enhancers and the two groups. Stitched enhancers are ranked by their RANKING_BAM signal, with ranks along the X axis and the corresponding RANKING_BAM signal on the Y axis.
117 | ====================================
118 | NOTES:
119 | 
120 | mapEnhancerFromFactor.py (now ROSE_main.py) has a debug mode that can be enabled in the beginning of the main function.
121 | Enhancers in INPUT_CONSTITUENT_GFF may overlap each other in the input.
122 | This code can be easily parallelized by following the instructions in the main function of mapEnhancerFromFactor around line 369.
123 | Other gene lists may be added if downloaded from UCSC.
124 | Code from external sources is also cited in-line.
125 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ROSE : RANK ORDERING OF SUPER-ENHANCERS
2 | 
3 | CLONED using SOURCETREE from: https://bitbucket.org/young_computation/rose/src/master/
4 | 
5 | ### Changes from Source.
6 | 1. USAGE
7 | 
8 | ```bash
9 | PATHTO=/path/to/ROSE
10 | PYTHONPATH=$PATHTO/lib
11 | export PYTHONPATH
12 | export PATH=$PATH:$PATHTO/bin
13 | 
14 | ROSE_main.py [options] -g [GENOME] -i [INPUT_REGION_GFF] -r [RANKBY_BAM_FILE] -o [OUTPUT_FOLDER] [OPTIONAL_FLAGS]
15 | ```
16 | 
17 | 1. Update:
18 | 
19 | * ROSE is executable independent of software directory location.
20 | * ROSE is compatible with Python 3.
21 | * ROSE is available as a Docker image: ghcr.io/stjude/abralab/rose:latest
22 | 
23 | 1. REQUIREMENTS:
24 | 
25 | 1. All files :
26 | All input files must be in one directory.
27 | 
28 | 1. Annotation file :
29 | Annotation file should be in UCSC table track format (https://genome.ucsc.edu/cgi-bin/hgTables).
30 | Annotation file should be saved as [GENOME]_refseq.ucsc (example: hg19_refseq.ucsc).
31 | Annotation file should be in the annotation/ folder in the input files directory.
32 | 
33 | 1. BAM files (of sequencing reads for factor of interest and control) :
34 | Files must have chromosome IDs starting with "chr"
35 | Files must be sorted and indexed using SAMtools in order for bamToGFF.py to work. (http://samtools.sourceforge.net/samtools.shtml)
36 | 
37 | 1. Peak file of constituent enhancers :
38 | File must be in GFF format with the following columns:
39 | 
40 | column 1: chromosome (chr#)
41 | column 2: unique ID for each constituent enhancer region
42 | column 4: start of constituent
43 | column 5: end of constituent
44 | column 7: strand (+,-,.)
45 | column 9: unique ID for each constituent enhancer region
46 | 
47 | NOTE: if the values in columns 2 and 9 differ, the value in column 2 will be used
48 | 
49 | 1. DIRECTORY structure
50 | ```
51 | ├── LICENSE.txt
52 | │ 
53 | ├── README.md
54 | │ 
55 | ├── bin
56 | │   ├── ROSE_bamToGFF.py : calculates density of .bam reads in .gff regions
57 | │   ├── ROSE_callSuper.R : ranks regions by their densities, creates cutoff
58 | │   ├── ROSE_geneMapper.py : assigns stitched enhancers to genes
59 | │   └── ROSE_main.py : main program
60 | └── lib
61 |     └── ROSE_utils.py : utility methods
62 | 
63 | Total: 2 directories, 8 files
64 | ```
65 | 1. DEPENDENCIES
66 | 
67 | * samtools
68 | * R version > 3.4
69 | * bedtools > 2
70 | * python3
--------------------------------------------------------------------------------
/ROSE-local.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #
3 | # Rose Caller to detect both Enhancers and Super-Enhancers
4 | # Hardcoded implementation of ROSE for St. Jude, Abraham's lab.
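#
# Example invocation, with hypothetical input files (arguments are positional,
# in the order shown by the usage message below):
#   bash ROSE-local.sh refseq.gtf sample.bam ROSE_out gene hg19 peaksA.bed peaksB.bed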
5 | # Version 1 11/16/2019
6 | 
7 | ##############################################################
8 | # ##### Please replace PATHTO with your own directory ###### #
9 | ##############################################################
10 | PATHTO=/path/to/ROSE
11 | PYTHONPATH=$PATHTO/lib
12 | export PYTHONPATH
13 | export PATH=$PATH:$PATHTO/bin
14 | 
15 | if [ $# -lt 7 ]; then
16 |     echo ""
17 |     echo 1>&2 Usage: $0 ["GTF file"] ["BAM file"] ["OutputDir"] ["feature type"] ["species"] ["bed fileA"] ["bed fileB"]
18 |     echo ""
19 |     exit 1
20 | fi
21 | 
22 | #================================================================================
23 | #Parameters for running
24 | 
25 | # GTF file
26 | GTFFILE=$1
27 | 
28 | # BAM file
29 | BAMFILE=$2
30 | 
31 | # Output Directory
32 | OUTPUTDIR=$3
33 | OUTPUTDIR=${OUTPUTDIR:=ROSE_out}
34 | 
35 | # Feature type
36 | FEATURE=$4
37 | FEATURE=${FEATURE:=gene}
38 | 
39 | # Species
40 | SPECIES=$5
41 | SPECIES=${SPECIES:=hg19}
42 | 
43 | # Bed File A
44 | FILEA=$6
45 | 
46 | # Bed File B
47 | FILEB=$7
48 | 
49 | # Transcription Start Site window
50 | #TSS=
51 | TSS=${TSS:=2000}
52 | 
53 | # Maximum linking distance for stitching
54 | #STITCH=
55 | STITCH=${STITCH:=12500}
56 | 
57 | 
58 | echo "#############################################"
59 | echo "######           ROSE v1              ######"
60 | echo "#############################################"
61 | 
62 | echo "Input Bed File A: $FILEA"
63 | echo "Input Bed File B: $FILEB"
64 | echo "BAM file: $BAMFILE"
65 | echo "Output directory: $OUTPUTDIR"
66 | echo "Species: $SPECIES"
67 | echo "Feature type: $FEATURE"
68 | #================================================================================
69 | #
70 | # UCSC TRACK FORMAT ANNOTATION FILE
71 | # Generate UCSC table track annotation file using NCBI GTF refseq.
72 | #
73 | mkdir -p annotation
74 | echo -e "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\tX\tX\tX\t\tX\tname2" > annotation/$SPECIES"_refseq.ucsc"
75 | 
76 | if [[ $FEATURE == "gene" ]]; then
77 |     awk -F'[\t ]' '{
78 |         if($3=="gene")
79 |             print "0\t" $14 "\tchr" $1 "\t" $7 "\t" $4 "\t" $5 "\t" $4 "\t" $5 "\t.\t.\t.\t.\t" $18}' $GTFFILE | sed s/\"//g >> annotation/$SPECIES"_refseq.ucsc"
80 | 
81 | elif [[ $FEATURE == "transcript" ]]; then
82 |     awk -F'[\t ]' '{
83 |         if($3=="transcript")
84 |             print "0\t" $14 "\tchr" $1 "\t" $7 "\t" $4 "\t" $5 "\t" $4 "\t" $5 "\t.\t.\t.\t.\t" $18}' $GTFFILE | sed s/\"//g >> annotation/$SPECIES"_refseq.ucsc"
85 | fi
86 | echo "Annotation file: "$SPECIES"_refseq.ucsc"
87 | 
88 | #
89 | # INPUT CONSTITUENT FILE
90 | # Merge peak bed files generated from MACS1 "keep_dup=all" and "keep_dup=auto" to generate constituent enhancers.
91 | cat $FILEA $FILEB | sort -k1,1 -k2,2n | mergeBed -i - | awk -F\\t '{print $1 "\t" NR "\t\t" $2 "\t" $3 "\t\t.\t\t" NR}' > unionpeaks.gff
92 | echo "Merge Bed file: unionpeaks.gff"
93 | echo
94 | 
95 | #
96 | # ROSE
97 | #
98 | ROSE_main.py -s $STITCH -t $TSS -g $SPECIES -i unionpeaks.gff -r $BAMFILE -o $OUTPUTDIR
99 | 
100 | echo "Done!"
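# Assuming the run completes, the key outputs (named after unionpeaks.gff) land in
# $OUTPUTDIR: unionpeaks_AllStitched.table.txt, unionpeaks_SuperStitched.table.txt,
# unionpeaks_Plot_points.png, plus the *_REGION_TO_GENE.txt and
# *_GENE_TO_REGION.txt tables written by ROSE_geneMapper.py.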
101 | -------------------------------------------------------------------------------- /bin/ROSE_bamToGFF.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | #bamToGFF.py 3 | 4 | #script to grab reads from a bam that align to a .gff file 5 | import sys 6 | import re 7 | 8 | import ROSE_utils 9 | 10 | 11 | from collections import defaultdict 12 | 13 | import os 14 | 15 | import string 16 | 17 | 18 | 19 | #===================================================================== 20 | #====================MAPPING BAM READS TO GFF REGIONS================= 21 | #===================================================================== 22 | 23 | 24 | def mapBamToGFF(bamFile,gff,sense = 'both',extension = 200,floor = 0,rpm = False,matrix = None): 25 | 26 | #def mapBamToGFF(bamFile,gff,sense = 'both',unique = 0,extension = 200,floor = 0,density = False,rpm = False,binSize = 25,clusterGram = None,matrix = None,raw = False,includeJxnReads = False): 27 | '''maps reads from a bam to a gff''' 28 | floor = int(floor) 29 | 30 | #USING BAM CLASS 31 | bam = ROSE_utils.Bam(bamFile) 32 | 33 | 34 | #new GFF to write to 35 | newGFF = [] 36 | #millionMappedReads 37 | 38 | 39 | if rpm: 40 | MMR= round(float(bam.getTotalReads('mapped'))/1000000,4) 41 | else: 42 | MMR = 1 43 | 44 | print(('using a MMR value of %s' % (MMR))) 45 | 46 | #senseTrans = maketrans('-+.','+-+') #deprecated 47 | 48 | if ROSE_utils.checkChrStatus(bamFile) == 1: 49 | print("has chr") 50 | hasChrFlag = 1 51 | #sys.exit(); 52 | else: 53 | print("does not have chr") 54 | hasChrFlag = 0 55 | #sys.exit() 56 | 57 | if type(gff) == str: 58 | gff = ROSE_utils.parseTable(gff,'\t') 59 | 60 | #setting up a matrix table 61 | 62 | newGFF.append(['GENE_ID','locusLine'] + ['bin_'+str(n)+'_'+bamFile.split('/')[-1] for n in range(1,int(matrix)+1,1)]) 63 | 64 | #getting and processing reads for gff lines 65 | ticker = 0 66 | print('Number lines processed') 67 | for line in gff: 68 | line = line[0:9] 69 | if ticker%100 == 0: 70 | print(ticker) 71 | ticker+=1 72 | if not hasChrFlag: 73 | line[0] = re.sub(r"chr",r"",line[0]) 74 | gffLocus = ROSE_utils.Locus(line[0],int(line[3]),int(line[4]),line[6],line[1]) 75 | #print line[0] 76 | #sys.exit() 77 | searchLocus = ROSE_utils.makeSearchLocus(gffLocus,int(extension),int(extension)) 78 | 79 | reads = bam.getReadsLocus(searchLocus,'both',False,'none') 80 | #now extend the reads and make a list of extended reads 81 | extendedReads = [] 82 | for locus in reads: 83 | if locus.sense() == '+' or locus.sense() == '.': 84 | locus = ROSE_utils.Locus(locus.chr(),locus.start(),locus.end()+extension,locus.sense(), locus.ID()) 85 | if locus.sense() == '-': 86 | locus = ROSE_utils.Locus(locus.chr(),locus.start()-extension,locus.end(),locus.sense(),locus.ID()) 87 | extendedReads.append(locus) 88 | if gffLocus.sense() == '+' or gffLocus.sense() == '.': 89 | senseReads = [x for x in extendedReads if x.sense() == '+' or x.sense() == '.'] 90 | antiReads = [x for x in extendedReads if x.sense() == '-'] 91 | else: 92 | senseReads = [x for x in extendedReads if x.sense() == '-' or x.sense() == '.'] 93 | antiReads = [x for x in extendedReads if x.sense() == '+'] 94 | 95 | senseHash = defaultdict(int) 96 | antiHash = defaultdict(int) 97 | 98 | #filling in the readHashes 99 | if sense == '+' or sense == 'both' or sense =='.': 100 | for read in senseReads: 101 | for x in range(read.start(),read.end()+1,1): 102 | senseHash[x]+=1 103 | if sense == '-' or sense == 'both' or sense == '.': 104
| #print('foo') 105 | for read in antiReads: 106 | for x in range(read.start(),read.end()+1,1): 107 | antiHash[x]+=1 108 | 109 | #now apply flooring and filtering for coordinates 110 | keys = ROSE_utils.uniquify(list(senseHash.keys())+list(antiHash.keys())) 111 | if floor > 0: 112 | 113 | keys = [x for x in keys if (senseHash[x]+antiHash[x]) > floor] 114 | #coordinate filtering 115 | keys = [x for x in keys if gffLocus.start() < x < gffLocus.end()] 116 | 117 | 118 | #setting up the output table 119 | #clusterLine = [gffLocus.ID(),gffLocus.__str__()] 120 | 121 | # bug fix gff coordinates with same chromosomal name as BAM 122 | if not hasChrFlag: 123 | clusterLine = [gffLocus.ID(),"chr" + gffLocus.__str__()] 124 | else: 125 | clusterLine = [gffLocus.ID(),gffLocus.__str__()] 126 | 127 | #getting the binsize 128 | binSize = (gffLocus.len()-1)/int(matrix) 129 | nBins = int(matrix) 130 | if binSize == 0: 131 | clusterLine+=['NA']*int(matrix) 132 | newGFF.append(clusterLine) 133 | continue 134 | n=0 135 | if gffLocus.sense() == '+' or gffLocus.sense() =='.' or gffLocus.sense() == 'both': 136 | i = gffLocus.start() 137 | 138 | while n cutoff_options$absolute) 130 | typicalEnhancers = setdiff(1:nrow(stitched_regions),superEnhancerRows) 131 | enhancerDescription <- paste(enhancerName," Stitched Peaks\nCreated from ", enhancerFile,"\nRanked by ",rankBy_factor,"\nUsing cutoff of ",cutoff_options$absolute," for Super-Stitched Peaks",sep="",collapse="") 132 | 133 | 134 | #MAKING HOCKEY STICK PLOT 135 | plotFileName = paste(outFolder,enhancerName,'_Plot_points.png',sep='') 136 | png(filename=plotFileName,height=600,width=600) 137 | signalOrder = order(rankBy_vector,decreasing=TRUE) 138 | if(wceName == 'NONE'){ 139 | plot(length(rankBy_vector):1,rankBy_vector[signalOrder], col='red',xlab=paste(rankBy_factor,' Stitched peaks'),ylab=paste(rankBy_factor,' Signal'),pch=19,cex=2) 140 | 141 | }else{ 142 | plot(length(rankBy_vector):1,rankBy_vector[signalOrder], col='red',xlab=paste(rankBy_factor,' Stitched peaks'),ylab=paste(rankBy_factor,' Signal','- ',wceName),pch=19,cex=2) 143 | } 144 | abline(h=cutoff_options$absolute,col='grey',lty=2) 145 | abline(v=length(rankBy_vector)-length(superEnhancerRows),col='grey',lty=2) 146 | lines(length(rankBy_vector):1,rankBy_vector[signalOrder],lwd=4, col='red') 147 | text(0,0.8*max(rankBy_vector),paste(' Cutoff used: ',cutoff_options$absolute,'\n','Super-Stitched peaks identified: ',length(superEnhancerRows)),pos=4) 148 | 149 | dev.off() 150 | 151 | 152 | 153 | 154 | #Writing a bed file 155 | bedFileName = paste(outFolder,enhancerName,'_Stitched_withSuper.bed',sep='') 156 | convert_stitched_to_bed(stitched_regions,paste(rankBy_factor,"Stitched"), enhancerDescription,bedFileName,score=rankBy_vector,splitSuper=TRUE,superRows= superEnhancerRows,baseColor="0,0,0",superColor="255,0,0") 157 | 158 | 159 | 160 | #This matrix is just the super_enhancers 161 | true_super_enhancers <- stitched_regions[superEnhancerRows,] 162 | 163 | additionalTableData <- matrix(data=NA,ncol=2,nrow=nrow(stitched_regions)) 164 | colnames(additionalTableData) <- c("stitchedPeakRank","isSuper") 165 | additionalTableData[,1] <- nrow(stitched_regions)-rank(rankBy_vector,ties.method="first")+1 166 | additionalTableData[,2] <- 0 167 | additionalTableData[superEnhancerRows,2] <- 1 168 | 169 | 170 | #Writing enhancer and super-enhancer tables with enhancers ranked and super status annotated 171 | enhancerTableFile = paste(outFolder,enhancerName,'_AllStitched.table.txt',sep='') 172 | 
writeSuperEnhancer_table(stitched_regions, enhancerDescription,enhancerTableFile, additionalData= additionalTableData) 173 | 174 | superTableFile = paste(outFolder,enhancerName,'_SuperStitched.table.txt',sep='') 175 | writeSuperEnhancer_table(true_super_enhancers, enhancerDescription,superTableFile, additionalData= additionalTableData[superEnhancerRows,]) 176 | -------------------------------------------------------------------------------- /bin/ROSE_geneMapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | #130428 3 | 4 | #ROSE_geneMapper.py 5 | 6 | #main method wrapped script to take the enhancer region table output of ROSE_Main and map genes to it 7 | #will create two outputs: a gene-mapped region table where each row is an enhancer 8 | #and a gene table where each row is a gene 9 | #does this by default for super-enhancers only 10 | 11 | import sys 12 | 13 | import ROSE_utils 14 | 15 | import os 16 | 17 | import string 18 | 19 | from collections import defaultdict 20 | 21 | 22 | #================================================================== 23 | #====================MAPPING GENES TO ENHANCERS==================== 24 | #================================================================== 25 | 26 | 27 | 28 | def mapEnhancerToGene(annotFile,enhancerFile,transcribedFile='',uniqueGenes=True,byRefseq=False,subtractInput=False): 29 | 30 | ''' 31 | maps genes to enhancers. if uniqueGenes, reduces to gene name only. Otherwise, reports one line per refseq ID 32 | ''' 33 | print("MAKING START DICT") 34 | startDict = ROSE_utils.makeStartDict(annotFile) 35 | print("PARSING ENHANCER TABLE") 36 | enhancerTable = ROSE_utils.parseTable(enhancerFile,'\t') 37 | 38 | 39 | 40 | 41 | 42 | if len(transcribedFile) > 0: 43 | transcribedTable = ROSE_utils.parseTable(transcribedFile,'\t') 44 | transcribedGenes = [line[1] for line in transcribedTable] 45 | else: 46 | transcribedGenes = list(startDict.keys()) 47 | 48 | print('MAKING TRANSCRIPT COLLECTION') 49 | transcribedCollection = ROSE_utils.makeTranscriptCollection(annotFile,0,0,500,transcribedGenes) 50 | 51 | 52 | print('MAKING TSS COLLECTION') 53 | tssLoci = [] 54 | for geneID in transcribedGenes: 55 | tssLoci.append(ROSE_utils.makeTSSLocus(geneID,startDict,0,0)) 56 | 57 | 58 | #this turns the tssLoci list into a LocusCollection 59 | #50 is the internal parameter for LocusCollection and doesn't really matter 60 | tssCollection = ROSE_utils.LocusCollection(tssLoci,50) 61 | 62 | 63 | 64 | geneDict = {'overlapping':defaultdict(list),'proximal':defaultdict(list),'enhancerString':defaultdict(list)} 65 | #list of all genes that appear in this analysis 66 | overallGeneList = [] 67 | 68 | #set up the output tables 69 | #first by enhancer 70 | enhancerToGeneTable = [enhancerTable[5][0:6]+['OVERLAP_GENES','PROXIMAL_GENES','CLOSEST_GENE'] + enhancerTable[5][-2:]] 71 | 72 | #next by gene 73 | geneToEnhancerTable = [['GENE_NAME','REFSEQ_ID','PROXIMAL_STITCHED_PEAKS']] 74 | 75 | #have all information 76 | signalWithGenes = [['GENE_NAME', 'REFSEQ_ID', 'PROXIMAL_STITCHED_PEAKS', 'SIGNAL']] 77 | 78 | for line in enhancerTable[6:]: 79 | 80 | enhancerString = '%s:%s-%s' % (line[1],line[2],line[3]) 81 | enhancerSignal = int(float(line[6])) 82 | if subtractInput: enhancerSignal = int(float(line[6]) - float(line[7])) 83 | 84 | enhancerLocus = ROSE_utils.Locus(line[1],line[2],line[3],'.',line[0]) 85 | 86 | 87 | #overlapping genes are transcribed genes whose transcript is directly in the stitchedLocus 88 | overlappingLoci =
transcribedCollection.getOverlap(enhancerLocus,'both') 89 | overlappingGenes =[] 90 | for overlapLocus in overlappingLoci: 91 | overlappingGenes.append(overlapLocus.ID()) 92 | 93 | #proximalGenes are transcribed genes where the tss is within 50kb of the boundary of the stitched loci 94 | proximalLoci = tssCollection.getOverlap(ROSE_utils.makeSearchLocus(enhancerLocus,50000,50000),'both') 95 | proximalGenes =[] 96 | for proxLocus in proximalLoci: 97 | proximalGenes.append(proxLocus.ID()) 98 | 99 | 100 | distalLoci = tssCollection.getOverlap(ROSE_utils.makeSearchLocus(enhancerLocus,50000000,50000000),'both') 101 | distalGenes =[] 102 | for proxLocus in distalLoci: 103 | distalGenes.append(proxLocus.ID()) 104 | 105 | 106 | 107 | overlappingGenes = ROSE_utils.uniquify(overlappingGenes) 108 | proximalGenes = ROSE_utils.uniquify(proximalGenes) 109 | distalGenes = ROSE_utils.uniquify(distalGenes) 110 | allEnhancerGenes = overlappingGenes + proximalGenes + distalGenes 111 | #these checks make sure each gene list is unique. 112 | #technically it is possible for a gene to be overlapping, but not proximal since the 113 | #gene could be longer than the 50kb window, but we'll let that slide here 114 | for refID in overlappingGenes: 115 | if proximalGenes.count(refID) == 1: 116 | proximalGenes.remove(refID) 117 | 118 | for refID in proximalGenes: 119 | if distalGenes.count(refID) == 1: 120 | distalGenes.remove(refID) 121 | 122 | 123 | #Now find the closest gene 124 | if len(allEnhancerGenes) == 0: 125 | closestGene = '' 126 | else: 127 | #get enhancerCenter 128 | enhancerCenter = (int(line[2]) + int(line[3]))/2 129 | 130 | #get absolute distance to enhancer center 131 | distList = [abs(enhancerCenter - startDict[geneID]['start'][0]) for geneID in allEnhancerGenes] 132 | closestGene = startDict[allEnhancerGenes[distList.index(min(distList))]]['name'] 133 | 134 | #NOW WRITE THE ROW FOR THE ENHANCER TABLE 135 | newEnhancerLine = line[0:6] 136 | 137 | if byRefseq: 138 | newEnhancerLine.append(','.join(ROSE_utils.uniquify([x for x in overlappingGenes]))) 139 | newEnhancerLine.append(','.join(ROSE_utils.uniquify([x for x in proximalGenes]))) 140 | closestGene = allEnhancerGenes[distList.index(min(distList))] 141 | newEnhancerLine.append(closestGene) 142 | else: 143 | newEnhancerLine.append(','.join(ROSE_utils.uniquify([startDict[x]['name'] for x in overlappingGenes]))) 144 | newEnhancerLine.append(','.join(ROSE_utils.uniquify([startDict[x]['name'] for x in proximalGenes]))) 145 | closestGene = startDict[allEnhancerGenes[distList.index(min(distList))]]['name'] 146 | newEnhancerLine.append(closestGene) 147 | 148 | 149 | #WRITE GENE TABLE 150 | signalWithGenes.append([startDict[closestGene]['name'], closestGene, enhancerString, enhancerSignal]) 151 | 152 | newEnhancerLine += line[-2:] 153 | enhancerToGeneTable.append(newEnhancerLine) 154 | #Now grab all overlapping and proximal genes for the gene ordered table 155 | 156 | overallGeneList +=overlappingGenes 157 | for refID in overlappingGenes: 158 | geneDict['overlapping'][refID].append(enhancerString) 159 | 160 | 161 | overallGeneList+=proximalGenes 162 | for refID in proximalGenes: 163 | geneDict['proximal'][refID].append(enhancerString) 164 | 165 | 166 | #End loop through 167 | 168 | #Make table by gene 169 | overallGeneList = ROSE_utils.uniquify(overallGeneList) 170 | 171 | 172 | nameOrder = ROSE_utils.order([startDict[x]['name'] for x in overallGeneList]) 173 | 174 | usedNames = [] 175 | 176 | for i in nameOrder: 177 | refID = overallGeneList[i] 178 | 
geneName = startDict[refID]['name'] 179 | if usedNames.count(geneName) > 0 and uniqueGenes == True: 180 | 181 | continue 182 | else: 183 | usedNames.append(geneName) 184 | 185 | proxEnhancers = geneDict['proximal'][refID] + geneDict['overlapping'][refID] 186 | 187 | newLine = [geneName,refID,','.join(proxEnhancers)] 188 | 189 | 190 | geneToEnhancerTable.append(newLine) 191 | 192 | #re-sort enhancerToGeneTable 193 | 194 | enhancerOrder = ROSE_utils.order([int(line[-2]) for line in enhancerToGeneTable[1:]]) 195 | sortedTable = [enhancerToGeneTable[0]] 196 | for i in enhancerOrder: 197 | sortedTable.append(enhancerToGeneTable[(i+1)]) 198 | 199 | return sortedTable,geneToEnhancerTable,signalWithGenes 200 | 201 | 202 | 203 | #================================================================== 204 | #=========================MAIN METHOD============================== 205 | #================================================================== 206 | 207 | def main(): 208 | ''' 209 | main run call 210 | ''' 211 | debug = False 212 | 213 | 214 | from optparse import OptionParser 215 | usage = "usage: %prog [options] -g [GENOME] -i [INPUT_ENHANCER_FILE]" 216 | parser = OptionParser(usage = usage) 217 | #required flags 218 | parser.add_option("-i","--i", dest="input",nargs = 1, default=None, 219 | help = "Enter a ROSE ranked enhancer or super-enhancer file") 220 | parser.add_option("-g","--genome", dest="genome",nargs = 1, default=None, 221 | help = "Enter the genome build (MM9,MM8,HG18,HG19,HG38)") 222 | parser.add_option("--custom", dest="custom_genome", default=None, 223 | help = "Enter the custom genome annotation .ucsc") 224 | 225 | #optional flags 226 | parser.add_option("-l","--list", dest="geneList",nargs = 1, default=None, 227 | help = "Enter a gene list to filter through") 228 | parser.add_option("-o","--out", dest="out",nargs = 1, default=None, 229 | help = "Enter an output folder. 
Default will be same folder as input file") 230 | parser.add_option("-r","--refseq",dest="refseq",action = 'store_true', default=False, 231 | help = "If flagged will write output by refseq ID and not common name") 232 | parser.add_option("-c","--control",dest="control",action = 'store_true', default=False, 233 | help = "If flagged will subtract input from sample signal") 234 | 235 | #RETRIEVING FLAGS 236 | (options,args) = parser.parse_args() 237 | 238 | 239 | if not options.input or not (options.genome or options.custom_genome): 240 | 241 | parser.print_help() 242 | exit() 243 | 244 | #GETTING THE INPUT 245 | enhancerFile = options.input 246 | 247 | #making the out folder if it doesn't exist 248 | if options.out: 249 | outFolder = ROSE_utils.formatFolder(options.out,True) 250 | else: 251 | outFolder = '/'.join(enhancerFile.split('/')[0:-1]) + '/' 252 | 253 | 254 | #GETTING THE CORRECT ANNOT FILE 255 | cwd = os.getcwd() 256 | genomeDict = { 257 | 'HG18':'%s/annotation/hg18_refseq.ucsc' % (cwd), 258 | 'MM9': '%s/annotation/mm9_refseq.ucsc' % (cwd), 259 | 'HG19':'%s/annotation/hg19_refseq.ucsc' % (cwd), 260 | 'HG38':'%s/annotation/hg38_refseq.ucsc' % (cwd), 261 | 'MM8': '%s/annotation/mm8_refseq.ucsc' % (cwd), 262 | 'MM10':'%s/annotation/mm10_refseq.ucsc' % (cwd), 263 | } 264 | 265 | #GETTING THE GENOME 266 | if options.custom_genome: 267 | annotFile = options.custom_genome 268 | print('USING CUSTOM GENOME %s AS THE GENOME FILE' % options.custom_genome) 269 | else: 270 | genome = options.genome 271 | annotFile = genomeDict[genome.upper()] 272 | print('USING %s AS THE GENOME' % genome) 273 | 274 | #GETTING THE TRANSCRIBED LIST 275 | if options.geneList: 276 | 277 | transcribedFile = options.geneList 278 | else: 279 | transcribedFile = '' 280 | 281 | enhancerToGeneTable,geneToEnhancerTable,withGenesTable = mapEnhancerToGene(annotFile,enhancerFile, uniqueGenes=True, byRefseq=options.refseq, subtractInput=options.control, transcribedFile=transcribedFile) 282 | 283 | #Writing enhancer output 284 | enhancerFileName = enhancerFile.split('/')[-1].split('.')[0] 285 | 286 | #writing the enhancer table 287 | out1 = '%s%s_REGION_TO_GENE.txt' % (outFolder,enhancerFileName) 288 | ROSE_utils.unParseTable(enhancerToGeneTable,out1,'\t') 289 | 290 | #writing the gene table 291 | out2 = '%s%s_GENE_TO_REGION.txt' % (outFolder,enhancerFileName) 292 | ROSE_utils.unParseTable(geneToEnhancerTable,out2,'\t') 293 | 294 | out3 = '%s%s.table_withGENES.txt' % (outFolder,enhancerFileName) 295 | ROSE_utils.unParseTable(withGenesTable,out3,'\t') 296 | 297 | if __name__ == "__main__": 298 | main() 299 | -------------------------------------------------------------------------------- /bin/ROSE_main.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | #mapEnhancerFromFactor.py 4 | ''' 5 | PROGRAM TO STITCH TOGETHER REGIONS TO FORM ENHANCERS, MAP READ DENSITY TO STITCHED REGIONS, 6 | AND RANK ENHANCERS BY READ DENSITY TO DISCOVER SUPER-ENHANCERS 7 | APRIL 11, 2013 8 | VERSION 0.1 9 | CONTACT: youngcomputation@wi.mit.edu 10 | ''' 11 | 12 | import sys 13 | 14 | 15 | 16 | import ROSE_utils 17 | 18 | import time 19 | 20 | import os 21 | 22 | import string 23 | 24 | from collections import defaultdict 25 | 26 | #================================================================== 27 | #=====================REGION STITCHING============================= 28 | #================================================================== 29 | 30 | 31 | def 
regionStitching(inputGFF,stitchWindow,tssWindow,annotFile,removeTSS=True): 32 | print('PERFORMING REGION STITCHING') 33 | #first have to turn bound region file into a locus collection 34 | 35 | #need to make sure this names correctly... each region should have a unique name 36 | boundCollection = ROSE_utils.gffToLocusCollection(inputGFF) 37 | 38 | debugOutput = [] 39 | #filter out all bound regions that overlap the TSS of an ACTIVE GENE 40 | if removeTSS: 41 | #first make a locus collection of TSS 42 | startDict = ROSE_utils.makeStartDict(annotFile) 43 | 44 | #now makeTSS loci for active genes 45 | removeTicker=0 46 | #this loop makes a locus centered around +/- tssWindow of transcribed genes 47 | #then adds it to the list tssLoci 48 | tssLoci = [] 49 | for geneID in list(startDict.keys()): 50 | tssLoci.append(ROSE_utils.makeTSSLocus(geneID,startDict,tssWindow,tssWindow)) 51 | 52 | 53 | #this turns the tssLoci list into a LocusCollection 54 | #50 is the internal parameter for LocusCollection and doesn't really matter 55 | tssCollection = ROSE_utils.LocusCollection(tssLoci,50) 56 | 57 | #gives all the loci in boundCollection 58 | boundLoci = boundCollection.getLoci() 59 | 60 | #this loop will check if each bound region is contained by the TSS exclusion zone 61 | #this will drop out a lot of the promoter only regions that are tiny 62 | #typical exclusion window is around 2kb 63 | for locus in boundLoci: 64 | if len(tssCollection.getContainers(locus,'both'))>0: 65 | 66 | #if true, the bound locus overlaps an active gene 67 | boundCollection.remove(locus) 68 | debugOutput.append([locus.__str__(),locus.ID(),'CONTAINED']) 69 | removeTicker+=1 70 | print(('REMOVED %s LOCI BECAUSE THEY WERE CONTAINED BY A TSS' % (removeTicker))) 71 | 72 | #boundCollection is now all enriched region loci that don't overlap an active TSS 73 | stitchedCollection = boundCollection.stitchCollection(stitchWindow,'both') 74 | 75 | if removeTSS: 76 | #now replace any stitched region that overlaps more than 2 distinct genes 77 | #with the original loci that were there 78 | fixedLoci = [] 79 | tssLoci = [] 80 | for geneID in list(startDict.keys()): 81 | tssLoci.append(ROSE_utils.makeTSSLocus(geneID,startDict,50,50)) 82 | 83 | 84 | #this turns the tssLoci list into a LocusCollection 85 | #50 is the internal parameter for LocusCollection and doesn't really matter 86 | tssCollection = ROSE_utils.LocusCollection(tssLoci,50) 87 | removeTicker = 0 88 | originalTicker = 0 89 | for stitchedLocus in stitchedCollection.getLoci(): 90 | overlappingTSSLoci = tssCollection.getOverlap(stitchedLocus,'both') 91 | tssNames = [startDict[tssLocus.ID()]['name'] for tssLocus in overlappingTSSLoci] 92 | tssNames = ROSE_utils.uniquify(tssNames) 93 | if len(tssNames) > 2: 94 | 95 | #stitchedCollection.remove(stitchedLocus) 96 | originalLoci = boundCollection.getOverlap(stitchedLocus,'both') 97 | originalTicker+=len(originalLoci) 98 | fixedLoci+=originalLoci 99 | debugOutput.append([stitchedLocus.__str__(),stitchedLocus.ID(),'MULTIPLE_TSS']) 100 | removeTicker+=1 101 | else: 102 | fixedLoci.append(stitchedLocus) 103 | 104 | print(('REMOVED %s STITCHED LOCI BECAUSE THEY OVERLAPPED MULTIPLE TSSs' % (removeTicker))) 105 | print(('ADDED BACK %s ORIGINAL LOCI' % (originalTicker))) 106 | fixedCollection = ROSE_utils.LocusCollection(fixedLoci,50) 107 | return fixedCollection,debugOutput 108 | else: 109 | return stitchedCollection,debugOutput 110 | 111 | #================================================================== 112 | #=====================REGION LINKING
MAPPING======================= 113 | #================================================================== 114 | 115 | def mapCollection(stitchedCollection,referenceCollection,bamFileList,mappedFolder,output,refName): 116 | 117 | 118 | ''' 119 | makes a table of factor density in a stitched locus and ranks table by number of loci stitched together 120 | ''' 121 | 122 | 123 | print('FORMATTING TABLE') 124 | loci = stitchedCollection.getLoci() 125 | 126 | locusTable = [['REGION_ID','CHROM','START','STOP','NUM_LOCI','CONSTITUENT_SIZE']] 127 | 128 | lociLenList = [] 129 | 130 | #strip out any that are in chrY 131 | for locus in list(loci): 132 | if locus.chr() == 'chrY': 133 | loci.remove(locus) 134 | 135 | for locus in loci: 136 | #numLociList.append(int(stitchLocus.ID().split('_')[1])) 137 | lociLenList.append(locus.len()) 138 | #numOrder = order(numLociList,decreasing=True) 139 | lenOrder = ROSE_utils.order(lociLenList,decreasing=True) 140 | ticker = 0 141 | for i in lenOrder: 142 | ticker+=1 143 | if ticker%1000 ==0: 144 | print(ticker) 145 | locus = loci[i] 146 | 147 | #First get the size of the enriched regions within the stitched locus 148 | refEnrichSize = 0 149 | refOverlappingLoci = referenceCollection.getOverlap(locus,'both') 150 | for refLocus in refOverlappingLoci: 151 | refEnrichSize+=refLocus.len() 152 | 153 | try: 154 | stitchCount = int(locus.ID().split('_')[0]) 155 | except ValueError: 156 | stitchCount = 1 157 | 158 | locusTable.append([locus.ID(),locus.chr(),locus.start(),locus.end(),stitchCount,refEnrichSize]) 159 | 160 | 161 | print('GETTING MAPPED DATA') 162 | for bamFile in bamFileList: 163 | 164 | bamFileName = bamFile.split('/')[-1] 165 | 166 | print(('GETTING MAPPING DATA FOR %s' % bamFile)) 167 | #assumes standard convention for naming enriched region gffs 168 | 169 | #opening up the mapped GFF 170 | print(('OPENING %s%s_%s_MAPPED.gff' % (mappedFolder,refName,bamFileName))) 171 | 172 | mappedGFF =ROSE_utils.parseTable('%s%s_%s_MAPPED.gff' % (mappedFolder,refName,bamFileName),'\t') 173 | 174 | signalDict = defaultdict(float) 175 | print(('MAKING SIGNAL DICT FOR %s' % (bamFile))) 176 | mappedLoci = [] 177 | for line in mappedGFF[1:]: 178 | 179 | chrom = line[1].split('(')[0] 180 | start = int(line[1].split(':')[-1].split('-')[0]) 181 | end = int(line[1].split(':')[-1].split('-')[1]) 182 | mappedLoci.append(ROSE_utils.Locus(chrom,start,end,'.',line[0])) 183 | try: 184 | signalDict[line[0]] = float(line[2])*(abs(end-start)) 185 | except ValueError: 186 | print('WARNING NO SIGNAL FOR LINE:') 187 | print(line) 188 | continue 189 | 190 | 191 | 192 | mappedCollection = ROSE_utils.LocusCollection(mappedLoci,500) 193 | locusTable[0].append(bamFileName) 194 | 195 | for i in range(1,len(locusTable)): 196 | signal=0.0 197 | line = locusTable[i] 198 | lineLocus = ROSE_utils.Locus(line[1],line[2],line[3],'.') 199 | overlappingRegions = mappedCollection.getOverlap(lineLocus,sense='both') 200 | for region in overlappingRegions: 201 | signal+= signalDict[region.ID()] 202 | locusTable[i].append(signal) 203 | 204 | ROSE_utils.unParseTable(locusTable,output,'\t') 205 | 206 | 207 | 208 | #================================================================== 209 | #=========================MAIN METHOD============================== 210 | #================================================================== 211 | 212 | def main(): 213 | ''' 214 | main run call 215 | ''' 216 | debug = False 217 | 218 | 219 | from optparse import OptionParser 220 | usage = "usage: %prog [options] -g [GENOME] -i 
[INPUT_REGION_GFF] -r [RANKBY_BAM_FILE] -o [OUTPUT_FOLDER] [OPTIONAL_FLAGS]" 221 | parser = OptionParser(usage = usage) 222 | #required flags 223 | parser.add_option("-i","--i", dest="input",nargs = 1, default=None, 224 | help = "Enter a .gff or .bed file of binding sites used to make enhancers") 225 | parser.add_option("-r","--rankby", dest="rankby",nargs = 1, default=None, 226 | help = "bamfile to rank enhancer by") 227 | parser.add_option("-o","--out", dest="out",nargs = 1, default=None, 228 | help = "Enter an output folder") 229 | parser.add_option("-g","--genome", dest="genome", default=None, 230 | help = "Enter the genome build (MM9,MM8,HG18,HG19,HG38)") 231 | parser.add_option("--custom", dest="custom_genome", default=None, 232 | help = "Enter the custom genome annotation refseq.ucsc") 233 | 234 | #optional flags 235 | parser.add_option("-b","--bams", dest="bams",nargs = 1, default=None, 236 | help = "Enter a comma separated list of additional bam files to map to") 237 | parser.add_option("-c","--control", dest="control",nargs = 1, default=None, 238 | help = "bamfile to use as a control (subtracted from the rankby signal)") 239 | parser.add_option("-s","--stitch", dest="stitch",nargs = 1, default=12500, 240 | help = "Enter a max linking distance for stitching") 241 | parser.add_option("-t","--tss", dest="tss",nargs = 1, default=0, 242 | help = "Enter a distance from TSS to exclude. 0 = no TSS exclusion") 243 | 244 | 245 | 246 | 247 | #RETRIEVING FLAGS 248 | (options,args) = parser.parse_args() 249 | 250 | 251 | if not options.input or not options.rankby or not options.out or not (options.genome or options.custom_genome): 252 | print('ERROR: MISSING REQUIRED ARGUMENTS') 253 | parser.print_help() 254 | exit() 255 | 256 | #making the out folder if it doesn't exist 257 | outFolder = ROSE_utils.formatFolder(options.out,True) 258 | 259 | 260 | #figuring out folder schema 261 | gffFolder = ROSE_utils.formatFolder(outFolder+'gff/',True) 262 | mappedFolder = ROSE_utils.formatFolder(outFolder+ 'mappedGFF/',True) 263 | 264 | 265 | #GETTING INPUT FILE 266 | if options.input.split('.')[-1] == 'bed': 267 | #CONVERTING A BED TO GFF 268 | inputGFFName = options.input.split('/')[-1][0:-4] 269 | inputGFFFile = '%s%s.gff' % (gffFolder,inputGFFName) 270 | ROSE_utils.bedToGFF(options.input,inputGFFFile) 271 | elif options.input.split('.')[-1] =='gff': 272 | #COPY THE INPUT GFF TO THE GFF FOLDER 273 | inputGFFFile = options.input 274 | os.system('cp %s %s' % (inputGFFFile,gffFolder)) 275 | 276 | else: 277 | print('WARNING: INPUT FILE DOES NOT END IN .gff or .bed. ASSUMING .gff FILE FORMAT') 278 | #COPY THE INPUT GFF TO THE GFF FOLDER 279 | inputGFFFile = options.input 280 | os.system('cp %s %s' % (inputGFFFile,gffFolder)) 281 | 282 |
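    # For reference, a hypothetical end-to-end invocation that reaches this point
    # (flags per the usage string above; paths are placeholders):
    #   ROSE_main.py -g hg19 -i /data/peaks.gff -r /data/ranking.bam -c /data/control.bam -o rose_out/ -s 12500 -t 2500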
283 | #GETTING THE LIST OF BAMFILES TO PROCESS 284 | if options.control: 285 | bamFileList = [options.rankby,options.control] 286 | 287 | else: 288 | bamFileList = [options.rankby] 289 | 290 | if options.bams: 291 | bamFileList += options.bams.split(',') 292 | bamFileList = ROSE_utils.uniquify(bamFileList) 293 | #optional args 294 | 295 | #Stitch parameter 296 | stitchWindow = int(options.stitch) 297 | 298 | #tss options 299 | tssWindow = int(options.tss) 300 | if tssWindow != 0: 301 | removeTSS = True 302 | else: 303 | removeTSS = False 304 | 305 | #GETTING THE BOUND REGION FILE USED TO DEFINE ENHANCERS 306 | print(('USING %s AS THE INPUT GFF' % (inputGFFFile))) 307 | inputName = inputGFFFile.split('/')[-1].split('.')[0] 308 | 309 | #GETTING THE CORRECT ANNOT FILE 310 | cwd = os.getcwd() 311 | genomeDict = { 312 | 'HG18':'%s/annotation/hg18_refseq.ucsc' % (cwd), 313 | 'MM9': '%s/annotation/mm9_refseq.ucsc' % (cwd), 314 | 'HG19':'%s/annotation/hg19_refseq.ucsc' % (cwd), 315 | 'HG38':'%s/annotation/hg38_refseq.ucsc' % (cwd), 316 | 'MM8': '%s/annotation/mm8_refseq.ucsc' % (cwd), 317 | 'MM10':'%s/annotation/mm10_refseq.ucsc' % (cwd), 318 | } 319 | 320 | #GETTING THE GENOME 321 | if options.custom_genome: 322 | annotFile = options.custom_genome 323 | print('USING CUSTOM GENOME %s AS THE GENOME FILE' % options.custom_genome) 324 | else: 325 | genome = options.genome 326 | annotFile = genomeDict[genome.upper()] 327 | print('USING %s AS THE GENOME' % genome) 328 | 329 | 330 | #MAKING THE START DICT 331 | print('MAKING START DICT') 332 | startDict = ROSE_utils.makeStartDict(annotFile) 333 | 334 | 335 | #LOADING IN THE BOUND REGION REFERENCE COLLECTION 336 | print('LOADING IN GFF REGIONS') 337 | referenceCollection = ROSE_utils.gffToLocusCollection(inputGFFFile) 338 | 339 | 340 | #NOW STITCH REGIONS 341 | print('STITCHING REGIONS TOGETHER') 342 | stitchedCollection,debugOutput = regionStitching(inputGFFFile,stitchWindow,tssWindow,annotFile,removeTSS) 343 | 344 | 345 | 346 | #NOW MAKE A STITCHED COLLECTION GFF 347 | print('MAKING GFF FROM STITCHED COLLECTION') 348 | stitchedGFF=ROSE_utils.locusCollectionToGFF(stitchedCollection) 349 | 350 | if not removeTSS: 351 | stitchedGFFFile = '%s%s_%sKB_STITCHED.gff' % (gffFolder,inputName,stitchWindow/1000) 352 | stitchedGFFName = '%s_%sKB_STITCHED' % (inputName,stitchWindow/1000) 353 | debugOutFile = '%s%s_%sKB_STITCHED.debug' % (gffFolder,inputName,stitchWindow/1000) 354 | else: 355 | stitchedGFFFile = '%s%s_%sKB_STITCHED_TSS_DISTAL.gff' % (gffFolder,inputName,stitchWindow/1000) 356 | stitchedGFFName = '%s_%sKB_STITCHED_TSS_DISTAL' % (inputName,stitchWindow/1000) 357 | debugOutFile = '%s%s_%sKB_STITCHED_TSS_DISTAL.debug' % (gffFolder,inputName,stitchWindow/1000) 358 | 359 | #WRITING DEBUG OUTPUT TO DISK 360 | 361 | if debug: 362 | print(('WRITING DEBUG OUTPUT TO DISK AS %s' % (debugOutFile))) 363 | ROSE_utils.unParseTable(debugOutput,debugOutFile,'\t') 364 | 365 | #WRITE THE GFF TO DISK 366 | print(('WRITING STITCHED GFF TO DISK AS %s' % (stitchedGFFFile))) 367 | ROSE_utils.unParseTable(stitchedGFF,stitchedGFFFile,'\t') 368 | 369 | 370 | 371 | #SETTING UP THE OVERALL OUTPUT FILE 372 | outputFile1 = outFolder + stitchedGFFName + '_REGION_MAP.txt' 373 | 374 | print(('OUTPUT WILL BE WRITTEN TO %s' % (outputFile1))) 375 | 376 | #MAPPING TO THE NON STITCHED (ORIGINAL GFF) 377 | #MAPPING TO
THE STITCHED GFF 378 | 379 | 380 | # bin for bam mapping 381 | nBin =1 382 | 383 | #IMPORTANT 384 | #CHANGE cmd1 and cmd2 TO PARALLELIZE OUTPUT FOR BATCH SUBMISSION 385 | #e.g. if using LSF cmd1 = "bsub python bamToGFF.py -f 1 -e 200 -r -m %s -b %s -i %s -o %s" % (nBin,bamFile,stitchedGFFFile,mappedOut1) 386 | 387 | for bamFile in bamFileList: 388 | 389 | bamFileName = bamFile.split('/')[-1] 390 | 391 | #MAPPING TO THE STITCHED GFF 392 | mappedOut1 ='%s%s_%s_MAPPED.gff' % (mappedFolder,stitchedGFFName,bamFileName) 393 | #WILL TRY TO RUN AS A BACKGROUND PROCESS. BATCH SUBMIT THIS LINE TO IMPROVE SPEED 394 | cmd1 = "ROSE_bamToGFF.py -f 1 -e 200 -r -m %s -b %s -i %s -o %s &" % (nBin,bamFile,stitchedGFFFile,mappedOut1) 395 | print(cmd1) 396 | os.system(cmd1) 397 | 398 | #MAPPING TO THE ORIGINAL GFF 399 | mappedOut2 ='%s%s_%s_MAPPED.gff' % (mappedFolder,inputName,bamFileName) 400 | #WILL TRY TO RUN AS A BACKGROUND PROCESS. BATCH SUBMIT THIS LINE TO IMPROVE SPEED 401 | cmd2 = "ROSE_bamToGFF.py -f 1 -e 200 -r -m %s -b %s -i %s -o %s &" % (nBin,bamFile,inputGFFFile,mappedOut2) 402 | print(cmd2) 403 | os.system(cmd2) 404 | 405 | 406 | 407 | print('PAUSING TO MAP') 408 | time.sleep(10) 409 | 410 | #CHECK FOR MAPPING OUTPUT 411 | outputDone = False 412 | ticker = 0 413 | print('WAITING FOR MAPPING TO COMPLETE. ELAPSED TIME (MIN):') 414 | while not outputDone: 415 | 416 | ''' 417 | check every 5 minutes for completed output 418 | ''' 419 | outputDone = True 420 | if ticker%6 == 0: 421 | print((ticker*5)) 422 | ticker +=1 423 | #CHANGE THIS PARAMETER TO ALLOW MORE TIME TO MAP 424 | if ticker == 144: 425 | print('ERROR: OPERATION TIME OUT. MAPPING OUTPUT NOT DETECTED') 426 | exit() 427 | break 428 | for bamFile in bamFileList: 429 | 430 | #GET THE MAPPED OUTPUT NAMES HERE FROM MAPPING OF EACH BAMFILE 431 | bamFileName = bamFile.split('/')[-1] 432 | mappedOut1 ='%s%s_%s_MAPPED.gff' % (mappedFolder,stitchedGFFName,bamFileName) 433 | 434 | try: 435 | mapFile = open(mappedOut1,'r') 436 | mapFile.close() 437 | except IOError: 438 | outputDone = False 439 | 440 | mappedOut2 ='%s%s_%s_MAPPED.gff' % (mappedFolder,inputName,bamFileName) 441 | 442 | try: 443 | mapFile = open(mappedOut2,'r') 444 | mapFile.close() 445 | except IOError: 446 | outputDone = False 447 | if outputDone == True: 448 | break 449 | time.sleep(300) 450 | print(('MAPPING TOOK %s MINUTES' % (ticker*5))) 451 | 452 | print('BAM MAPPING COMPLETED NOW MAPPING DATA TO REGIONS') 453 | #CALCULATE DENSITY BY REGION 454 | mapCollection(stitchedCollection,referenceCollection,bamFileList,mappedFolder,outputFile1,refName = stitchedGFFName) 455 | 456 | 457 | time.sleep(10) 458 | 459 | print('CALLING AND PLOTTING SUPER-STITCHED PEAKS') 460 | 461 | 462 | if options.control: 463 | rankbyName = options.rankby.split('/')[-1] 464 | controlName = options.control.split('/')[-1] 465 | cmd = 'ROSE_callSuper.R %s %s %s %s' % (outFolder,outputFile1,inputName,controlName) 466 | 467 | else: 468 | rankbyName = options.rankby.split('/')[-1] 469 | controlName = 'NONE' 470 | cmd = 'ROSE_callSuper.R %s %s %s %s' % (outFolder,outputFile1,inputName,controlName) 471 | print(cmd) 472 | os.system(cmd) 473 | 474 | #calling the gene mapper 475 | time.sleep(60) 476 | superTableFile = "%s/%s_SuperStitched.table.txt" % (outFolder,inputName) 477 | allTableFile = "%s/%s_AllStitched.table.txt" % (outFolder,inputName) 478 | 479 | suffixScript = '' 480 | if options.control: suffixScript = '-c' 481 | if options.custom_genome: 482 | cmd1 = "ROSE_geneMapper.py --custom %s -i %s -r TRUE %s" % 
(options.custom_genome,superTableFile,suffixScript) 483 | cmd2 = "ROSE_geneMapper.py --custom %s -i %s -r TRUE %s " % (options.custom_genome,allTableFile,suffixScript) 484 | else: 485 | cmd1 = "ROSE_geneMapper.py -g %s -i %s -r TRUE %s" % (genome,superTableFile,suffixScript) 486 | cmd2 = "ROSE_geneMapper.py -g %s -i %s -r TRUE %s" % (genome,allTableFile,suffixScript) 487 | 488 | #gene mapper for super-stitched peaks 489 | print(cmd1) 490 | os.system(cmd1) 491 | 492 | #gene mapper for stitched peaks 493 | print(cmd2) 494 | os.system(cmd2) 495 | 496 | if __name__ == "__main__": 497 | main() 498 | -------------------------------------------------------------------------------- /lib/ROSE_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import re 4 | 5 | from string import * 6 | 7 | import subprocess 8 | import datetime 9 | 10 | from collections import defaultdict 11 | 12 | #SET OF UTILITY FUNCTIONS FOR mapEnhancerFromFactor.py 13 | 14 | #================================================================== 15 | #==========================I/O FUNCTIONS=========================== 16 | #================================================================== 17 | 18 | #unParseTable 4/14/08 19 | #takes in a table generated by parseTable and writes it to an output file 20 | #takes as parameters (table, output, sep), where sep is how the file is delimited 21 | #example call unParseTable(table, 'table.txt', '\t') for a tab del file 22 | 23 | def unParseTable(table, output, sep): 24 | fh_out = open(output,'w') 25 | if len(sep) == 0: 26 | for i in table: 27 | fh_out.write(str(i)) 28 | fh_out.write('\n') 29 | else: 30 | for line in table: 31 | line = [str(x) for x in line] 32 | line = sep.join(line) 33 | 34 | fh_out.write(line) 35 | fh_out.write('\n') 36 | 37 | fh_out.close() 38 | 39 | #parseTable 4/14/08 40 | #takes in a table where columns are separated by a given symbol and outputs 41 | #a nested list such that list[row][col] 42 | #example call: 43 | #table = parseTable('file.txt','\t') 44 | def parseTable(fn, sep, header = False,excel = False): 45 | fh = open(fn) 46 | lines = fh.readlines() 47 | fh.close() 48 | if excel: 49 | lines = lines[0].split('\r') 50 | if lines[0].count('\r') > 0: 51 | lines = lines[0].split('\r') 52 | table = [] 53 | if header == True: 54 | lines =lines[1:] 55 | for i in lines: 56 | table.append(i[:-1].split(sep)) 57 | 58 | return table 59 | 60 | 61 | def bedToGFF(bed,output=''): 62 | 63 | ''' 64 | turns a bed into a gff file 65 | ''' 66 | if type(bed) == str: 67 | bed = parseTable(bed,'\t') 68 | 69 | gff = [] 70 | for line in bed: 71 | gffLine = [line[0],line[3],'',line[1],line[2],line[4],'.','',line[3]] 72 | gff.append(gffLine) 73 | 74 | if len(output) > 0: 75 | unParseTable(gff,output,'\t') 76 | else: 77 | return gff 78 | 79 | 80 | #100912 81 | #gffToBed 82 | 83 | def gffToBed(gff,output= ''): 84 | ''' 85 | turns a gff to a bed file 86 | ''' 87 | bed = [] 88 | for line in gff: 89 | newLine = [line[0],line[3],line[4],line[1],0,line[6]] 90 | bed.append(newLine) 91 | if len(output) == 0: 92 | return bed 93 | else: 94 | unParseTable(bed,output,'\t') 95 | 96 | def formatFolder(folderName,create=False): 97 | 98 | ''' 99 | makes sure a folder exists and if not makes it 100 | returns a bool for folder 101 | ''' 102 | 103 | if folderName[-1] != '/': 104 | folderName +='/' 105 | 106 | try: 107 | foo = os.listdir(folderName) 108 | return folderName 109 | except OSError: 110 | print(('folder %s does not exist' % (folderName))) 
111 | if create: 112 | os.system('mkdir %s' % (folderName)) 113 | return folderName 114 | else: 115 | 116 | return False 117 | 118 | 119 | 120 | #================================================================== 121 | #===================ANNOTATION FUNCTIONS=========================== 122 | #================================================================== 123 | 124 | 125 | def makeStartDict(annotFile,geneList = []): 126 | ''' 127 | makes a dictionary keyed by refseq ID that contains information about 128 | chrom/start/stop/strand/common name 129 | ''' 130 | 131 | if type(geneList) == str: 132 | geneList = parseTable(geneList,'\t') 133 | geneList = [line[0] for line in geneList] 134 | 135 | if annotFile.upper().count('REFSEQ') > 0: 136 | refseqTable,refseqDict = importRefseq(annotFile) 137 | if len(geneList) == 0: 138 | geneList = list(refseqDict.keys()) 139 | startDict = {} 140 | for gene in geneList: 141 | if (gene in refseqDict) == False: 142 | continue 143 | startDict[gene]={} 144 | startDict[gene]['sense'] = refseqTable[refseqDict[gene][0]][3] 145 | startDict[gene]['chr'] = refseqTable[refseqDict[gene][0]][2] 146 | startDict[gene]['start'] = getTSSs([gene],refseqTable,refseqDict) 147 | if startDict[gene]['sense'] == '+': 148 | startDict[gene]['end'] =[int(refseqTable[refseqDict[gene][0]][5])] 149 | else: 150 | startDict[gene]['end'] = [int(refseqTable[refseqDict[gene][0]][4])] 151 | startDict[gene]['name'] = refseqTable[refseqDict[gene][0]][12] 152 | return startDict 153 | 154 | 155 | #generic function to get the TSS of any gene 156 | def getTSSs(geneList,refseqTable,refseqDict): 157 | #refseqTable,refseqDict = importRefseq(refseqFile) 158 | if len(geneList) == 0: 159 | refseq = refseqTable 160 | else: 161 | refseq = refseqFromKey(geneList,refseqDict,refseqTable) 162 | TSS = [] 163 | for line in refseq: 164 | if line[3] == '+': 165 | TSS.append(line[4]) 166 | if line[3] == '-': 167 | TSS.append(line[5]) 168 | TSS = list(map(int,TSS)) 169 | 170 | return TSS 171 | 172 | 173 | #12/29/08 174 | #refseqFromKey(refseqKeyList,refseqDict,refseqTable) 175 | #function that grabs refseq lines from refseq IDs 176 | def refseqFromKey(refseqKeyList,refseqDict,refseqTable): 177 | typeRefseq = [] 178 | for name in refseqKeyList: 179 | if name in refseqDict: 180 | typeRefseq.append(refseqTable[refseqDict[name][0]]) 181 | return typeRefseq 182 | 183 | 184 | 185 | #10/13/08 186 | #importRefseq 187 | #takes in a refseq table and makes a refseq table and a refseq dictionary for keying the table 188 | 189 | def importRefseq(refseqFile, returnMultiples = False): 190 | 191 | ''' 192 | opens up a refseq file downloaded by UCSC 193 | ''' 194 | refseqTable = parseTable(refseqFile,'\t') 195 | refseqDict = {} 196 | ticker = 1 197 | for line in refseqTable[1:]: 198 | if line[1] in refseqDict: 199 | refseqDict[line[1]].append(ticker) 200 | else: 201 | refseqDict[line[1]] = [ticker] 202 | ticker = ticker + 1 203 | 204 | multiples = [] 205 | for i in refseqDict: 206 | if len(refseqDict[i]) > 1: 207 | multiples.append(i) 208 | 209 | if returnMultiples == True: 210 | return refseqTable,refseqDict,multiples 211 | else: 212 | return refseqTable,refseqDict 213 | 214 | 215 | #================================================================== 216 | #========================LOCUS INSTANCE============================ 217 | #================================================================== 218 | 219 | #Locus and LocusCollection instances courtesy of Graham Ruby 220 | 221 |
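# A minimal usage sketch of the two classes below (hypothetical coordinates):
#   a = Locus('chr1',100,200,'+','peakA')
#   b = Locus('chr1',150,300,'+','peakB')
#   a.overlaps(b)       # True: same chromosome, compatible sense, shared coords
#   a.contains(b)       # False: b extends past the end of a
#   collection = LocusCollection([a,b],50)
#   collection.getOverlap(Locus('chr1',180,260,'+'),'both')  # -> [a, b] (order not guaranteed)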
215 | #==================================================================
216 | #========================LOCUS INSTANCE============================
217 | #==================================================================
218 | 
219 | #Locus and LocusCollection instances courtesy of Graham Ruby
220 | 
221 | 
222 | class Locus:
223 |     # this may save some space by reducing the number of chromosome strings
224 |     # that are associated with Locus instances (see __init__).
225 |     __chrDict = dict()
226 |     __senseDict = {'+':'+', '-':'-', '.':'.'}
227 |     # chr = chromosome name (string)
228 |     # sense = '+' or '-' (or '.' for an ambidextrous locus)
229 |     # start,end = ints of the start and end coords of the locus
230 |     # end coord is the coord of the last nucleotide.
231 |     def __init__(self,chr,start,end,sense,ID=''):
232 |         coords = [int(start),int(end)]
233 |         coords.sort()
234 |         # this method for assigning chromosome should help avoid storage of
235 |         # redundant strings.
236 |         if not(chr in self.__chrDict): self.__chrDict[chr] = chr
237 |         self._chr = self.__chrDict[chr]
238 |         self._sense = self.__senseDict[sense]
239 |         self._start = int(coords[0])
240 |         self._end = int(coords[1])
241 |         self._ID = ID
242 |     def ID(self): return self._ID
243 |     def chr(self): return self._chr
244 |     def start(self): return self._start  ## returns the smallest coordinate
245 |     def end(self): return self._end  ## returns the biggest coordinate
246 |     def len(self): return self._end - self._start + 1
247 |     def getAntisenseLocus(self):
248 |         if self._sense=='.': return self
249 |         else:
250 |             switch = {'+':'-', '-':'+'}
251 |             return Locus(self._chr,self._start,self._end,switch[self._sense])
252 |     def coords(self): return [self._start,self._end]  ## returns a sorted list of the coordinates
253 |     def sense(self): return self._sense
254 |     # returns boolean; True if two loci share any coordinates in common
255 |     def overlaps(self,otherLocus):
256 |         if self.chr()!=otherLocus.chr(): return False
257 |         elif not(self._sense=='.' or \
258 |                  otherLocus.sense()=='.' or \
259 |                  self.sense()==otherLocus.sense()): return False
260 |         elif self.start() > otherLocus.end() or otherLocus.start() > self.end(): return False
261 |         else: return True
262 | 
263 |     # returns boolean; True if all the nucleotides of the given locus overlap
264 |     # with the self locus
265 |     def contains(self,otherLocus):
266 |         if self.chr()!=otherLocus.chr(): return False
267 |         elif not(self._sense=='.' or \
268 |                  otherLocus.sense()=='.' or \
269 |                  self.sense()==otherLocus.sense()): return False
270 |         elif self.start() > otherLocus.start() or otherLocus.end() > self.end(): return False
271 |         else: return True
272 | 
273 |     # same as overlaps, but considers the opposite strand
274 |     def overlapsAntisense(self,otherLocus):
275 |         return self.getAntisenseLocus().overlaps(otherLocus)
276 |     # same as contains, but considers the opposite strand
277 |     def containsAntisense(self,otherLocus):
278 |         return self.getAntisenseLocus().contains(otherLocus)
279 |     def __hash__(self): return self._start + self._end
280 |     def __eq__(self,other):
281 |         if self.__class__ != other.__class__: return False
282 |         if self.chr()!=other.chr(): return False
283 |         if self.start()!=other.start(): return False
284 |         if self.end()!=other.end(): return False
285 |         if self.sense()!=other.sense(): return False
286 |         return True
287 |     def __ne__(self,other): return not(self.__eq__(other))
288 |     def __str__(self): return self.chr()+'('+self.sense()+'):'+'-'.join(map(str,self.coords()))
289 |     def checkRep(self):
290 |         pass
291 | 
292 | 
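# --- Illustrative sketch (not part of the original ROSE_utils.py) ---
# Locus semantics under the conventions above: coordinates are inclusive
# (end is the last nucleotide), and '.' matches either strand.
#
#   a = Locus('chr1',100,200,'+','regionA')
#   b = Locus('chr1',150,250,'+','regionB')
#   a.len()                            # 101 (200 - 100 + 1)
#   a.overlaps(b)                      # True: same chr/strand, coords intersect
#   a.contains(b)                      # False: b extends past a.end()
#   a.overlaps(b.getAntisenseLocus())  # False: '+' vs '-'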
293 | class LocusCollection:
294 |     def __init__(self,loci,windowSize):
295 |         ### top-level keys are chr, then strand, no space
296 |         self.__chrToCoordToLoci = dict()
297 |         self.__loci = dict()
298 |         self.__winSize = windowSize
299 |         for lcs in loci: self.__addLocus(lcs)
300 | 
301 |     def __addLocus(self,lcs):
302 |         if not(lcs in self.__loci):
303 |             self.__loci[lcs] = None
304 |             if lcs.sense()=='.': chrKeyList = [lcs.chr()+'+', lcs.chr()+'-']
305 |             else: chrKeyList = [lcs.chr()+lcs.sense()]
306 |             for chrKey in chrKeyList:
307 |                 if not(chrKey in self.__chrToCoordToLoci): self.__chrToCoordToLoci[chrKey] = dict()
308 |                 for n in self.__getKeyRange(lcs):
309 |                     if not(n in self.__chrToCoordToLoci[chrKey]): self.__chrToCoordToLoci[chrKey][n] = []
310 |                     self.__chrToCoordToLoci[chrKey][n].append(lcs)
311 | 
312 |     def __getKeyRange(self,locus):
313 |         start = locus.start() // self.__winSize
314 |         end = locus.end() // self.__winSize + 1  ## add 1 because of the range
315 |         return range(start, end)
316 | 
317 |     def __len__(self): return len(self.__loci)
318 | 
319 |     def append(self,new): self.__addLocus(new)
320 |     def extend(self,newList):
321 |         for lcs in newList: self.__addLocus(lcs)
322 |     def hasLocus(self,locus):
323 |         return locus in self.__loci
324 |     def remove(self,old):
325 |         if not(old in self.__loci): raise ValueError("requested locus isn't in collection")
326 |         del self.__loci[old]
327 |         if old.sense()=='.': senseList = ['+','-']
328 |         else: senseList = [old.sense()]
329 |         for k in self.__getKeyRange(old):
330 |             for sense in senseList:
331 |                 self.__chrToCoordToLoci[old.chr()+sense][k].remove(old)
332 | 
333 |     def getWindowSize(self): return self.__winSize
334 |     def getLoci(self): return list(self.__loci.keys())
335 |     def getChrList(self):
336 |         # remove the strand info from the chromosome keys and make
337 |         # them non-redundant.
338 |         tempKeys = dict()
339 |         for k in list(self.__chrToCoordToLoci.keys()): tempKeys[k[:-1]] = None
340 |         return list(tempKeys.keys())
341 | 
342 |     def __subsetHelper(self,locus,sense):
343 |         sense = sense.lower()
344 |         if ['sense','antisense','both'].count(sense)!=1:
345 |             raise ValueError("sense command invalid: '"+sense+"'.")
346 |         matches = dict()
347 |         senses = ['+','-']
348 |         if locus.sense()=='.' or sense=='both': lamb = lambda s: True
349 |         elif sense=='sense': lamb = lambda s: s==locus.sense()
350 |         elif sense=='antisense': lamb = lambda s: s!=locus.sense()
351 |         else: raise ValueError("sense value was inappropriate: '"+sense+"'.")
352 |         for s in filter(lamb, senses):
353 |             chrKey = locus.chr()+s
354 |             if chrKey in self.__chrToCoordToLoci:
355 |                 for n in self.__getKeyRange(locus):
356 |                     if n in self.__chrToCoordToLoci[chrKey]:
357 |                         for lcs in self.__chrToCoordToLoci[chrKey][n]:
358 |                             matches[lcs] = None
359 |         return list(matches.keys())
360 | 
361 |     # sense can be 'sense' (default), 'antisense', or 'both'
362 |     # returns all members of the collection that overlap the locus
363 |     def getOverlap(self,locus,sense='sense'):
364 |         matches = self.__subsetHelper(locus,sense)
365 |         ### now, get rid of the ones that don't really overlap
366 |         realMatches = dict()
367 |         if sense=='sense' or sense=='both':
368 |             for i in [lcs for lcs in matches if lcs.overlaps(locus)]:
369 |                 realMatches[i] = None
370 |         if sense=='antisense' or sense=='both':
371 |             for i in [lcs for lcs in matches if lcs.overlapsAntisense(locus)]:
372 |                 realMatches[i] = None
373 |         return list(realMatches.keys())
374 | 
375 |     # sense can be 'sense' (default), 'antisense', or 'both'
376 |     # returns all members of the collection that are contained by the locus
377 |     def getContained(self,locus,sense='sense'):
378 |         matches = self.__subsetHelper(locus,sense)
379 |         ### now, get rid of the ones that aren't really contained
380 |         realMatches = dict()
381 |         if sense=='sense' or sense=='both':
382 |             for i in [lcs for lcs in matches if locus.contains(lcs)]:
383 |                 realMatches[i] = None
384 |         if sense=='antisense' or sense=='both':
385 |             for i in [lcs for lcs in matches if locus.containsAntisense(lcs)]:
386 |                 realMatches[i] = None
387 |         return list(realMatches.keys())
388 | 
389 |     # sense can be 'sense' (default), 'antisense', or 'both'
390 |     # returns all members of the collection that contain the locus
391 |     def getContainers(self,locus,sense='sense'):
392 |         matches = self.__subsetHelper(locus,sense)
393 |         ### now, get rid of the ones that don't really contain the locus
394 |         realMatches = dict()
395 |         if sense=='sense' or sense=='both':
396 |             for i in [lcs for lcs in matches if lcs.contains(locus)]:
397 |                 realMatches[i] = None
398 |         if sense=='antisense' or sense=='both':
399 |             for i in [lcs for lcs in matches if lcs.containsAntisense(locus)]:
400 |                 realMatches[i] = None
401 |         return list(realMatches.keys())
402 | 
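# --- Illustrative sketch (not part of the original ROSE_utils.py) ---
# The collection hashes each locus into windowSize-wide buckets keyed by
# chromosome+strand (with windowSize=500, a locus spanning 900-1600 lands in
# buckets 1,2,3), so queries only scan nearby buckets rather than every
# locus. Continuing the sketch above:
#
#   collection = LocusCollection([a,b],500)
#   query = Locus('chr1',180,300,'+')
#   collection.getOverlap(query,'sense')    # [a, b]: both cross 180-300 on '+'
#   collection.getContained(query,'sense')  # []: neither lies entirely inside 180-300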
403 |     def stitchCollection(self,stitchWindow=1,sense='both'):
404 | 
405 |         '''
406 |         reduces the collection by stitching together overlapping loci
407 |         returns a new collection
408 |         '''
409 | 
410 |         #stitchWindow defaults to 1
411 |         #this helps collect directly adjacent loci
412 | 
413 | 
414 | 
415 |         locusList = self.getLoci()
416 |         oldCollection = LocusCollection(locusList,500)
417 | 
418 |         stitchedCollection = LocusCollection([],500)
419 | 
420 |         for locus in locusList:
421 |             #print(locus.coords())
422 |             if oldCollection.hasLocus(locus):
423 |                 oldCollection.remove(locus)
424 |                 overlappingLoci = oldCollection.getOverlap(Locus(locus.chr(),locus.start()-stitchWindow,locus.end()+stitchWindow,locus.sense(),locus.ID()),sense)
425 | 
426 |                 stitchTicker = 1
427 |                 while len(overlappingLoci) > 0:
428 |                     stitchTicker+=len(overlappingLoci)
429 |                     overlapCoords = locus.coords()
430 | 
431 | 
432 |                     for overlappingLocus in overlappingLoci:
433 |                         overlapCoords+=overlappingLocus.coords()
434 |                         oldCollection.remove(overlappingLocus)
435 |                     if sense == 'both':
436 |                         locus = Locus(locus.chr(),min(overlapCoords),max(overlapCoords),'.',locus.ID())
437 |                     else:
438 |                         locus = Locus(locus.chr(),min(overlapCoords),max(overlapCoords),locus.sense(),locus.ID())
439 |                     overlappingLoci = oldCollection.getOverlap(Locus(locus.chr(),locus.start()-stitchWindow,locus.end()+stitchWindow,locus.sense()),sense)
440 |                 locus._ID = '%s_%s_lociStitched' % (stitchTicker,locus.ID())
441 | 
442 |                 stitchedCollection.append(locus)
443 | 
444 |             else:
445 |                 continue
446 |         return stitchedCollection
447 | 
448 | 
449 | 
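# --- Illustrative sketch (not part of the original ROSE_utils.py) ---
# Stitching merges loci whose stitchWindow-expanded coordinates touch; the
# stitched ID records how many constituents were merged, which is where the
# STITCHED.gff names described in the README come from:
#
#   col = LocusCollection([Locus('chr1',1,100,'+','e1'),
#                          Locus('chr1',150,300,'+','e2')],500)
#   merged = col.stitchCollection(stitchWindow=100).getLoci()
#   str(merged[0])   # 'chr1(.):1-300'  (sense '.' because sense='both')
#   merged[0].ID()   # '2_e1_lociStitched'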
450 | #==================================================================
451 | #========================LOCUS FUNCTIONS===========================
452 | #==================================================================
453 | #06/11/09
454 | #turns a locusCollection into a gff
455 | #does not write to disk though
456 | def locusCollectionToGFF(locusCollection):
457 |     lociList = locusCollection.getLoci()
458 |     gff = []
459 |     for locus in lociList:
460 |         newLine = [locus.chr(),locus.ID(),'',locus.coords()[0],locus.coords()[1],'',locus.sense(),'',locus.ID()]
461 |         gff.append(newLine)
462 |     return gff
463 | 
464 | 
465 | def gffToLocusCollection(gff,window = 500):
466 | 
467 |     '''
468 |     opens up a gff file and turns it into a LocusCollection instance
469 |     '''
470 | 
471 |     lociList = []
472 |     if type(gff) == str:
473 |         gff = parseTable(gff,'\t')
474 | 
475 |     for line in gff:
476 |         #use column 2 of the .gff (0-based index 1) as the locus ID; if that is empty use column 9
477 |         if len(line[1]) > 0:
478 |             name = line[1]
479 |         elif len(line[8]) > 0:
480 |             name = line[8]
481 |         else:
482 |             name = '%s:%s:%s-%s' % (line[0],line[6],line[3],line[4])
483 | 
484 |         lociList.append(Locus(line[0],line[3],line[4],line[6],name))
485 |     return LocusCollection(lociList,window)
486 | 
487 | 
488 | #makeTranscriptCollection
489 | #04/07/09
490 | #makes a LocusCollection w/ each transcript as a locus
491 | #bob = makeTranscriptCollection('/Users/chazlin/genomes/mm8/mm8refseq.txt')
492 | def makeTranscriptCollection(annotFile,upSearch,downSearch,window = 500,geneList = []):
493 |     '''
494 |     makes a LocusCollection w/ each transcript as a locus
495 |     takes in a refseqfile
496 |     '''
497 | 
498 |     if annotFile.upper().count('REFSEQ') > 0:
499 |         refseqTable,refseqDict = importRefseq(annotFile)
500 |         locusList = []
501 |         ticker = 0
502 |         if len(geneList) == 0:
503 |             geneList = list(refseqDict.keys())
504 |         for line in refseqTable[1:]:
505 |             if line[1] in geneList:
506 |                 if line[3] == '-':
507 |                     locus = Locus(line[2],int(line[4])-downSearch,int(line[5])+upSearch,line[3],line[1])
508 |                 else:
509 |                     locus = Locus(line[2],int(line[4])-upSearch,int(line[5])+downSearch,line[3],line[1])
510 |                 locusList.append(locus)
511 |                 ticker = ticker + 1
512 |                 if ticker%1000 == 0:
513 |                     print(ticker)
514 | 
515 | 
516 |     transCollection = LocusCollection(locusList,window)
517 | 
518 |     return transCollection
519 | 
520 | 
521 | 
522 | def makeTSSLocus(gene,startDict,upstream,downstream):
523 |     '''
524 |     given a startDict, make a locus for any gene's TSS w/ upstream and downstream windows
525 |     '''
526 | 
527 |     start = startDict[gene]['start'][0]
528 |     if startDict[gene]['sense'] == '-':
529 |         return Locus(startDict[gene]['chr'],start-downstream,start+upstream,'-',gene)
530 |     else:
531 |         return Locus(startDict[gene]['chr'],start-upstream,start+downstream,'+',gene)
532 | 
533 | 
534 | #06/11/09
535 | #takes a locus and expands it by a fixed upstream/downstream amount; spits out the new larger locus
536 | def makeSearchLocus(locus,upSearch,downSearch):
537 |     if locus.sense() == '-':
538 |         searchLocus = Locus(locus.chr(),locus.start()-downSearch,locus.end()+upSearch,locus.sense(),locus.ID())
539 |     else:
540 |         searchLocus = Locus(locus.chr(),locus.start()-upSearch,locus.end()+downSearch,locus.sense(),locus.ID())
541 |     return searchLocus
542 | 
543 | 
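# --- Illustrative sketch (not part of the original ROSE_utils.py) ---
# Roughly how ROSE_main.py chains these locus functions, at the default
# 12.5kb stitching distance; paths are hypothetical, and unParseTable is
# assumed to be the table writer defined earlier in this file:
#
#   enhancers = gffToLocusCollection('./data/constituents.gff')
#   stitched = enhancers.stitchCollection(stitchWindow=12500,sense='both')
#   stitchedGFF = locusCollectionToGFF(stitched)
#   unParseTable(stitchedGFF,'./out/constituents_STITCHED.gff','\t')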
544 | #==================================================================
545 | #==========================BAM CLASS===============================
546 | #==================================================================
547 | 
548 | #11/11/10
549 | #makes a new class Bam for dealing with bam files and integrating them into the SolexaRun class
550 | 
551 | 
552 | def checkChrStatus(bamFile):
553 |     command = 'samtools view %s | head -n 1' % (bamFile)
554 |     #print "TESTING"
555 |     #print command
556 |     stats = subprocess.Popen(command,stdin = subprocess.PIPE,stderr = subprocess.PIPE,stdout = subprocess.PIPE,shell = True)
557 |     statLines = stats.stdout.readlines()
558 |     stats.stdout.close()
559 |     chrPattern = re.compile('chr')
560 |     for line in statLines:
561 |         #print line
562 |         line = line.decode("utf-8")
563 |         sline = line.split("\t")
564 |         #print sline[2]
565 |         if re.search(chrPattern, sline[2]):
566 |             return 1
567 |         else:
568 |             return 0
569 | 
570 | def convertBitwiseFlag(flag):
571 |     if int(flag) & 16:
572 |         return "-"
573 |     else:
574 |         return "+"
575 | 
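# --- Illustrative sketch (not part of the original ROSE_utils.py) ---
# Bit 0x10 (decimal 16) of a SAM flag marks a reverse-strand alignment, so
# a single bitwise AND recovers the strand for any flag value:
#
#   convertBitwiseFlag('0')    # '+'  (forward, unpaired)
#   convertBitwiseFlag('16')   # '-'  (reverse)
#   convertBitwiseFlag('99')   # '+'  (0x10 not set; the mate is the reverse read)
#   convertBitwiseFlag('147')  # '-'  (0x10 set)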
576 | class Bam:
577 |     '''A class for a sorted and indexed bam file that allows easy analysis of reads'''
578 |     def __init__(self,bamFile):
579 |         self._bam = bamFile
580 | 
581 |     def getTotalReads(self,readType = 'mapped'):
582 |         command = 'samtools flagstat %s' % (self._bam)
583 |         stats = subprocess.Popen(command,stdin = subprocess.PIPE,stderr = subprocess.PIPE,stdout = subprocess.PIPE,shell = True)
584 |         statLines = stats.stdout.readlines()
585 |         stats.stdout.close()
586 |         if readType == 'mapped':
587 |             for line in statLines:
588 |                 line = line.decode("utf-8")
589 |                 if line.count('mapped (') == 1:
590 | 
591 |                     return int(line.split(' ')[0])
592 |         if readType == 'total':
593 |             return int(statLines[0].decode("utf-8").split(' ')[0])
594 | 
595 |     def convertBitwiseFlag(self,flag):
596 |         if int(flag) & 16:
597 |             return "-"
598 |         else:
599 |             return "+"
600 | 
601 |     def getRawReads(self,locus,sense,unique = False,includeJxnReads = False,printCommand = False):
602 |         '''
603 |         gets raw reads from the bam using samtools view.
604 |         can enforce uniqueness and strandedness
605 |         '''
606 |         locusLine = locus.chr()+':'+str(locus.start())+'-'+str(locus.end())
607 | 
608 |         command = 'samtools view %s %s' % (self._bam,locusLine)
609 |         if printCommand:
610 |             print(command)
611 |         getReads = subprocess.Popen(command,stdin = subprocess.PIPE,stderr = subprocess.PIPE,stdout = subprocess.PIPE,shell = True)
612 |         reads = getReads.communicate()
613 |         reads = reads[0].decode("utf-8")
614 |         reads = reads.split('\n')[:-1]
615 |         reads = [read.split('\t') for read in reads]
616 |         if not includeJxnReads:
617 |             reads = [x for x in reads if x[5].count('N') < 1]
618 | 
619 |         #convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-'}
620 |         convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-','256':'+','272':'-','99':'+','147':'-'}
621 | 
622 | 
623 |         #BJA added 256 and 272, which correspond to 0 and 16 for multi-mapped reads respectively:
624 |         #http://onetipperday.blogspot.com/2012/04/understand-flag-code-of-sam-format.html
625 |         #convert = string.maketrans('160','--+')
626 |         keptReads = []
627 |         seqDict = defaultdict(int)
628 |         if sense == '-':
629 |             strand = ['+','-']
630 |             strand.remove(locus.sense())
631 |             strand = strand[0]
632 |         else:
633 |             strand = locus.sense()
634 |         for read in reads:
635 |             #readStrand = read[1].translate(convert)[0]
636 |             #print read[1], read[0]
637 |             #readStrand = convertDict[read[1]]
638 |             readStrand = convertBitwiseFlag(read[1])
639 | 
640 |             if sense == 'both' or sense == '.' or readStrand == strand:
641 | 
642 |                 if unique and seqDict[read[9]] == 0:
643 |                     keptReads.append(read)
644 |                 elif not unique:
645 |                     keptReads.append(read)
646 |             seqDict[read[9]]+=1
647 | 
648 |         return keptReads
649 | 
650 |     def readsToLoci(self,reads,IDtag = 'sequence,seqID,none'):
651 |         '''
652 |         takes raw read lines from the bam and converts them into loci
653 |         '''
654 |         loci = []
655 |         ID = ''
656 |         if IDtag == 'sequence,seqID,none':
657 |             print('please specify one of the three options: sequence, seqID, none')
658 |             return
659 |         #convert = string.maketrans('160','--+')
660 |         #convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-'}
661 |         #convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-','256':'+','272':'-'}
662 | 
663 |         #BJA added 256 and 272, which correspond to 0 and 16 for multi-mapped reads respectively:
664 |         #http://onetipperday.blogspot.com/2012/04/understand-flag-code-of-sam-format.html
665 |         #convert = string.maketrans('160','--+')
666 |         numPattern = re.compile(r'\d*')
667 |         for read in reads:
668 |             chrom = read[2]
669 |             #strand = read[1].translate(convert)[0]
670 |             #strand = convertDict[read[1]]
671 |             strand = convertBitwiseFlag(read[1])
672 |             if IDtag == 'sequence':
673 |                 ID = read[9]
674 |             elif IDtag == 'seqID':
675 |                 ID = read[0]
676 |             else:
677 |                 ID = ''
678 | 
679 |             length = len(read[9])
680 |             start = int(read[3])
681 |             if read[5].count('N') == 1:
682 |                 #this awful oneliner first finds all of the numbers in the read string
683 |                 #then it filters out the '' and converts them to integers
684 |                 #only works for reads that span one junction
685 | 
686 |                 [first,gap,second] = [int(x) for x in [x for x in re.findall(numPattern,read[5]) if len(x) > 0]][0:3]
687 |                 if IDtag == 'sequence':
688 |                     loci.append(Locus(chrom,start,start+first,strand,ID[0:first]))
689 |                     loci.append(Locus(chrom,start+first+gap,start+first+gap+second,strand,ID[first:]))
690 |                 else:
691 |                     loci.append(Locus(chrom,start,start+first,strand,ID))
692 |                     loci.append(Locus(chrom,start+first+gap,start+first+gap+second,strand,ID))
693 |             elif read[5].count('N') > 1:
694 |                 continue
695 |             else:
696 |                 loci.append(Locus(chrom,start,start+length,strand,ID))
697 |         return loci
698 | 
699 |     def getReadsLocus(self,locus,sense = 'both',unique = True,IDtag = 'sequence,seqID,none',includeJxnReads = False):
700 |         '''
701 |         gets all of the reads for a given locus
702 |         '''
703 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
704 | 
705 |         loci = self.readsToLoci(reads,IDtag)
706 | 
707 |         return loci
708 | 
709 |     def getReadSequences(self,locus,sense = 'both',unique = True,includeJxnReads = False):
710 | 
711 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
712 | 
713 |         return [read[9] for read in reads]
714 | 
715 |     def getReadStarts(self,locus,sense = 'both',unique = False,includeJxnReads = False):
716 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
717 | 
718 |         return [int(read[3]) for read in reads]
719 | 
720 | 
721 |     def getReadCount(self,locus,sense = 'both',unique = True,includeJxnReads = False):
722 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
723 | 
724 |         return len(reads)
725 | 
726 | 
727 | 
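# --- Illustrative sketch (not part of the original ROSE_utils.py) ---
# Minimal use of the Bam wrapper, assuming samtools is on the PATH and the
# .bam file (hypothetical name) is sorted, indexed, and uses 'chr' names:
#
#   bam = Bam('./data/ranking.bam')
#   total = bam.getTotalReads('mapped')           # parsed from samtools flagstat
#   region = Locus('chr1',1000000,1050000,'.','query')
#   bam.getReadCount(region,'both',unique=True)   # reads overlapping the region
#   bam.getReadStarts(region,'both')              # 1-based leftmost positions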
728 | #==================================================================
729 | #========================MISC FUNCTIONS============================
730 | #==================================================================
731 | 
732 | 
733 | 
734 | 
735 | #uniquify function
736 | #by Peter Bengtsson
737 | #Used under a creative commons license
738 | #sourced from here: http://www.peterbe.com/plog/uniqifiers-benchmark
739 | 
740 | def uniquify(seq, idfun=None):
741 |     # order preserving
742 |     if idfun is None:
743 |         def idfun(x): return x
744 |     seen = {}
745 |     result = []
746 |     for item in seq:
747 |         marker = idfun(item)
748 |         # in old Python versions:
749 |         # if seen.has_key(marker)
750 |         # but in new ones:
751 |         if marker in seen: continue
752 |         seen[marker] = 1
753 |         result.append(item)
754 |     return result
755 | 
756 | 
757 | #082009
758 | #taken from http://code.activestate.com/recipes/491268/
759 | 
760 | def order(x, NoneIsLast = True, decreasing = False):
761 |     """
762 |     Returns the ordering of the elements of x. The list
763 |     [ x[j] for j in order(x) ] is a sorted version of x.
764 | 
765 |     Missing values in x are indicated by None. If NoneIsLast is true,
766 |     then missing values are ordered to be at the end.
767 |     Otherwise, they are ordered at the beginning.
768 |     """
769 |     omitNone = False
770 |     if NoneIsLast is None:
771 |         NoneIsLast = True
772 |         omitNone = True
773 | 
774 |     n = len(x)
775 |     ix = list(range(n))
776 |     if None not in x:
777 |         ix.sort(reverse = decreasing, key = lambda j : x[j])
778 |     else:
779 |         # Handle None values properly.
780 |         def key(i, x = x):
781 |             elem = x[i]
782 |             # Valid values are True or False only.
783 |             if decreasing == NoneIsLast:
784 |                 return not(elem is None), elem
785 |             else:
786 |                 return elem is None, elem
787 |         ix = list(range(n))
788 |         ix.sort(key=key, reverse=decreasing)
789 | 
790 |     if omitNone:
791 |         n = len(x)
792 |         for i in range(n-1, -1, -1):
793 |             if x[ix[i]] is None:
794 |                 n -= 1
795 |         return ix[:n]
796 |     return ix
797 | 
--------------------------------------------------------------------------------