├── LICENSE.txt
├── README.md
├── ROSE-local.sh
├── annotation
    ├── hg18_refseq.ucsc
    ├── hg19_refseq.ucsc
    ├── hg38_refseq.ucsc
    ├── mm10_refseq.ucsc
    ├── mm8_refseq.ucsc
    └── mm9_refseq.ucsc
├── bin
    ├── ROSE_bamToGFF.py
    ├── ROSE_callSuper.R
    ├── ROSE_geneMapper.py
    └── ROSE_main.py
└── lib
    └── ROSE_utils.py


/LICENSE.txt:
--------------------------------------------------------------------------------
  1 | =====ROSE: RANK ORDERING OF SUPER-ENHANCERS=====
  2 | 
  3 | ROSE IS RELEASED UNDER THE MIT X11 LICENSE
  4 | 
  5 | EXAMPLE DATA AND ADDTIONAL INFORMATION CAN BE FOUND HERE:
  6 | http://younglab.wi.mit.edu/super_enhancer_code.html
  7 | 
  8 | N.B. LICENSE.txt
  9 | 
 10 | For details of this analysis see:
 11 | 
 12 | Master Transcription Factors and Mediator Establish Super-Enhancers at Key Cell Identity Genes 
 13 | Warren A. Whyte, David A. Orlando, Denes Hnisz, Brian J. Abraham, Charles Y. Lin, Michael H. Kagey, Peter B. Rahl, Tong Ihn Lee and Richard A. Young
 14 | Cell 153, 307-319, April 11, 2013
 15 | 
 16 | and
 17 | 
 18 | Selective Inhibition of Tumor Oncogenes by Disruption of Super-enhancers 
 19 | Jakob Lovén, Heather A. Hoke, Charles Y. Lin, Ashley Lau, David A. Orlando, Christopher R. Vakoc, James E. Bradner, Tong Ihn Lee, and Richard A. Young
 20 | Cell 153, 320-334, April 11, 2013
 21 | 
 22 | Please cite these papers when using this code.
 23 | 
 24 | SOFTWARE AUTHORS: Charles Y. Lin, David A. Orlando, Brian J. Abraham
 25 | CONTACT: young_computation@wi.mit.edu 
 26 | ACKNOWLEDGEMENTS: Graham Ruby
 27 | Developed using Python 2.7.3, R 2.15.3, and SAMtools 0.1.18
 28 | 
 29 | PURPOSE: To create stitched enhancers, and to separate super-enhancers from typical enhancers using sequencing data (.bam) given a file of previously identified constituent enhancers (.gff)
 30 | 
 31 | 1) PREPARATION/REQUIREMENTS
 32 | 
 33 | .bam files of sequencing reads for factor of interest and control (WCE/IgG recommended).
 34 | .bam files must have chromosome IDs starting with "chr"
 35 | .bam files must be sorted and indexed using SAMtools in order for bamToGFF.py to work. (http://samtools.sourceforge.net/samtools.shtml)
 36 | Code must be run from directory in which it is stored.
 37 | .gff file of constituent enhancers previously identified (gff format ref: https://genome.ucsc.edu/FAQ/FAQformat.html#format3).
 38 | .gff must have the following columns:
 39 | 1: chromosome (chr#)
 40 | 2: unique ID for each constituent enhancer region
 41 | 4: start of constituent
 42 | 5: end of constituent
 43 | 7: strand (+,-,.)
 44 | 9: unique ID for each constituent enhancer region
 45 | NOTE: if value for column 2 and 9 differ, value in column 2 will be used
 46 | 
 47 | 2) CONTENTS
 48 | 
 49 | ROSE_main.py: main program
 50 | ROSE_utils.py: utility methods
 51 | ROSE_bamToGFF.py: calculates density of .bam sequencing reads in .gff regions
 52 | ROSE_callSuper.R: ranks regions by their densities, creates a cutoff to separate super-enhancers from typical enhancers
 53 | ROSE_geneMapper.py: assigns stitched enhancers to genes
 54 | annotation/: Refseq gene tables for genomes MM8,MM9,MM10,HG18,HG19,HG38
 55 | In ROSE_DATA
 56 | 
 57 | example.sh: sample call of ROSE_main.py
 58 | data/: folder containing an example .gff input file and two example .bam files
 59 | example/: folder containing example output generated by example.sh
 60 | 3) USAGE
 61 | 
 62 | Program is run by calling ROSE_main.py
 63 | 
 64 | From within root directory: 
 65 | python ROSE_main.py -g GENOME_BUILD -i INPUT_CONSTITUENT_GFF -r RANKING_BAM -o OUTPUT_DIRECTORY [optional: -s STITCHING_DISTANCE -t TSS_EXCLUSION_ZONE_SIZE -c CONTROL_BAM]
 66 | 
 67 | Required parameters:
 68 | 
 69 | GENOME_BUILD: one of hg18, hg19, hg38, mm8, mm9, or mm10 referring to the UCSC genome build used for read mapping
 70 | INPUT_CONSTITUENT_GFF: .gff file (described above) of regions that were previously calculated to be enhancers. I.e. Med1-enriched regions identified using MACS.
 71 | RANKING_BAM: .bam file to be used for ranking enhancers by density of this factor. I.e. Med1 ChIP-Seq reads.
 72 | OUTPUT_DIRECTORY: directory to be used for storing output.
 73 | Optional parameters:
 74 | 
 75 | STITCHING_DISTANCE: maximum distance between two regions that will be stitched together (Default: 12.5kb)
 76 | TSS_EXCLUSION_ZONE_SIZE: exclude regions contained within +/- this distance from TSS in order to account for promoter biases (Default: 0; recommended if used: 2500). If this value is 0, will not look for a gene file.
 77 | CONTROL_BAM: .bam file to be used as a control. Subtracted from the density of the RANKING_BAM. I.e. Whole cell extract reads.
 78 | 4) CODE PROCEDURE:
 79 | 
 80 | ROSE_main.py will:
 81 | 
 82 | format output directory hierarchy
 83 | Root name of input .gff ([input_enhancer_list].gff) used as naming root for output files.
 84 | stitch enhancer constituents in INPUT_CONSTITUENT_GFF based on STITCHING_DISTANCE and make .gff and .bed of stitched collection
 85 | TSS exclusion, if not zero, is attempted before stitching
 86 | Names of stitched regions start with number of regions stitched followed by leftmost constituent ID
 87 | call bamToGFF.py to get density of RANKING_BAM and CONTROL_BAM in stitched regions and constituents
 88 | Maximum time to wait for bamToGFF.py is 12h but can be changed -- quits if running too long
 89 | call callSuper.R to sort stitched enhancers by their background-subtracted density of RANKING_BAM and separate into two groups
 90 | 5) OUTPUT:
 91 | 
 92 | All file names begin with the root of INPUT_CONSTITUENT_GFF
 93 | 
 94 | **OUTPUT_DIRECTORY/gff/
 95 | .gff: copied .gff file of INPUT_CONSTITUENT_GFF 
 96 | (chrom, name, [blank], start, end, [blank], [blank], strand, [blank], [blank], name)
 97 | STITCHED.gff: regions created by stitching together INPUT_CONSTITUENT_GFF at STITCHING_DISTANCE
 98 | (chrom, name, [blank], start, end, [blank], [blank], strand, [blank], [blank], name) Name is number of constituents stitched together followed by ID of leftmost constituent.
 99 | **OUTPUT_DIRECTORY/mappedGFF/
100 | *_MAPPED.gff: output of bamToGFF using each bam file containing densities of factor in each constituent
101 | (constituent ID, region tested, average read density in units of reads-per-million-mapped per bp of constituent)
102 | *_STITCHED*_MAPPED.gff: output of bamToGFF using each bam file containing densities of factor in each stitched enhancer
103 | (stitched enhancer ID, region tested, average read density in units of reads-per-million-mapped per bp of stitched enhancer)
104 | **OUTPUT_DIRECTORY/
105 | STITCHED_ENHANCER_REGION_MAP.txt: all densities from bamToGFF calculated in stitched enhancers 
106 | (stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, rank of RANKING_BAM signal, signal of RANKING_BAM) 
107 | Signal of RANKING_BAM is density times length.
108 | *_AllEnhancers.table.txt: Rankings and super status for each stitched enhancer
109 | (stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, size of constituents that were stitched together, signal of RANKING_BAM, rank of RANKING_BAM, binary of super-enhancer (1) vs. typical (0)) 
110 | Signal of RANKING_BAM is density times length.
111 | *_SuperEnhancers.table.txt: Rankings and super status for super-enhancers 
112 | (stitched enhancer ID, chromosome, stitched enhancer start, stitched enhancer end, number of constituents stitched, size of constituents that were stitched together, signal of RANKING_BAM, rank of RANKING_BAM, binary of super-enhancer (1) vs. typical (0)) 
113 | Signal of RANKING_BAM is density times length.
114 | *_Enhancers_withSuper.bed: .bed file to be loaded into the UCSC browser to visualize super-enhancers and typical enhancers.
115 | (chromosome, stitched enhancer start, stitched enhancer end, stitched enhancer ID, rank by RANKING_BAM signal)
116 | *_Plot_points.png: visualization of the ranks of super-enhancers and the two groups. Stitched enhancers are ranked by their RANKING_BAM signal and their ranks are along the X axis. Corresponding RANKING_BAM signal on the Y axis.
117 | ====================================
118 | NOTES:
119 | 
120 | mapEnhancerFromFactor.py has a debug mode that can be enabled in the beginning of the main function.
121 | Enhancers in INPUT_RANKING_GFF may overlap each other in the input.
122 | This code can be easily parallelized by following the instructions in the main function of mapEnhancerFromFactor around line 369.
123 | Other gene lists may be added if downloaded from UCSC.
124 | Code from external sources are also cited in-line.
125 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ROSE : RANK ORDERING OF SUPER-ENHANCERS
 2 | 
 3 | CLONED using SOURCETREE from: https://bitbucket.org/young_computation/rose/src/master/
 4 | 
 5 | ### Changes from Source.
 6 | 1. USAGE
 7 | 
 8 | 	```bash
 9 | 	PATHTO=/path/to/ROSE
10 | 	PYTHONPATH=$PATHTO/lib
11 | 	export PYTHONPATH
12 | 	export PATH=$PATH:$PATHTO/bin
13 | 
14 | 	ROSE_main.py [options] -g [GENOME] -i [INPUT_REGION_GFF] -r [RANKBY_BAM_FILE] -o [OUTPUT_FOLDER] [OPTIONAL_FLAGS]
15 | 	```
16 | 
17 | 1. Update: 
18 | 
19 | 	* ROSE is executable independent of software directory location.
20 | 	* ROSE is compatible with python3
21 | 	* ROSE as a docker image: ghcr.io/stjude/abralab/rose:latest
22 | 
23 | 1. REQUIREMENTS:
24 | 
25 | 	1. All files :
26 | 	All input files much be in one directory.
27 | 
28 | 	1. Annotation file :
29 | 	Annotation file should be in UCSC table track format (https://genome.ucsc.edu/cgi-bin/hgTables).
30 | 	Annotation file should be saved as [GENOME]_refseq.ucsc (example: hg19_refseq.ucsc).
31 | 	Annotation file should be in annotation/ folder in the input files directory.
32 | 
33 | 	1. BAM files (of sequencing reads for factor of interest and control) :
34 | 	Files must have chromosome IDs starting with "chr"
35 | 	Files must be sorted and indexed using SAMtools in order for bamToGFF.py to work. (http://samtools.sourceforge.net/samtools.shtml)
36 | 
37 | 	1. Peak file of constituent enhancers :
38 | 	File must be in GFF format with the following columns:
39 | 
40 | 			column 1: chromosome (chr#)
41 | 			column 2: unique ID for each constituent enhancer region
42 | 			column 4: start of constituent
43 | 			column 5: end of constituent
44 | 			column 7: strand (+,-,.)
45 | 			column 9: unique ID for each constituent enhancer region
46 | 			
47 | 		NOTE: if value for column 2 and 9 differ, value in column 2 will be used
48 | 
49 | 1. DIRECTORY structure
50 | 	```
51 | 	├── LICENSE.txt
52 | 	│
53 | 	├── README.md
54 | 	│
55 | 	├── bin
56 | 	│   ├── ROSE_bamToGFF.py : calculates density of .bam reads in .gff regions
57 | 	│   ├── ROSE_callSuper.R : ranks regions by their densities, creates cutoff
58 | 	│   ├── ROSE_geneMapper.py : assigns stitched enhancers to genes
59 | 	│   └── ROSE_main.py : main program
60 | 	└── lib
61 | 	    └── ROSE_utils.py : utilities method
62 | 
63 | 	Total: 2 directories, 8 files
64 | 	```
65 | 1. DEPENDENCIES
66 | 
67 | 	* samtools
68 | 	* R version > 3.4
69 | 	* bedtools > 2
70 | 	* python3
71 | 


--------------------------------------------------------------------------------
/ROSE-local.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | #
  3 | # Rose Caller to detect both Enhancers and Super-Enhancers
  4 | # Hardcoded implementation of ROSE for St. Jude, Abraham's lab.
  5 | # Version 1 11/16/2019
  6 | 
  7 | ##############################################################
  8 | # ##### Please replace PATHTO with your own directory ###### #
  9 | ##############################################################
 10 | PATHTO=/path/to/ROSE
 11 | PYTHONPATH=$PATHTO/lib
 12 | export PYTHONPATH
 13 | export PATH=$PATH:$PATHTO/bin
 14 | 
 15 | if [ $# -lt 7 ]; then
 16 |   echo ""
 17 |   echo 1>&2 Usage: $0 ["GTF file"] ["BAM file"] ["OutputDir"] ["feature type"] ["species"] ["bed fileA"] ["bed fileB"]
 18 |   echo ""
 19 |   exit 1
 20 | fi
 21 | 
 22 | #================================================================================
 23 | #Parameters for running
 24 | 
 25 | # GTF files
 26 | GTFFILE=$1
 27 | 
 28 | # BAM file
 29 | BAMFILE=$2
 30 | 
 31 | # Output Directory
 32 | OUTPUTDIR=$3
 33 | OUTPUTDIR=${OUTPUTDIR:=ROSE_out}
 34 | 
 35 | # Feature type
 36 | FEATURE=$4
 37 | FEATURE=${FEATURE:=gene}
 38 | 
 39 | # Species
 40 | SPECIES=$5
 41 | SPECIES=${SPECIES:=hg19}
 42 | 
 43 | # Bed File A
 44 | FILEA=$6
 45 | 
 46 | # Bed File B
 47 | FILEB=$7
 48 | 
 49 | # Transcription Start Size Window
 50 | #TSS=
 51 | TSS=${TSS:=2000}
 52 | 
 53 | # Maximum linking distance for stitching
 54 | #STITCH=
 55 | STITCH=${STITCH:=12500}
 56 | 
 57 | 
 58 | echo "#############################################"
 59 | echo "######             ROSE v1             ######"
 60 | echo "#############################################"
 61 | 
 62 | echo "Input Bed File A: $FILEA"
 63 | echo "Input Bed File B: $FILEB"
 64 | echo "BAM file: $BAMFILE"
 65 | echo "Output directory: $OUTPUTDIR"
 66 | echo "Species: $SPECIES"
 67 | echo "Feature type: $FEATURE"
 68 | #================================================================================
 69 | # 
 70 | # UCSC TRACK FORMAT ANNOTATION FILE 
 71 | # Generate UCSC table track annotation file using NCBI GTF refseq.
 72 | #
 73 | mkdir -p annotation
 74 | echo -e "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\tX\tX\tX\t\tX\tname2" > annotation/$SPECIES"_refseq.ucsc"
 75 | 
 76 | if [[ $FEATURE == "gene" ]]; then
 77 | awk -F'[\t ]' '{
 78 |   if($3=="gene")
 79 |     print "0\t" $14 "\tchr" $1 "\t" $7 "\t" $4 "\t" $5 "\t" $4 "\t" $5 "\t.\t.\t.\t.\t" $18}' $GTFFILE | sed s/\"//g >> annotation/$SPECIES"_refseq.ucsc"
 80 | 
 81 | elif [[ $FEATURE == "transcript" ]]; then
 82 | awk -F'[\t ]' '{
 83 |   if($3=="transcript")
 84 |     print "0\t" $14 "\tchr" $1 "\t" $7 "\t" $4 "\t" $5 "\t" $4 "\t" $5 "\t.\t.\t.\t.\t" $18}' $GTFFILE | sed s/\"//g >> annotation/$SPECIES"_refseq.ucsc"
 85 | fi
 86 | echo "Annotation file: "$SPECIES"_refseq.ucsc"
 87 | 
 88 | #
 89 | # INPUT CONSTITUENT FILE
 90 | # merge peak bed files generated from MACS1 "keep_dup=all" and "keep_dup=auto" to generate constituent enhancers.
 91 | cat $FILEA $FILEB | sort -k1,1 -k2,2n | mergeBed -i - | awk -F\\t '{print $1 "\t" NR "\t\t" $2 "\t" $3 "\t\t.\t\t" NR}' > unionpeaks.gff
 92 | echo "Merge Bed file: unionpeaks.gff"
 93 | echo
 94 | 
 95 | #
 96 | # ROSE
 97 | #
 98 | ROSE_main.py -s $STITCH -t $TSS -g $SPECIES -i unionpeaks.gff -r $BAMFILE -o $OUTPUTDIR
 99 | 
100 | echo "Done!"
101 | 


--------------------------------------------------------------------------------
/bin/ROSE_bamToGFF.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | #bamToGFF.py
  3 | 
  4 | #script to grab reads from a bam that align to a .gff file
  5 | import sys
  6 | import re
  7 | 
  8 | import ROSE_utils
  9 | 
 10 | 
 11 | from collections import defaultdict
 12 | 
 13 | import os
 14 | 
 15 | import string
 16 | 
 17 | 
 18 | 
 19 | #=====================================================================
 20 | #====================MAPPING BAM READS TO GFF REGIONS=================
 21 | #=====================================================================
 22 | 
 23 | 
 24 | def mapBamToGFF(bamFile,gff,sense = 'both',extension = 200,floor = 0,rpm = False,matrix = None):
 25 | 
 26 | #def mapBamToGFF(bamFile,gff,sense = 'both',unique = 0,extension = 200,floor = 0,density = False,rpm = False,binSize = 25,clusterGram = None,matrix = None,raw = False,includeJxnReads = False):
 27 |     '''maps reads from a bam to a gff'''
 28 |     floor = int(floor)
 29 |     
 30 |     #USING BAM CLASS
 31 |     bam = ROSE_utils.Bam(bamFile)
 32 | 
 33 | 
 34 |     #new GFF to write to
 35 |     newGFF = []
 36 |     #millionMappedReads
 37 | 
 38 | 
 39 |     if rpm:    
 40 |         MMR= round(float(bam.getTotalReads('mapped'))/1000000,4)
 41 |     else:
 42 |         MMR = 1
 43 | 
 44 |     print(('using a MMR value of %s' % (MMR)))
 45 |     
 46 |     #senseTrans = maketrans('-+.','+-+') #deprecated
 47 | 
 48 |     if ROSE_utils.checkChrStatus(bamFile) == 1:
 49 |       print("has chr")
 50 |       hasChrFlag = 1
 51 |       #sys.exit();
 52 |     else:
 53 |       print("does not have chr")
 54 |       hasChrFlag = 0
 55 |       #sys.exit()
 56 |       
 57 |     if type(gff) == str:
 58 |         gff = ROSE_utils.parseTable(gff,'\t')
 59 |         
 60 |     #setting up a maxtrix table
 61 | 
 62 |     newGFF.append(['GENE_ID','locusLine'] + ['bin_'+str(n)+'_'+bamFile.split('/')[-1] for n in range(1,int(matrix)+1,1)])        
 63 | 
 64 |     #getting and processing reads for gff lines
 65 |     ticker = 0
 66 |     print('Number lines processed')
 67 |     for line in gff:
 68 |         line = line[0:9]
 69 |         if ticker%100 == 0:
 70 |             print(ticker)
 71 |         ticker+=1
 72 |         if not hasChrFlag:
 73 |             line[0] = re.sub(r"chr",r"",line[0])
 74 |         gffLocus = ROSE_utils.Locus(line[0],int(line[3]),int(line[4]),line[6],line[1])
 75 |         #print line[0]
 76 |         #sys.exit()
 77 |         searchLocus = ROSE_utils.makeSearchLocus(gffLocus,int(extension),int(extension))
 78 |         
 79 |         reads = bam.getReadsLocus(searchLocus,'both',False,'none')
 80 |         #now extend the reads and make a list of extended reads
 81 |         extendedReads = []
 82 |         for locus in reads:
 83 |             if locus.sense() == '+' or locus.sense() == '.':
 84 |                 locus = ROSE_utils.Locus(locus.chr(),locus.start(),locus.end()+extension,locus.sense(), locus.ID())
 85 |             if locus.sense() == '-':
 86 |                 locus = ROSE_utils.Locus(locus.chr(),locus.start()-extension,locus.end(),locus.sense(),locus.ID())
 87 |             extendedReads.append(locus)
 88 |         if gffLocus.sense() == '+' or gffLocus.sense == '.':
 89 |             senseReads = [x for x in extendedReads if x.sense() == '+' or x.sense() == '.']
 90 |             antiReads = [x for x in extendedReads if x.sense() == '-']
 91 |         else:
 92 |             senseReads = [x for x in extendedReads if x.sense() == '-' or x.sense() == '.']
 93 |             antiReads = [x for x in extendedReads if x.sense() == '+']
 94 | 
 95 |         senseHash = defaultdict(int)
 96 |         antiHash = defaultdict(int)
 97 | 
 98 |         #filling in the readHashes             
 99 |         if sense == '+' or sense == 'both' or sense =='.':
100 |             for read in senseReads:
101 |                 for x in range(read.start(),read.end()+1,1):
102 |                     senseHash[x]+=1
103 |         if sense == '-' or sense == 'both' or sense == '.':
104 |             #print('foo')
105 |             for read in antiReads:
106 |                 for x in range(read.start(),read.end()+1,1):
107 |                     antiHash[x]+=1
108 | 
109 |         #now apply flooring and filtering for coordinates
110 |         keys = ROSE_utils.uniquify(list(senseHash.keys())+list(antiHash.keys()))
111 |         if floor > 0:
112 | 
113 |             keys = [x for x in keys if (senseHash[x]+antiHash[x]) > floor]
114 |         #coordinate filtering
115 |         keys = [x for x in keys if gffLocus.start() < x < gffLocus.end()]
116 | 
117 | 
118 |         #setting up the output table
119 |         #clusterLine = [gffLocus.ID(),gffLocus.__str__()]
120 | 
121 |         # bug fix gff coordinates with same chromosomal name as BAM
122 |         if not hasChrFlag:
123 |             clusterLine = [gffLocus.ID(),"chr" + gffLocus.__str__()]
124 |         else: 
125 |            clusterLine = [gffLocus.ID(),gffLocus.__str__()]
126 | 
127 |         #getting the binsize
128 |         binSize = (gffLocus.len()-1)/int(matrix)
129 |         nBins = int(matrix)
130 |         if binSize == 0:
131 |             clusterLine+=['NA']*int(matrix)
132 |             newGFF.append(clusterLine)
133 |             continue
134 |         n=0
135 |         if gffLocus.sense() == '+' or gffLocus.sense() =='.' or gffLocus.sense() == 'both':
136 |             i = gffLocus.start()
137 | 
138 |             while n <nBins:
139 |                 n+=1
140 |                 binKeys = [x for x in keys if i < x < i+binSize]
141 |                 binDen = float(sum([senseHash[x]+antiHash[x] for x in binKeys]))/binSize
142 |                 clusterLine+=[round(binDen/MMR,4)]
143 |                 i = i+binSize
144 |         else:
145 |             i = gffLocus.end()
146 |             while n < nBins:
147 |                 n+=1
148 |                 binKeys = [x for x in keys if i-binSize < x < i]
149 |                 binDen = float(sum([senseHash[x]+antiHash[x] for x in binKeys]))/binSize
150 |                 clusterLine+=[round(binDen/MMR,4)]
151 |                 i = i-binSize
152 |         newGFF.append(clusterLine)
153 |         
154 |             
155 |     return newGFF
156 |         
157 |                 
158 |                 
159 | #=====================================================================
160 | #============================MAIN METHOD==============================
161 | #=====================================================================
162 |         
163 | 
164 | def main():
165 |     from optparse import OptionParser
166 |     usage = "usage: %prog [options] -b [SORTED BAMFILE] -i [INPUTFILE] -o [OUTPUTFILE]"
167 |     parser = OptionParser(usage = usage)
168 |     #required flags
169 |     parser.add_option("-b","--bam", dest="bam",nargs = 1, default=None,
170 |                       help = "Enter .bam file to be processed.")
171 |     parser.add_option("-i","--input", dest="input",nargs = 1, default=None,
172 |                       help = "Enter .gff or ENRICHED REGION file to be processed.")
173 |     #output flag
174 |     parser.add_option("-o","--output", dest="output",nargs = 1, default=None,
175 |                       help = "Enter the output filename.")
176 |     #additional options
177 |     parser.add_option("-s","--sense", dest="sense",nargs = 1, default='both',
178 |                       help = "Map to '+','-' or 'both' strands. Default maps to both.")
179 | 
180 | 
181 |     parser.add_option("-f","--floor", dest="floor",nargs =1, default=0,
182 |                       help = "Sets a read floor threshold necessary to count towards density")    
183 |     parser.add_option("-e","--extension", dest="extension",nargs = 1, default=200,
184 |                       help = "Extends reads by n bp. Default value is 200bp")
185 |     parser.add_option("-r","--rpm", dest="rpm",action = 'store_true', default=False,
186 |                       help = "Normalizes density to reads per million (rpm)")
187 | 
188 | 
189 |     parser.add_option("-m","--matrix", dest="matrix",nargs = 1, default=None,
190 |                       help = "Outputs a variable bin sized matrix. User must specify number of bins.")
191 | 
192 |     (options,args) = parser.parse_args()
193 | 
194 |     print(options)
195 |     print(args)
196 | 
197 |     if options.bam:
198 |         bamFile = options.bam
199 |         fullPath = os.path.abspath(bamFile)
200 |         bamName = fullPath.split('/')[-1].split('.')[0]
201 |         pathFolder = '/'.join(fullPath.split('/')[0:-1])
202 |         fileList = os.listdir(pathFolder)
203 |         hasBai = False
204 |         for fileName in fileList:
205 |             if fileName.count(bamName) == 1 and fileName.count('.bai') == 1:
206 |                 hasBai = True
207 | 
208 |         if not hasBai:
209 |             print('ERROR: no associated .bai file found with bam. Must use a sorted bam with accompanying index file')
210 |             parser.print_help()
211 |             exit()
212 |    
213 |     if options.sense:
214 |         if ['+','-','.','both'].count(options.sense) == 0:
215 |             print('ERROR: sense flag must be followed by +,-,.,both')
216 |             parser.print_help()
217 |             exit()
218 | 
219 | 
220 |     if options.matrix:
221 |         try:
222 |             int(options.matrix)
223 |         except:
224 |             print('ERROR: User must specify an integer bin number for matrix (try 50)')
225 |             parser.print_help()
226 |             exit()
227 |             
228 | 
229 |     
230 |     
231 |     if options.input and options.bam:
232 |         inputFile = options.input
233 |         gffFile = inputFile
234 | 
235 |         bamFile = options.bam
236 |         
237 |         if options.output == None:
238 |             output = os.getcwd() + inputFile.split('/')[-1]+'.mapped'
239 |         else:
240 |             output = options.output
241 |         if options.matrix:
242 |             print('mapping to GFF and making a matrix with fixed bin number')
243 | 
244 |             newGFF = mapBamToGFF(bamFile,gffFile,options.sense,int(options.extension),options.floor,options.rpm,options.matrix)
245 | 
246 |             
247 |         ROSE_utils.unParseTable(newGFF,output,'\t')
248 |     else:
249 |         parser.print_help()
250 |                 
251 |  
252 | if __name__ == "__main__":
253 |     main()
254 | 


--------------------------------------------------------------------------------
/bin/ROSE_callSuper.R:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env Rscript
  2 | #============================================================================
  3 | #==============SUPER-ENHANCER CALLING AND PLOTTING FUNCTIONS=================
  4 | #============================================================================
  5 | 
  6 | #This function calculates the cutoff by sliding a diagonal line and finding where it is tangential (or as close as possible)
  7 | calculate_cutoff <- function(inputVector, drawPlot=TRUE,...){
  8 |  	inputVector <- sort(inputVector)
  9 | 	inputVector[inputVector<0]<-0 #set those regions with more control than ranking equal to zero
 10 | 	slope <- (max(inputVector)-min(inputVector))/length(inputVector) #This is the slope of the line we want to slide. This is the diagonal.
 11 | 	xPt <- floor(optimize(numPts_below_line,lower=1,upper=length(inputVector),myVector= inputVector,slope=slope)$minimum) #Find the x-axis point where a line passing through that point has the minimum number of points below it. (ie. tangent)
 12 | 	y_cutoff <- inputVector[xPt] #The y-value at this x point. This is our cutoff.
 13 | 	
 14 | 	if(drawPlot){  #if TRUE, draw the plot
 15 | 		plot(1:length(inputVector), inputVector,type="l",...)
 16 | 		b <- y_cutoff-(slope* xPt)
 17 | 		abline(v= xPt,h= y_cutoff,lty=2,col=8)
 18 | 		points(xPt,y_cutoff,pch=16,cex=0.9,col=2)
 19 | 		abline(coef=c(b,slope),col=2)
 20 | 		title(paste("x=",xPt,"\ny=",signif(y_cutoff,3),"\nFold over Median=",signif(y_cutoff/median(inputVector),3),"x\nFold over Mean=",signif(y_cutoff/mean(inputVector),3),"x",sep=""))
 21 | 		axis(1,sum(inputVector==0),sum(inputVector==0),col.axis="pink",col="pink") #Number of regions with zero signal
 22 | 	}
 23 | 	return(list(absolute=y_cutoff,overMedian=y_cutoff/median(inputVector),overMean=y_cutoff/mean(inputVector)))
 24 | }
 25 | 
 26 | #this is an accessory function, that determines the number of points below a diagnoal passing through [x,yPt]
 27 | numPts_below_line <- function(myVector,slope,x){
 28 | 	yPt <- myVector[x]
 29 | 	b <- yPt-(slope*x)
 30 | 	xPts <- 1:length(myVector)
 31 | 	return(sum(myVector<=(xPts*slope+b)))
 32 | }
 33 | 
 34 | 
 35 | convert_stitched_to_bed <- function(inputStitched,trackName,trackDescription,outputFile,splitSuper=TRUE,score=c(),superRows=c(),baseColor="0,0,0",superColor="255,0,0"){
 36 | 	outMatrix <- matrix(data="",ncol=4+ifelse(length(score)==nrow(inputStitched),1,0),nrow=nrow(inputStitched))
 37 | 	
 38 | 	outMatrix[,1] <- as.character(inputStitched$CHROM)
 39 | 	outMatrix[,2] <- as.character(inputStitched$START)
 40 | 	outMatrix[,3] <- as.character(inputStitched$STOP)
 41 | 	outMatrix[,4] <- as.character(inputStitched$REGION_ID)
 42 | 	if(length(score)==nrow(inputStitched)){
 43 | 		score <- rank(score,ties.method="first")
 44 | 		score <- length(score)-score+1  #Stupid rank only does smallest to largest. 
 45 | 		outMatrix[,5] <- as.character(score)
 46 | 	}
 47 | 	trackDescription <- paste(trackDescription,"\nCreated on ",format(Sys.time(), "%b %d %Y"),collapse="",sep="")
 48 | 	trackDescription <- gsub("\n","\t", trackDescription)
 49 | 	tName <- gsub(" ","_",trackName)
 50 | 	cat('track name="', tName,'" description="', trackDescription,'" itemRGB=On color=',baseColor,"\n",sep="",file=outputFile)
 51 | 	write.table(file= outputFile,outMatrix,sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE,append=TRUE)
 52 | 	if(splitSuper==TRUE){
 53 | 		cat("\ntrack name=\"Super_", tName,'" description="Super ', trackDescription,'" itemRGB=On color=', superColor,"\n",sep="",file=outputFile,append=TRUE)
 54 | 		write.table(file= outputFile,outMatrix[superRows,],sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE,append=TRUE)
 55 | 	}
 56 | }
 57 | 
 58 | writeSuperEnhancer_table <- function(superEnhancer,description,outputFile,additionalData=NA){
 59 | 	description <- paste("#",description,"\nCreated on ",format(Sys.time(), "%b %d %Y"),collapse="",sep="")
 60 | 	description <- gsub("\n","\n#",description)
 61 | 	cat(description,"\n",file=outputFile)
 62 | 	if(is.matrix(additionalData)){
 63 | 		if(nrow(additionalData)!=nrow(superEnhancer)){
 64 | 			warning("Additional data does not have the same number of rows as the number of super stitched peaks.\n--->>> ADDITIONAL DATA NOT INCLUDED <<<---\n")
 65 | 		}else{
 66 | 			superEnhancer <- cbind(superEnhancer,additionalData)
 67 | 			superEnhancer = superEnhancer[order(superEnhancer$stitchedPeakRank),]
 68 | 			
 69 | 		}
 70 | 	}
 71 | 	write.table(file=outputFile,superEnhancer,sep="\t",quote=FALSE,row.names=FALSE,append=TRUE)
 72 | }
 73 | 
 74 | 
 75 | 
 76 | #============================================================================
 77 | #===================SUPER-ENHANCER CALLING AND PLOTTING======================
 78 | #============================================================================
 79 | 
 80 | 
 81 | 
 82 | args <- commandArgs()
 83 | 
 84 | print('THESE ARE THE ARGUMENTS')
 85 | print(args)
 86 | 
 87 | #ARGS
 88 | outFolder = args[6] #3
 89 | enhancerFile = args[7] #4
 90 | enhancerName = args[8] #5
 91 | wceName = args[9] #6
 92 | 
 93 | 
 94 | 
 95 | #Read enhancer regions with closestGene columns
 96 | stitched_regions <- read.delim(file= enhancerFile,sep="\t")
 97 | 
 98 | 
 99 | #perform WCE subtraction. Using pipeline table to match samples to proper background. 
100 | rankBy_factor = colnames(stitched_regions)[7]
101 | 
102 | prefix = unlist(strsplit(rankBy_factor,'_'))[1]
103 | 
104 | if(wceName == 'NONE'){
105 | 	
106 | 	
107 | 	rankBy_vector = as.numeric(stitched_regions[,7])
108 | 	
109 | }else{
110 | 	wceName = colnames(stitched_regions)[8]
111 | 	print('HERE IS THE WCE NAME')
112 | 	print(wceName)
113 | 	
114 | 	rankBy_vector = as.numeric(stitched_regions[,7])-as.numeric(stitched_regions[,8])
115 | }	
116 | 	
117 | 
118 | #SETTING NEGATIVE VALUES IN THE rankBy_vector to 0
119 | 
120 | rankBy_vector[rankBy_vector < 0] <- 0
121 | 
122 | 
123 | #FIGURING OUT THE CUTOFF
124 | 
125 | cutoff_options <- calculate_cutoff(rankBy_vector, drawPlot=FALSE,xlab=paste(rankBy_factor,' Stitched peaks'),ylab=paste(rankByFactor,' Signal','- ',wceName),lwd=2,col=4)
126 | 
127 | 
128 | #These are the super-enhancers
129 | superEnhancerRows <- which(rankBy_vector> cutoff_options$absolute)
130 | typicalEnhancers = setdiff(1:nrow(stitched_regions),superEnhancerRows)
131 | enhancerDescription <- paste(enhancerName," Stitched Peaks\nCreated from ", enhancerFile,"\nRanked by ",rankBy_factor,"\nUsing cutoff of ",cutoff_options$absolute," for Super-Stitched Peaks",sep="",collapse="")
132 | 
133 | 
134 | #MAKING HOCKEY STICK PLOT
135 | plotFileName = paste(outFolder,enhancerName,'_Plot_points.png',sep='')
136 | png(filename=plotFileName,height=600,width=600)
137 | signalOrder = order(rankBy_vector,decreasing=TRUE)
138 | if(wceName == 'NONE'){
139 | 	plot(length(rankBy_vector):1,rankBy_vector[signalOrder], col='red',xlab=paste(rankBy_factor,' Stitched peaks'),ylab=paste(rankBy_factor,' Signal'),pch=19,cex=2)	
140 | 	
141 | }else{
142 | 	plot(length(rankBy_vector):1,rankBy_vector[signalOrder], col='red',xlab=paste(rankBy_factor,' Stitched peaks'),ylab=paste(rankBy_factor,' Signal','- ',wceName),pch=19,cex=2)
143 | }
144 | abline(h=cutoff_options$absolute,col='grey',lty=2)
145 | abline(v=length(rankBy_vector)-length(superEnhancerRows),col='grey',lty=2)
146 | lines(length(rankBy_vector):1,rankBy_vector[signalOrder],lwd=4, col='red')
147 | text(0,0.8*max(rankBy_vector),paste(' Cutoff used: ',cutoff_options$absolute,'\n','Super-Stitched peaks identified: ',length(superEnhancerRows)),pos=4)
148 | 
149 | dev.off()
150 | 
151 | 
152 | 
153 | 
154 | #Writing a bed file
155 | bedFileName = paste(outFolder,enhancerName,'_Stitched_withSuper.bed',sep='')
156 | convert_stitched_to_bed(stitched_regions,paste(rankBy_factor,"Stitched"), enhancerDescription,bedFileName,score=rankBy_vector,splitSuper=TRUE,superRows= superEnhancerRows,baseColor="0,0,0",superColor="255,0,0")
157 | 
158 | 
159 | 
160 | #This matrix is just the super_enhancers
161 | true_super_enhancers <- stitched_regions[superEnhancerRows,]
162 | 
163 | additionalTableData <- matrix(data=NA,ncol=2,nrow=nrow(stitched_regions))
164 | colnames(additionalTableData) <- c("stitchedPeakRank","isSuper")
165 | additionalTableData[,1] <- nrow(stitched_regions)-rank(rankBy_vector,ties.method="first")+1
166 | additionalTableData[,2] <- 0
167 | additionalTableData[superEnhancerRows,2] <- 1
168 | 
169 | 
170 | #Writing enhancer and super-enhancer tables with enhancers ranked and super status annotated
171 | enhancerTableFile = paste(outFolder,enhancerName,'_AllStitched.table.txt',sep='')
172 | writeSuperEnhancer_table(stitched_regions, enhancerDescription,enhancerTableFile, additionalData= additionalTableData)
173 | 
174 | superTableFile = paste(outFolder,enhancerName,'_SuperStitched.table.txt',sep='')
175 | writeSuperEnhancer_table(true_super_enhancers, enhancerDescription,superTableFile, additionalData= additionalTableData[superEnhancerRows,])
176 | 


--------------------------------------------------------------------------------
/bin/ROSE_geneMapper.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | #130428
  3 | 
  4 | #ROSE_geneMapper.py
  5 | 
  6 | #main method wrapped script to take the enhancer region table output of ROSE_Main and map genes to it
  7 | #will create two outputs a gene mapped region table where each row is an enhancer
  8 | #and a gene table where each row is a gene
  9 | #does this by default for super-enhancers only
 10 | 
 11 | import sys
 12 | 
 13 | import ROSE_utils
 14 | 
 15 | import os
 16 | 
 17 | import string
 18 | 
 19 | from collections import defaultdict
 20 | 
 21 | 
 22 | #==================================================================
 23 | #====================MAPPING GENES TO ENHANCERS====================
 24 | #==================================================================
 25 | 
 26 | 
 27 | 
 28 | def mapEnhancerToGene(annotFile,enhancerFile,transcribedFile='',uniqueGenes=True,byRefseq=False,subtractInput=False):
 29 |     
 30 |     '''
 31 |     maps genes to enhancers. if uniqueGenes, reduces to gene name only. Otherwise, gives for each refseq
 32 |     '''
 33 |     print("Herp")
 34 |     startDict = ROSE_utils.makeStartDict(annotFile)
 35 |     print("Derp")
 36 |     enhancerTable = ROSE_utils.parseTable(enhancerFile,'\t')
 37 | 
 38 | 
 39 | 
 40 | 
 41 | 
 42 |     if len(transcribedFile) > 0:
 43 |         transcribedTable = ROSE_utils.parseTable(transcribedFile,'\t')
 44 |         transcribedGenes = [line[1] for line in transcribedTable]
 45 |     else:
 46 |         transcribedGenes = list(startDict.keys())
 47 | 
 48 |     print('MAKING TRANSCRIPT COLLECTION')
 49 |     transcribedCollection = ROSE_utils.makeTranscriptCollection(annotFile,0,0,500,transcribedGenes)
 50 | 
 51 | 
 52 |     print('MAKING TSS COLLECTION')
 53 |     tssLoci = []
 54 |     for geneID in transcribedGenes:
 55 |         tssLoci.append(ROSE_utils.makeTSSLocus(geneID,startDict,0,0))
 56 | 
 57 | 
 58 |     #this turns the tssLoci list into a LocusCollection
 59 |     #50 is the internal parameter for LocusCollection and doesn't really matter
 60 |     tssCollection = ROSE_utils.LocusCollection(tssLoci,50)
 61 | 
 62 | 
 63 | 
 64 |     geneDict = {'overlapping':defaultdict(list),'proximal':defaultdict(list),'enhancerString':defaultdict(list)}
 65 |     #list of all genes that appear in this analysis
 66 |     overallGeneList = []
 67 | 
 68 |     #set up the output tables
 69 |     #first by enhancer
 70 |     enhancerToGeneTable = [enhancerTable[5][0:6]+['OVERLAP_GENES','PROXIMAL_GENES','CLOSEST_GENE'] + enhancerTable[5][-2:]]
 71 |     
 72 |     #next by gene
 73 |     geneToEnhancerTable = [['GENE_NAME','REFSEQ_ID','PROXIMAL_STITCHED_PEAKS']]
 74 | 
 75 |     #have all information
 76 |     signalWithGenes = [['GENE_NAME', 'REFSEQ_ID', 'PROXIMAL_STITCHED_PEAKS', 'SIGNAL']]
 77 | 
 78 |     for line in enhancerTable[6:]:
 79 | 
 80 |         enhancerString = '%s:%s-%s' % (line[1],line[2],line[3])
 81 |         enhancerSignal = int(float(line[6]))
 82 |         if subtractInput: enhancerSignal = int(float(line[6]) - float(line[7]))
 83 | 
 84 |         enhancerLocus = ROSE_utils.Locus(line[1],line[2],line[3],'.',line[0])
 85 | 
 86 | 
 87 |         #overlapping genes are transcribed genes whose transcript is directly in the stitchedLocus         
 88 |         overlappingLoci = transcribedCollection.getOverlap(enhancerLocus,'both')           
 89 |         overlappingGenes =[]
 90 |         for overlapLocus in overlappingLoci:                
 91 |             overlappingGenes.append(overlapLocus.ID())
 92 | 
 93 |         #proximalGenes are transcribed genes where the tss is within 50kb of the boundary of the stitched loci
 94 |         proximalLoci = tssCollection.getOverlap(ROSE_utils.makeSearchLocus(enhancerLocus,50000,50000),'both')           
 95 |         proximalGenes =[]
 96 |         for proxLocus in proximalLoci:
 97 |             proximalGenes.append(proxLocus.ID())
 98 | 
 99 | 
100 |         distalLoci = tssCollection.getOverlap(ROSE_utils.makeSearchLocus(enhancerLocus,50000000,50000000),'both')           
101 |         distalGenes =[]
102 |         for proxLocus in distalLoci:
103 |             distalGenes.append(proxLocus.ID())
104 | 
105 |             
106 |             
107 |         overlappingGenes = ROSE_utils.uniquify(overlappingGenes)
108 |         proximalGenes = ROSE_utils.uniquify(proximalGenes)
109 |         distalGenes = ROSE_utils.uniquify(distalGenes)
110 |         allEnhancerGenes = overlappingGenes + proximalGenes + distalGenes
111 |         #these checks make sure each gene list is unique.
112 |         #technically it is possible for a gene to be overlapping, but not proximal since the
113 |         #gene could be longer than the 50kb window, but we'll let that slide here
114 |         for refID in overlappingGenes:
115 |             if proximalGenes.count(refID) == 1:
116 |                 proximalGenes.remove(refID)
117 | 
118 |         for refID in proximalGenes:
119 |             if distalGenes.count(refID) == 1:
120 |                 distalGenes.remove(refID)
121 | 
122 | 
123 |         #Now find the closest gene
124 |         if len(allEnhancerGenes) == 0:
125 |             closestGene = ''
126 |         else:
127 |             #get enhancerCenter
128 |             enhancerCenter = (int(line[2]) + int(line[3]))/2
129 | 
130 |             #get absolute distance to enhancer center
131 |             distList = [abs(enhancerCenter - startDict[geneID]['start'][0]) for geneID in allEnhancerGenes]
132 |             closestGene = startDict[allEnhancerGenes[distList.index(min(distList))]]['name']
133 | 
134 |         #NOW WRITE THE ROW FOR THE ENHANCER TABLE
135 |         newEnhancerLine = line[0:6]
136 | 
137 |         if byRefseq:
138 |             newEnhancerLine.append(','.join(ROSE_utils.uniquify([x for x in overlappingGenes])))
139 |             newEnhancerLine.append(','.join(ROSE_utils.uniquify([x for x in proximalGenes])))
140 |             closestGene = allEnhancerGenes[distList.index(min(distList))]
141 |             newEnhancerLine.append(closestGene)
142 |         else:
143 |             newEnhancerLine.append(','.join(ROSE_utils.uniquify([startDict[x]['name'] for x in overlappingGenes])))
144 |             newEnhancerLine.append(','.join(ROSE_utils.uniquify([startDict[x]['name'] for x in proximalGenes])))
145 |             closestGene = startDict[allEnhancerGenes[distList.index(min(distList))]]['name']
146 |             newEnhancerLine.append(closestGene)
147 | 
148 | 
149 |         #WRITE GENE TABLE
150 |         signalWithGenes.append([startDict[closestGene]['name'], closestGene, enhancerString, enhancerSignal])
151 | 
152 |         newEnhancerLine += line[-2:]
153 |         enhancerToGeneTable.append(newEnhancerLine)
154 |         #Now grab all overlapping and proximal genes for the gene ordered table
155 | 
156 |         overallGeneList +=overlappingGenes
157 |         for refID in overlappingGenes:
158 |             geneDict['overlapping'][refID].append(enhancerString)
159 | 
160 | 
161 |         overallGeneList+=proximalGenes
162 |         for refID in proximalGenes:
163 |             geneDict['proximal'][refID].append(enhancerString)
164 | 
165 | 
166 |     #End loop through
167 |     
168 |     #Make table by gene
169 |     overallGeneList = ROSE_utils.uniquify(overallGeneList)
170 |     
171 | 
172 |     nameOrder = ROSE_utils.order([startDict[x]['name'] for x in overallGeneList])
173 | 
174 |     usedNames = []
175 | 
176 |     for i in nameOrder:
177 |         refID = overallGeneList[i]
178 |         geneName = startDict[refID]['name']
179 |         if usedNames.count(geneName) > 0 and uniqueGenes == True:
180 | 
181 |             continue
182 |         else:
183 |             usedNames.append(geneName)
184 |         
185 |         proxEnhancers = geneDict['proximal'][refID] + geneDict['overlapping'][refID]
186 |         
187 |         newLine = [geneName,refID,','.join(proxEnhancers)]
188 | 
189 |         
190 |         geneToEnhancerTable.append(newLine)
191 | 
192 |     #re-sort enhancerToGeneTable
193 | 
194 |     enhancerOrder = ROSE_utils.order([int(line[-2]) for line in enhancerToGeneTable[1:]])
195 |     sortedTable = [enhancerToGeneTable[0]]
196 |     for i in enhancerOrder:
197 |         sortedTable.append(enhancerToGeneTable[(i+1)])
198 | 
199 |     return sortedTable,geneToEnhancerTable,signalWithGenes
200 | 
201 | 
202 | 
203 | #==================================================================
204 | #=========================MAIN METHOD==============================
205 | #==================================================================
206 | 
207 | def main():
208 |     '''
209 |     main run call
210 |     '''
211 |     debug = False
212 | 
213 | 
214 |     from optparse import OptionParser
215 |     usage = "usage: %prog [options] -g [GENOME] -i [INPUT_ENHANCER_FILE]"
216 |     parser = OptionParser(usage = usage)
217 |     #required flags
218 |     parser.add_option("-i","--i", dest="input",nargs = 1, default=None,
219 |                       help = "Enter a ROSE ranked enhancer or super-enhancer file")
220 |     parser.add_option("-g","--genome", dest="genome",nargs = 1, default=None,
221 |                       help = "Enter the genome build (MM9,MM8,HG18,HG19,HG38)")
222 |     parser.add_option("--custom", dest="custom_genome", default=None,
223 |                       help = "Enter the custom genome annotation .ucsc")
224 | 
225 |     #optional flags
226 |     parser.add_option("-l","--list", dest="geneList",nargs = 1, default=None,
227 |                       help = "Enter a gene list to filter through")
228 |     parser.add_option("-o","--out", dest="out",nargs = 1, default=None,
229 |                       help = "Enter an output folder. Default will be same folder as input file")
230 |     parser.add_option("-r","--refseq",dest="refseq",action = 'store_true', default=False,
231 |                       help = "If flagged will write output by refseq ID and not common name")
232 |     parser.add_option("-c","--control",dest="control",action = 'store_true', default=False,
233 |                       help = "If flagged will subtract input from sample signal")
234 | 
235 |     #RETRIEVING FLAGS
236 |     (options,args) = parser.parse_args()
237 | 
238 | 
239 |     if not options.input or not (options.genome or options.custom_genome):
240 | 
241 |         parser.print_help()
242 |         exit()
243 | 
244 |     #GETTING THE INPUT
245 |     enhancerFile = options.input
246 | 
247 |     #making the out folder if it doesn't exist
248 |     if options.out:
249 |         outFolder = ROSE_utils.formatFolder(options.out,True)
250 |     else:
251 |         outFolder = '/'.join(enhancerFile.split('/')[0:-1]) + '/'
252 | 
253 | 
254 |     #GETTING THE CORRECT ANNOT FILE
255 |     cwd = os.getcwd()
256 |     genomeDict = {
257 |         'HG18':'%s/annotation/hg18_refseq.ucsc' % (cwd),
258 |         'MM9': '%s/annotation/mm9_refseq.ucsc' % (cwd),
259 |         'HG19':'%s/annotation/hg19_refseq.ucsc' % (cwd),
260 |         'HG38':'%s/annotation/hg38_refseq.ucsc' % (cwd),
261 |         'MM8': '%s/annotation/mm8_refseq.ucsc' % (cwd),
262 |         'MM10':'%s/annotation/mm10_refseq.ucsc' % (cwd),
263 |         }
264 | 
265 |     #GETTING THE GENOME
266 |     if options.custom_genome:
267 |         annotFile = options.custom_genome
268 |         print('USING CUSTOM GENOME %s AS THE GENOME FILE' % options.custom_genome)
269 |     else:
270 |         genome = options.genome
271 |         annotFile = genomeDict[genome.upper()]
272 |         print('USING %s AS THE GENOME' % genome)
273 | 
274 |     #GETTING THE TRANSCRIBED LIST
275 |     if options.geneList:
276 | 
277 |         transcribedFile = options.geneList
278 |     else:
279 |         transcribedFile = ''
280 | 
281 |     enhancerToGeneTable,geneToEnhancerTable,withGenesTable = mapEnhancerToGene(annotFile,enhancerFile, uniqueGenes=True, byRefseq=options.refseq, subtractInput=options.control, transcribedFile=transcribedFile)
282 | 
283 |     #Writing enhancer output
284 |     enhancerFileName = enhancerFile.split('/')[-1].split('.')[0]
285 | 
286 |     #writing the enhancer table
287 |     out1 = '%s%s_REGION_TO_GENE.txt' % (outFolder,enhancerFileName)
288 |     ROSE_utils.unParseTable(enhancerToGeneTable,out1,'\t')
289 | 
290 |     #writing the gene table
291 |     out2 = '%s%s_GENE_TO_REGION.txt' % (outFolder,enhancerFileName)
292 |     ROSE_utils.unParseTable(geneToEnhancerTable,out2,'\t')
293 | 
294 |     out3 = '%s%s.table_withGENES.txt' % (outFolder,enhancerFileName)
295 |     ROSE_utils.unParseTable(withGenesTable,out3,'\t')
296 | 
297 | if __name__ == "__main__":
298 |     main()
299 | 


--------------------------------------------------------------------------------
/bin/ROSE_main.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | #mapEnhancerFromFactor.py
  4 | '''
  5 | PROGRAM TO STITCH TOGETHER REGIONS TO FORM ENHANCERS, MAP READ DENSITY TO STITCHED REGIONS,
  6 | AND RANK ENHANCERS BY READ DENSITY TO DISCOVER SUPER-ENHANCERS
  7 | APRIL 11, 2013
  8 | VERSION 0.1
  9 | CONTACT: youngcomputation@wi.mit.edu
 10 | '''
 11 | 
 12 | import sys
 13 | 
 14 | 
 15 | 
 16 | import ROSE_utils
 17 | 
 18 | import time
 19 | 
 20 | import os
 21 | 
 22 | import string
 23 | 
 24 | from collections import defaultdict
 25 | 
 26 | #==================================================================
 27 | #=====================REGION STITCHING=============================
 28 | #==================================================================
 29 | 
 30 | 
 31 | def regionStitching(inputGFF,stitchWindow,tssWindow,annotFile,removeTSS=True):
 32 |     print('PERFORMING REGION STITCHING')
 33 |     #first have to turn bound region file into a locus collection
 34 | 
 35 |     #need to make sure this names correctly... each region should have a unique name
 36 |     boundCollection = ROSE_utils.gffToLocusCollection(inputGFF)
 37 | 
 38 |     debugOutput = []
 39 |     #filter out all bound regions that overlap the TSS of an ACTIVE GENE
 40 |     if removeTSS:
 41 |         #first make a locus collection of TSS
 42 |         startDict = ROSE_utils.makeStartDict(annotFile)
 43 | 
 44 |         #now makeTSS loci for active genes
 45 |         removeTicker=0
 46 |         #this loop makes a locus centered around +/- tssWindow of transcribed genes
 47 |         #then adds it to the list tssLoci
 48 |         tssLoci = []
 49 |         for geneID in list(startDict.keys()):
 50 |             tssLoci.append(ROSE_utils.makeTSSLocus(geneID,startDict,tssWindow,tssWindow))
 51 | 
 52 | 
 53 |         #this turns the tssLoci list into a LocusCollection
 54 |         #50 is the internal parameter for LocusCollection and doesn't really matter
 55 |         tssCollection = ROSE_utils.LocusCollection(tssLoci,50)
 56 | 
 57 |         #gives all the loci in boundCollection
 58 |         boundLoci = boundCollection.getLoci()
 59 | 
 60 |         #this loop will check if each bound region is contained by the TSS exclusion zone
 61 |         #this will drop out a lot of the promoter only regions that are tiny
 62 |         #typical exclusion window is around 2kb
 63 |         for locus in boundLoci:
 64 |             if len(tssCollection.getContainers(locus,'both'))>0:
 65 |                 
 66 |                 #if true, the bound locus overlaps an active gene
 67 |                 boundCollection.remove(locus)
 68 |                 debugOutput.append([locus.__str__(),locus.ID(),'CONTAINED'])
 69 |                 removeTicker+=1
 70 |         print(('REMOVED %s LOCI BECAUSE THEY WERE CONTAINED BY A TSS' % (removeTicker)))
 71 | 
 72 |     #boundCollection is now all enriched region loci that don't overlap an active TSS
 73 |     stitchedCollection = boundCollection.stitchCollection(stitchWindow,'both')
 74 | 
 75 |     if removeTSS:
 76 |         #now replace any stitched region that overlap 2 distinct genes
 77 |         #with the original loci that were there
 78 |         fixedLoci = []
 79 |         tssLoci = []
 80 |         for geneID in list(startDict.keys()):
 81 |             tssLoci.append(ROSE_utils.makeTSSLocus(geneID,startDict,50,50))
 82 | 
 83 | 
 84 |         #this turns the tssLoci list into a LocusCollection
 85 |         #50 is the internal parameter for LocusCollection and doesn't really matter
 86 |         tssCollection = ROSE_utils.LocusCollection(tssLoci,50)
 87 |         removeTicker = 0
 88 |         originalTicker = 0
 89 |         for stitchedLocus in stitchedCollection.getLoci():
 90 |             overlappingTSSLoci = tssCollection.getOverlap(stitchedLocus,'both')
 91 |             tssNames = [startDict[tssLocus.ID()]['name'] for tssLocus in overlappingTSSLoci]
 92 |             tssNames = ROSE_utils.uniquify(tssNames)
 93 |             if len(tssNames) > 2:
 94 |             
 95 |                 #stitchedCollection.remove(stitchedLocus)
 96 |                 originalLoci = boundCollection.getOverlap(stitchedLocus,'both')
 97 |                 originalTicker+=len(originalLoci)
 98 |                 fixedLoci+=originalLoci
 99 |                 debugOutput.append([stitchedLocus.__str__(),stitchedLocus.ID(),'MULTIPLE_TSS'])
100 |                 removeTicker+=1
101 |             else:
102 |                 fixedLoci.append(stitchedLocus)
103 | 
104 |         print(('REMOVED %s STITCHED LOCI BECAUSE THEY OVERLAPPED MULTIPLE TSSs' % (removeTicker)))
105 |         print(('ADDED BACK %s ORIGINAL LOCI' % (originalTicker)))
106 |         fixedCollection = ROSE_utils.LocusCollection(fixedLoci,50)
107 |         return fixedCollection,debugOutput
108 |     else:
109 |         return stitchedCollection,debugOutput
110 | 
111 | #==================================================================
112 | #=====================REGION LINKING MAPPING=======================
113 | #==================================================================
114 | 
115 | def mapCollection(stitchedCollection,referenceCollection,bamFileList,mappedFolder,output,refName):
116 | 
117 | 
118 |     '''
119 |     makes a table of factor density in a stitched locus and ranks table by number of loci stitched together
120 |     '''
121 | 
122 |     
123 |     print('FORMATTING TABLE')
124 |     loci = stitchedCollection.getLoci()
125 | 
126 |     locusTable = [['REGION_ID','CHROM','START','STOP','NUM_LOCI','CONSTITUENT_SIZE']]
127 | 
128 |     lociLenList = []
129 | 
130 |     #strip out any that are in chrY
131 |     for locus in list(loci):
132 |         if locus.chr() == 'chrY':
133 |             loci.remove(locus)
134 |     
135 |     for locus in loci:
136 |         #numLociList.append(int(stitchLocus.ID().split('_')[1]))
137 |         lociLenList.append(locus.len())
138 |         #numOrder = order(numLociList,decreasing=True)
139 |     lenOrder = ROSE_utils.order(lociLenList,decreasing=True)
140 |     ticker = 0
141 |     for i in lenOrder:
142 |         ticker+=1
143 |         if ticker%1000 ==0:
144 |             print(ticker)
145 |         locus = loci[i]
146 | 
147 |         #First get the size of the enriched regions within the stitched locus
148 |         refEnrichSize = 0
149 |         refOverlappingLoci = referenceCollection.getOverlap(locus,'both')
150 |         for refLocus in refOverlappingLoci:
151 |             refEnrichSize+=refLocus.len()
152 | 
153 |         try:
154 |             stitchCount = int(locus.ID().split('_')[0])
155 |         except ValueError:
156 |             stitchCount = 1
157 |         
158 |         locusTable.append([locus.ID(),locus.chr(),locus.start(),locus.end(),stitchCount,refEnrichSize])
159 |         
160 |             
161 |     print('GETTING MAPPED DATA')
162 |     for bamFile in bamFileList:
163 |         
164 |         bamFileName = bamFile.split('/')[-1]
165 | 
166 |         print(('GETTING MAPPING DATA FOR  %s' % bamFile))
167 |         #assumes standard convention for naming enriched region gffs
168 |         
169 |         #opening up the mapped GFF
170 |         print(('OPENING %s%s_%s_MAPPED.gff' % (mappedFolder,refName,bamFileName)))
171 | 
172 |         mappedGFF =ROSE_utils.parseTable('%s%s_%s_MAPPED.gff' % (mappedFolder,refName,bamFileName),'\t')        
173 | 
174 |         signalDict = defaultdict(float)
175 |         print(('MAKING SIGNAL DICT FOR %s' % (bamFile)))
176 |         mappedLoci = []
177 |         for line in mappedGFF[1:]:
178 | 
179 |             chrom = line[1].split('(')[0]
180 |             start = int(line[1].split(':')[-1].split('-')[0])
181 |             end = int(line[1].split(':')[-1].split('-')[1])
182 |             mappedLoci.append(ROSE_utils.Locus(chrom,start,end,'.',line[0]))
183 |             try:
184 |                 signalDict[line[0]] = float(line[2])*(abs(end-start))
185 |             except ValueError:
186 |                 print('WARNING NO SIGNAL FOR LINE:')
187 |                 print(line)
188 |                 continue
189 |                 
190 |                 
191 |         
192 |         mappedCollection = ROSE_utils.LocusCollection(mappedLoci,500)
193 |         locusTable[0].append(bamFileName)
194 | 
195 |         for i in range(1,len(locusTable)):
196 |             signal=0.0
197 |             line = locusTable[i]
198 |             lineLocus = ROSE_utils.Locus(line[1],line[2],line[3],'.')
199 |             overlappingRegions = mappedCollection.getOverlap(lineLocus,sense='both')
200 |             for region in overlappingRegions:
201 |                 signal+= signalDict[region.ID()]
202 |             locusTable[i].append(signal)
203 | 
204 |     ROSE_utils.unParseTable(locusTable,output,'\t')
205 | 
206 | 
207 | 
208 | #==================================================================
209 | #=========================MAIN METHOD==============================
210 | #==================================================================
211 | 
212 | def main():
213 |     '''
214 |     main run call
215 |     '''
216 |     debug = False
217 | 
218 | 
219 |     from optparse import OptionParser
220 |     usage = "usage: %prog [options] -g [GENOME] -i [INPUT_REGION_GFF] -r [RANKBY_BAM_FILE] -o [OUTPUT_FOLDER] [OPTIONAL_FLAGS]"
221 |     parser = OptionParser(usage = usage)
222 |     #required flags
223 |     parser.add_option("-i","--i", dest="input",nargs = 1, default=None,
224 |                       help = "Enter a .gff or .bed file of binding sites used to make enhancers")
225 |     parser.add_option("-r","--rankby", dest="rankby",nargs = 1, default=None,
226 |                       help = "bamfile to rank enhancer by")
227 |     parser.add_option("-o","--out", dest="out",nargs = 1, default=None,
228 |                       help = "Enter an output folder")
229 |     parser.add_option("-g","--genome", dest="genome", default=None,
230 |                       help = "Enter the genome build (MM9,MM8,HG18,HG19,HG38)")
231 |     parser.add_option("--custom", dest="custom_genome", default=None,
232 |                       help = "Enter the custom genome annotation refseq.ucsc")
233 |     
234 |     #optional flags
235 |     parser.add_option("-b","--bams", dest="bams",nargs = 1, default=None,
236 |                       help = "Enter a comma separated list of additional bam files to map to")
237 |     parser.add_option("-c","--control", dest="control",nargs = 1, default=None,
238 |                       help = "bamfile to rank enhancer by")
239 |     parser.add_option("-s","--stitch", dest="stitch",nargs = 1, default=12500,
240 |                       help = "Enter a max linking distance for stitching")
241 |     parser.add_option("-t","--tss", dest="tss",nargs = 1, default=0,
242 |                       help = "Enter a distance from TSS to exclude. 0 = no TSS exclusion")
243 | 
244 | 
245 | 
246 | 
247 |     #RETRIEVING FLAGS
248 |     (options,args) = parser.parse_args()
249 | 
250 | 
251 |     if not options.input or not options.rankby or not options.out or not (options.genome or options.custom_genome):
252 |         print('hi there')
253 |         parser.print_help()
254 |         exit()
255 | 
256 |     #making the out folder if it doesn't exist
257 |     outFolder = ROSE_utils.formatFolder(options.out,True)
258 | 
259 |     
260 |     #figuring out folder schema
261 |     gffFolder = ROSE_utils.formatFolder(outFolder+'gff/',True)
262 |     mappedFolder = ROSE_utils.formatFolder(outFolder+ 'mappedGFF/',True)
263 | 
264 | 
265 |     #GETTING INPUT FILE
266 |     if options.input.split('.')[-1] == 'bed':
267 |         #CONVERTING A BED TO GFF
268 |         inputGFFName = options.input.split('/')[-1][0:-4]
269 |         inputGFFFile = '%s%s.gff' % (gffFolder,inputGFFName)
270 |         ROSE_utils.bedToGFF(options.input,inputGFFFile)
271 |     elif options.input.split('.')[-1] =='gff':
272 |         #COPY THE INPUT GFF TO THE GFF FOLDER
273 |         inputGFFFile = options.input
274 |         os.system('cp %s %s' % (inputGFFFile,gffFolder))        
275 | 
276 |     else:
277 |         print('WARNING: INPUT FILE DOES NOT END IN .gff or .bed. ASSUMING .gff FILE FORMAT')
278 |         #COPY THE INPUT GFF TO THE GFF FOLDER
279 |         inputGFFFile = options.input
280 |         os.system('cp %s %s' % (inputGFFFile,gffFolder))        
281 | 
282 | 
283 |     #GETTING THE LIST OF BAMFILES TO PROCESS
284 |     if options.control:        
285 |         bamFileList = [options.rankby,options.control]
286 | 
287 |     else:
288 |         bamFileList = [options.rankby]
289 | 
290 |     if options.bams:
291 |         bamFileList += options.bams.split(',')
292 |         bamFileLIst = ROSE_utils.uniquify(bamFileList)
293 |     #optional args
294 | 
295 |     #Stitch parameter
296 |     stitchWindow = int(options.stitch)
297 |     
298 |     #tss options
299 |     tssWindow = int(options.tss)
300 |     if tssWindow != 0:
301 |         removeTSS = True
302 |     else:
303 |         removeTSS = False
304 | 
305 |     #GETTING THE BOUND REGION FILE USED TO DEFINE ENHANCERS
306 |     print(('USING %s AS THE INPUT GFF' % (inputGFFFile)))
307 |     inputName = inputGFFFile.split('/')[-1].split('.')[0]
308 | 
309 |     #GETTING THE CORRECT ANNOT FILE
310 |     cwd = os.getcwd()
311 |     genomeDict = {
312 |         'HG18':'%s/annotation/hg18_refseq.ucsc' % (cwd),
313 |         'MM9': '%s/annotation/mm9_refseq.ucsc' % (cwd),
314 |         'HG19':'%s/annotation/hg19_refseq.ucsc' % (cwd),
315 |         'HG38':'%s/annotation/hg38_refseq.ucsc' % (cwd),
316 |         'MM8': '%s/annotation/mm8_refseq.ucsc' % (cwd),
317 |         'MM10':'%s/annotation/mm10_refseq.ucsc' % (cwd),
318 |         }
319 | 
320 |     #GETTING THE GENOME
321 |     if options.custom_genome:
322 |         annotFile = options.custom_genome
323 |         print('USING CUSTOM GENOME %s AS THE GENOME FILE' % options.custom_genome)
324 |     else:
325 |         genome = options.genome
326 |         annotFile = genomeDict[genome.upper()]
327 |         print('USING %s AS THE GENOME' % genome)
328 | 
329 | 
330 |     #MAKING THE START DICT
331 |     print('MAKING START DICT')
332 |     startDict = ROSE_utils.makeStartDict(annotFile)
333 | 
334 | 
335 |     #LOADING IN THE BOUND REGION REFERENCE COLLECTION
336 |     print('LOADING IN GFF REGIONS')
337 |     referenceCollection = ROSE_utils.gffToLocusCollection(inputGFFFile)
338 | 
339 | 
340 |     #NOW STITCH REGIONS
341 |     print('STITCHING REGIONS TOGETHER')
342 |     stitchedCollection,debugOutput = regionStitching(inputGFFFile,stitchWindow,tssWindow,annotFile,removeTSS)
343 |     
344 | 
345 |     
346 |     #NOW MAKE A STITCHED COLLECTION GFF
347 |     print('MAKING GFF FROM STITCHED COLLECTION')
348 |     stitchedGFF=ROSE_utils.locusCollectionToGFF(stitchedCollection)
349 |     
350 |     if not removeTSS:
351 |         stitchedGFFFile = '%s%s_%sKB_STITCHED.gff' % (gffFolder,inputName,stitchWindow/1000)
352 |         stitchedGFFName = '%s_%sKB_STITCHED' % (inputName,stitchWindow/1000)
353 |         debugOutFile = '%s%s_%sKB_STITCHED.debug' % (gffFolder,inputName,stitchWindow/1000)
354 |     else:
355 |         stitchedGFFFile = '%s%s_%sKB_STITCHED_TSS_DISTAL.gff' % (gffFolder,inputName,stitchWindow/1000)
356 |         stitchedGFFName = '%s_%sKB_STITCHED_TSS_DISTAL' % (inputName,stitchWindow/1000)
357 |         debugOutFile = '%s%s_%sKB_STITCHED_TSS_DISTAL.debug' % (gffFolder,inputName,stitchWindow/1000)
358 | 
359 |     #WRITING DEBUG OUTPUT TO DISK
360 |         
361 |     if debug:
362 |         print(('WRITING DEBUG OUTPUT TO DISK AS %s' % (debugOutFile)))
363 |         ROSE_utils.unParseTable(debugOutput,debugOutFile,'\t')
364 | 
365 |     #WRITE THE GFF TO DISK
366 |     print(('WRITING STITCHED GFF TO DISK AS %s' % (stitchedGFFFile)))
367 |     ROSE_utils.unParseTable(stitchedGFF,stitchedGFFFile,'\t')
368 | 
369 | 
370 | 
371 |     #SETTING UP THE OVERALL OUTPUT FILE
372 |     outputFile1 = outFolder + stitchedGFFName + '_REGION_MAP.txt'
373 | 
374 |     print(('OUTPUT WILL BE WRITTEN TO  %s' % (outputFile1)))
375 |     
376 |     #MAPPING TO THE NON STITCHED (ORIGINAL GFF)
377 |     #MAPPING TO THE STITCHED GFF
378 | 
379 | 
380 |     # bin for bam mapping
381 |     nBin =1
382 | 
383 |     #IMPORTANT
384 |     #CHANGE cmd1 and cmd2 TO PARALLELIZE OUTPUT FOR BATCH SUBMISSION
385 |     #e.g. if using LSF cmd1 = "bsub python bamToGFF.py -f 1 -e 200 -r -m %s -b %s -i %s -o %s" % (nBin,bamFile,stitchedGFFFile,mappedOut1)
386 | 
387 |     for bamFile in bamFileList:
388 | 
389 |         bamFileName = bamFile.split('/')[-1]
390 | 
391 |         #MAPPING TO THE STITCHED GFF
392 |         mappedOut1 ='%s%s_%s_MAPPED.gff' % (mappedFolder,stitchedGFFName,bamFileName)
393 |         #WILL TRY TO RUN AS A BACKGROUND PROCESS. BATCH SUBMIT THIS LINE TO IMPROVE SPEED
394 |         cmd1 = "ROSE_bamToGFF.py -f 1 -e 200 -r -m %s -b %s -i %s -o %s &" % (nBin,bamFile,stitchedGFFFile,mappedOut1)
395 |         print(cmd1)
396 |         os.system(cmd1)
397 | 
398 |         #MAPPING TO THE ORIGINAL GFF
399 |         mappedOut2 ='%s%s_%s_MAPPED.gff' % (mappedFolder,inputName,bamFileName)
400 |         #WILL TRY TO RUN AS A BACKGROUND PROCESS. BATCH SUBMIT THIS LINE TO IMPROVE SPEED
401 |         cmd2 = "ROSE_bamToGFF.py -f 1 -e 200 -r -m %s -b %s -i %s -o %s &" % (nBin,bamFile,inputGFFFile,mappedOut2)
402 |         print(cmd2)
403 |         os.system(cmd2)
404 |         
405 | 
406 |     
407 |     print('PAUSING TO MAP')
408 |     time.sleep(10)
409 | 
410 |     #CHECK FOR MAPPING OUTPUT
411 |     outputDone = False
412 |     ticker = 0
413 |     print('WAITING FOR MAPPING TO COMPLETE. ELAPSED TIME (MIN):')
414 |     while not outputDone:
415 | 
416 |         '''
417 |         check every 5 minutes for completed output
418 |         '''
419 |         outputDone = True
420 |         if ticker%6 == 0:
421 |             print((ticker*5))
422 |         ticker +=1
423 |         #CHANGE THIS PARAMETER TO ALLOW MORE TIME TO MAP
424 |         if ticker == 144:
425 |             print('ERROR: OPERATION TIME OUT. MAPPING OUTPUT NOT DETECTED')
426 |             exit()
427 |             break
428 |         for bamFile in bamFileList:
429 |             
430 |             #GET THE MAPPED OUTPUT NAMES HERE FROM MAPPING OF EACH BAMFILE
431 |             bamFileName = bamFile.split('/')[-1]
432 |             mappedOut1 ='%s%s_%s_MAPPED.gff' % (mappedFolder,stitchedGFFName,bamFileName)
433 | 
434 |             try:
435 |                  mapFile = open(mappedOut1,'r')
436 |                  mapFile.close()
437 |             except IOError:
438 |                 outputDone = False
439 | 
440 |             mappedOut2 ='%s%s_%s_MAPPED.gff' % (mappedFolder,inputName,bamFileName)
441 |             
442 |             try:
443 |                 mapFile = open(mappedOut2,'r')
444 |                 mapFile.close()
445 |             except IOError:
446 |                 outputDone = False
447 |         if outputDone == True:
448 |             break
449 |         time.sleep(300)
450 |     print(('MAPPING TOOK %s MINUTES' % (ticker*5)))
451 | 
452 |     print('BAM MAPPING COMPLETED NOW MAPPING DATA TO REGIONS')
453 |     #CALCULATE DENSITY BY REGION
454 |     mapCollection(stitchedCollection,referenceCollection,bamFileList,mappedFolder,outputFile1,refName = stitchedGFFName)
455 | 
456 | 
457 |     time.sleep(10)
458 | 
459 |     print('CALLING AND PLOTTING SUPER-STITCHED PEAKS')
460 | 
461 | 
462 |     if options.control:
463 |         rankbyName = options.rankby.split('/')[-1]
464 |         controlName = options.control.split('/')[-1]
465 |         cmd = 'ROSE_callSuper.R %s %s %s %s' % (outFolder,outputFile1,inputName,controlName)
466 | 
467 |     else:
468 |         rankbyName = options.rankby.split('/')[-1]
469 |         controlName = 'NONE'
470 |         cmd = 'ROSE_callSuper.R %s %s %s %s' % (outFolder,outputFile1,inputName,controlName)
471 |     print(cmd)
472 |     os.system(cmd)
473 |     
474 |     #calling the gene mapper
475 |     time.sleep(60)
476 |     superTableFile = "%s/%s_SuperStitched.table.txt" % (outFolder,inputName)
477 |     allTableFile = "%s/%s_AllStitched.table.txt" % (outFolder,inputName)
478 | 
479 |     suffixScript = ''
480 |     if options.control: suffixScript = '-c'
481 |     if options.custom_genome:
482 |         cmd1 = "ROSE_geneMapper.py --custom %s -i %s -r TRUE %s" % (options.custom_genome,superTableFile,suffixScript)
483 |         cmd2 = "ROSE_geneMapper.py --custom %s -i %s -r TRUE %s " % (options.custom_genome,allTableFile,suffixScript)
484 |     else:
485 |         cmd1 = "ROSE_geneMapper.py -g %s -i %s -r TRUE %s" % (genome,superTableFile,suffixScript)
486 |         cmd2 = "ROSE_geneMapper.py -g %s -i %s -r TRUE %s" % (genome,allTableFile,suffixScript)
487 | 
488 |     #gene mapper for super-stitched peaks
489 |     print(cmd1)
490 |     os.system(cmd1)
491 |     
492 |     #gene mapper for stitched peaks
493 |     print(cmd2)
494 |     os.system(cmd2)
495 |     
496 | if __name__ == "__main__":
497 |     main()
498 | 


--------------------------------------------------------------------------------
/lib/ROSE_utils.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | 
  3 | import re
  4 | 
  5 | from string import *
  6 | 
  7 | import subprocess
  8 | import datetime
  9 | 
 10 | from collections import defaultdict
 11 | 
 12 | #SET OF UTILITY FUNCTIONS FOR mapEnhancerFromFactor.py
 13 | 
 14 | #==================================================================
 15 | #==========================I/O FUNCTIONS===========================
 16 | #==================================================================
 17 | 
 18 | #unParseTable 4/14/08
 19 | #takes in a table generated by parseTable and writes it to an output file
 20 | #takes as parameters (table, output, sep), where sep is how the file is delimited
 21 | #example call unParseTable(table, 'table.txt', '\t') for a tab del file
 22 | 
 23 | def unParseTable(table, output, sep):
 24 |     fh_out = open(output,'w')
 25 |     if len(sep) == 0:
 26 |         for i in table:
 27 |             fh_out.write(str(i))
 28 |             fh_out.write('\n')
 29 |     else:
 30 |         for line in table:
 31 |             line = [str(x) for x in line]
 32 |             line = sep.join(line)
 33 | 
 34 |             fh_out.write(line)
 35 |             fh_out.write('\n')
 36 | 
 37 |     fh_out.close()
 38 | 
 39 | #parseTable 4/14/08
 40 | #takes in a table where columns are separated by a given symbol and outputs
 41 | #a nested list such that list[row][col]
 42 | #example call:
 43 | #table = parseTable('file.txt','\t')
 44 | def parseTable(fn, sep, header = False,excel = False):
 45 |     fh = open(fn)
 46 |     lines = fh.readlines()
 47 |     fh.close()
 48 |     if excel:
 49 |         lines = lines[0].split('\r')
 50 |     if lines[0].count('\r') > 0:
 51 |         lines = lines[0].split('\r')
 52 |     table = []
 53 |     if header == True:
 54 |         lines =lines[1:]
 55 |     for i in lines:
 56 |         table.append(i[:-1].split(sep))
 57 | 
 58 |     return table
 59 | 
 60 | 
 61 | def bedToGFF(bed,output=''):
 62 | 
 63 |     '''
 64 |     turns a bed into a gff file
 65 |     '''
 66 |     if type(bed) == str:
 67 |         bed = parseTable(bed,'\t')
 68 | 
 69 |     gff = []
 70 |     for line in bed:
 71 |         gffLine = [line[0],line[3],'',line[1],line[2],line[4],'.','',line[3]]
 72 |         gff.append(gffLine)
 73 | 
 74 |     if len(output) > 0:
 75 |         unParseTable(gff,output,'\t')
 76 |     else:
 77 |         return gff
 78 | 
 79 | 
 80 | #100912
 81 | #gffToBed
 82 | 
 83 | def gffToBed(gff,output= ''):
 84 |     '''
 85 |     turns a gff to a bed file
 86 |     '''
 87 |     bed = []
 88 |     for line in gff:
 89 |         newLine = [line[0],line[3],line[4],line[1],0,line[6]]
 90 |         bed.append(newLine)
 91 |     if len(output) == 0:
 92 |         return bed
 93 |     else:
 94 |         unParseTable(bed,output,'\t')
 95 | 
 96 | def formatFolder(folderName,create=False):
 97 | 
 98 |     '''
 99 |     makes sure a folder exists and if not makes it
100 |     returns a bool for folder
101 |     '''
102 |     
103 |     if folderName[-1] != '/':
104 |         folderName +='/'
105 | 
106 |     try: 
107 |         foo = os.listdir(folderName)
108 |         return folderName
109 |     except OSError:
110 |         print(('folder %s does not exist' % (folderName)))
111 |         if create:
112 |             os.system('mkdir %s' % (folderName))
113 |             return folderName
114 |         else:
115 |                     
116 |             return False 
117 | 
118 | 
119 | 
120 | #==================================================================
121 | #===================ANNOTATION FUNCTIONS===========================
122 | #==================================================================
123 | 
124 | 
125 | def makeStartDict(annotFile,geneList = []):
126 |     '''
127 |     makes a dictionary keyed by refseq ID that contains information about 
128 |     chrom/start/stop/strand/common name
129 |     '''
130 | 
131 |     if type(geneList) == str:
132 |         geneList = parseTable(geneList,'\t')
133 |         geneList = [line[0] for line in geneList]
134 |             
135 |     if annotFile.upper().count('REFSEQ') >= 0:
136 |         refseqTable,refseqDict = importRefseq(annotFile)
137 |         if len(geneList) == 0:
138 |             geneList = list(refseqDict.keys())
139 |         startDict = {}
140 |         for gene in geneList:
141 |             if (gene in refseqDict) == False:
142 |                 continue
143 |             startDict[gene]={}
144 |             startDict[gene]['sense'] = refseqTable[refseqDict[gene][0]][3]
145 |             startDict[gene]['chr'] = refseqTable[refseqDict[gene][0]][2]
146 |             startDict[gene]['start'] = getTSSs([gene],refseqTable,refseqDict)
147 |             if startDict[gene]['sense'] == '+':
148 |                 startDict[gene]['end'] =[int(refseqTable[refseqDict[gene][0]][5])]
149 |             else:
150 |                 startDict[gene]['end'] = [int(refseqTable[refseqDict[gene][0]][4])]
151 |             startDict[gene]['name'] = refseqTable[refseqDict[gene][0]][12]
152 |     return startDict
153 | 
154 | 
155 | #generic function to get the TSS of any gene
156 | def getTSSs(geneList,refseqTable,refseqDict):
157 |     #refseqTable,refseqDict = importRefseq(refseqFile)
158 |     if len(geneList) == 0:
159 |         refseq = refseqTable
160 |     else:
161 |         refseq = refseqFromKey(geneList,refseqDict,refseqTable)
162 |     TSS = []
163 |     for line in refseq:
164 |         if line[3] == '+':
165 |             TSS.append(line[4])
166 |         if line[3] == '-':
167 |             TSS.append(line[5])
168 |     TSS = list(map(int,TSS))
169 |     
170 |     return TSS
171 | 
172 | 
173 | #12/29/08
174 | #refseqFromKey(refseqKeyList,refseqDict,refseqTable)
175 | #function that grabs refseq lines from refseq IDs
176 | def refseqFromKey(refseqKeyList,refseqDict,refseqTable):
177 |     typeRefseq = []
178 |     for name in refseqKeyList:
179 |         if name in refseqDict:
180 |             typeRefseq.append(refseqTable[refseqDict[name][0]])
181 |     return typeRefseq
182 | 
183 | 
184 | 
185 | #10/13/08
186 | #importRefseq
187 | #takes in a refseq table and makes a refseq table and a refseq dictionary for keying the table
188 | 
189 | def importRefseq(refseqFile, returnMultiples = False):
190 | 
191 |     '''
192 |     opens up a refseq file downloaded by UCSC
193 |     '''
194 |     refseqTable = parseTable(refseqFile,'\t')
195 |     refseqDict = {}
196 |     ticker = 1
197 |     for line in refseqTable[1:]:
198 |         if line[1] in refseqDict:
199 |             refseqDict[line[1]].append(ticker)
200 |         else:
201 |             refseqDict[line[1]] = [ticker]
202 |         ticker = ticker + 1
203 | 
204 |     multiples = []
205 |     for i in refseqDict:
206 |         if len(refseqDict[i]) > 1:
207 |             multiples.append(i)
208 | 
209 |     if returnMultiples == True:
210 |         return refseqTable,refseqDict,multiples
211 |     else:
212 |         return refseqTable,refseqDict
213 | 
214 | 
215 | #==================================================================
216 | #========================LOCUS INSTANCE============================
217 | #==================================================================
218 | 
219 | #Locus and LocusCollection instances courtesy of Graham Ruby
220 | 
221 | 
222 | class Locus:
223 |     # this may save some space by reducing the number of chromosome strings
224 |     # that are associated with Locus instances (see __init__).
225 |     __chrDict = dict()
226 |     __senseDict = {'+':'+', '-':'-', '.':'.'}
227 |     # chr = chromosome name (string)
228 |     # sense = '+' or '-' (or '.' for an ambidexterous locus)
229 |     # start,end = ints of the start and end coords of the locus
230 |     #      end coord is the coord of the last nucleotide.
231 |     def __init__(self,chr,start,end,sense,ID=''):
232 |         coords = [int(start),int(end)]
233 |         coords.sort()
234 |         # this method for assigning chromosome should help avoid storage of
235 |         # redundant strings.
236 |         if not(chr in self.__chrDict): self.__chrDict[chr] = chr
237 |         self._chr = self.__chrDict[chr]
238 |         self._sense = self.__senseDict[sense]
239 |         self._start = int(coords[0])
240 |         self._end = int(coords[1])
241 |         self._ID = ID
242 |     def ID(self): return self._ID
243 |     def chr(self): return self._chr
244 |     def start(self): return self._start  ## returns the smallest coordinate
245 |     def end(self): return self._end   ## returns the biggest coordinate
246 |     def len(self): return self._end - self._start + 1
247 |     def getAntisenseLocus(self):
248 |         if self._sense=='.': return self
249 |         else:
250 |             switch = {'+':'-', '-':'+'}
251 |             return Locus(self._chr,self._start,self._end,switch[self._sense])
252 |     def coords(self): return [self._start,self._end]  ## returns a sorted list of the coordinates
253 |     def sense(self): return self._sense
254 |     # returns boolean; True if two loci share any coordinates in common
255 |     def overlaps(self,otherLocus):
256 |         if self.chr()!=otherLocus.chr(): return False
257 |         elif not(self._sense=='.' or \
258 |                  otherLocus.sense()=='.' or \
259 |                  self.sense()==otherLocus.sense()): return False
260 |         elif self.start() > otherLocus.end() or otherLocus.start() > self.end(): return False
261 |         else: return True
262 |         
263 |     # returns boolean; True if all the nucleotides of the given locus overlap
264 |     #      with the self locus
265 |     def contains(self,otherLocus):
266 |         if self.chr()!=otherLocus.chr(): return False
267 |         elif not(self._sense=='.' or \
268 |                  otherLocus.sense()=='.' or \
269 |                  self.sense()==otherLocus.sense()): return False
270 |         elif self.start() > otherLocus.start() or otherLocus.end() > self.end(): return False
271 |         else: return True
272 |         
273 |     # same as overlaps, but considers the opposite strand
274 |     def overlapsAntisense(self,otherLocus):
275 |         return self.getAntisenseLocus().overlaps(otherLocus)
276 |     # same as contains, but considers the opposite strand
277 |     def containsAntisense(self,otherLocus):
278 |         return self.getAntisenseLocus().contains(otherLocus)
279 |     def __hash__(self): return self._start + self._end
280 |     def __eq__(self,other):
281 |         if self.__class__ != other.__class__: return False
282 |         if self.chr()!=other.chr(): return False
283 |         if self.start()!=other.start(): return False
284 |         if self.end()!=other.end(): return False
285 |         if self.sense()!=other.sense(): return False
286 |         return True
287 |     def __ne__(self,other): return not(self.__eq__(other))
288 |     def __str__(self): return self.chr()+'('+self.sense()+'):'+'-'.join(map(str,self.coords()))
289 |     def checkRep(self):
290 |         pass
291 | 
292 | 
293 | class LocusCollection:
294 |     def __init__(self,loci,windowSize):
295 |         ### top-level keys are chr, then strand, no space
296 |         self.__chrToCoordToLoci = dict()
297 |         self.__loci = dict()
298 |         self.__winSize = windowSize
299 |         for lcs in loci: self.__addLocus(lcs)
300 | 
301 |     def __addLocus(self,lcs):
302 |         if not(lcs in self.__loci):
303 |             self.__loci[lcs] = None
304 |             if lcs.sense()=='.': chrKeyList = [lcs.chr()+'+', lcs.chr()+'-']
305 |             else: chrKeyList = [lcs.chr()+lcs.sense()]
306 |             for chrKey in chrKeyList:
307 |                 if not(chrKey in self.__chrToCoordToLoci): self.__chrToCoordToLoci[chrKey] = dict()
308 |                 for n in self.__getKeyRange(lcs):
309 |                     if not(n in self.__chrToCoordToLoci[chrKey]): self.__chrToCoordToLoci[chrKey][n] = []
310 |                     self.__chrToCoordToLoci[chrKey][n].append(lcs)
311 | 
312 |     def __getKeyRange(self,locus):
313 |         start = locus.start() // self.__winSize
314 |         end = locus.end() // self.__winSize + 1 ## add 1 because of the range
315 |         return range(start, end)
316 | 
317 |     def __len__(self): return len(self.__loci)
318 |         
319 |     def append(self,new): self.__addLocus(new)
320 |     def extend(self,newList):
321 |         for lcs in newList: self.__addLocus(lcs)
322 |     def hasLocus(self,locus):
323 |         return locus in self.__loci
324 |     def remove(self,old):
325 |         if not(old in self.__loci): raise ValueError("requested locus isn't in collection")
326 |         del self.__loci[old]
327 |         if old.sense()=='.': senseList = ['+','-']
328 |         else: senseList = [old.sense()]
329 |         for k in self.__getKeyRange(old):
330 |             for sense in senseList:
331 |                 self.__chrToCoordToLoci[old.chr()+sense][k].remove(old)
332 | 
333 |     def getWindowSize(self): return self.__winSize
334 |     def getLoci(self): return list(self.__loci.keys())
335 |     def getChrList(self):
336 |         # i need to remove the strand info from the chromosome keys and make
337 |         # them non-redundant.
338 |         tempKeys = dict()
339 |         for k in list(self.__chrToCoordToLoci.keys()): tempKeys[k[:-1]] = None
340 |         return list(tempKeys.keys())
341 |             
342 |     def __subsetHelper(self,locus,sense):
343 |         sense = sense.lower()
344 |         if ['sense','antisense','both'].count(sense)!=1:
345 |             raise ValueError("sense command invalid: '"+sense+"'.")
346 |         matches = dict()
347 |         senses = ['+','-']
348 |         if locus.sense()=='.' or sense=='both': lamb = lambda s: True
349 |         elif sense=='sense': lamb = lambda s: s==locus.sense()
350 |         elif sense=='antisense': lamb = lambda s: s!=locus.sense()
351 |         else: raise ValueError("sense value was inappropriate: '"+sense+"'.")
352 |         for s in filter(lamb, senses):
353 |             chrKey = locus.chr()+s
354 |             if chrKey in self.__chrToCoordToLoci:
355 |                 for n in self.__getKeyRange(locus):
356 |                     if n in self.__chrToCoordToLoci[chrKey]:
357 |                         for lcs in self.__chrToCoordToLoci[chrKey][n]:
358 |                             matches[lcs] = None
359 |         return list(matches.keys())
360 |         
361 |     # sense can be 'sense' (default), 'antisense', or 'both'
362 |     # returns all members of the collection that overlap the locus
363 |     def getOverlap(self,locus,sense='sense'):
364 |         matches = self.__subsetHelper(locus,sense)
365 |         ### now, get rid of the ones that don't really overlap
366 |         realMatches = dict()
367 |         if sense=='sense' or sense=='both':
368 |             for i in [lcs for lcs in matches if lcs.overlaps(locus)]:
369 |                 realMatches[i] = None
370 |         if sense=='antisense' or sense=='both':
371 |             for i in [lcs for lcs in matches if lcs.overlapsAntisense(locus)]:
372 |                 realMatches[i] = None 
373 |         return list(realMatches.keys())
374 | 
375 |     # sense can be 'sense' (default), 'antisense', or 'both'
376 |     # returns all members of the collection that are contained by the locus
377 |     def getContained(self,locus,sense='sense'):
378 |         matches = self.__subsetHelper(locus,sense)
379 |         ### now, get rid of the ones that don't really overlap
380 |         realMatches = dict()
381 |         if sense=='sense' or sense=='both':
382 |             for i in [lcs for lcs in matches if locus.contains(lcs)]:
383 |                 realMatches[i] = None
384 |         if sense=='antisense' or sense=='both':
385 |             for i in [lcs for lcs in matches if locus.containsAntisense(lcs)]:
386 |                 realMatches[i] = None
387 |         return list(realMatches.keys())
388 | 
389 |     # sense can be 'sense' (default), 'antisense', or 'both'
390 |     # returns all members of the collection that contain the locus
391 |     def getContainers(self,locus,sense='sense'):
392 |         matches = self.__subsetHelper(locus,sense)
393 |         ### now, get rid of the ones that don't really overlap
394 |         realMatches = dict()
395 |         if sense=='sense' or sense=='both':
396 |             for i in [lcs for lcs in matches if lcs.contains(locus)]:
397 |                 realMatches[i] = None
398 |         if sense=='antisense' or sense=='both':
399 |             for i in [lcs for lcs in matches if lcs.containsAntisense(locus)]:
400 |                 realMatches[i] = None
401 |         return list(realMatches.keys())
402 | 
403 |     def stitchCollection(self,stitchWindow=1,sense='both'):
404 | 
405 |         '''
406 |         reduces the collection by stitching together overlapping loci
407 |         returns a new collection
408 |         '''
409 | 
410 |         #initializing stitchWindow to 1 
411 |         #this helps collect directly adjacent loci
412 | 
413 | 
414 | 
415 |         locusList = self.getLoci()
416 |         oldCollection = LocusCollection(locusList,500)
417 |         
418 |         stitchedCollection = LocusCollection([],500)
419 | 
420 |         for locus in locusList:
421 |             #print(locus.coords())
422 |             if oldCollection.hasLocus(locus):
423 |                 oldCollection.remove(locus)
424 |                 overlappingLoci = oldCollection.getOverlap(Locus(locus.chr(),locus.start()-stitchWindow,locus.end()+stitchWindow,locus.sense(),locus.ID()),sense)
425 |                 
426 |                 stitchTicker = 1
427 |                 while len(overlappingLoci) > 0:
428 |                     stitchTicker+=len(overlappingLoci)
429 |                     overlapCoords = locus.coords()
430 | 
431 | 
432 |                     for overlappingLocus in overlappingLoci:
433 |                         overlapCoords+=overlappingLocus.coords()
434 |                         oldCollection.remove(overlappingLocus)
435 |                     if sense == 'both':
436 |                         locus = Locus(locus.chr(),min(overlapCoords),max(overlapCoords),'.',locus.ID())
437 |                     else:
438 |                         locus = Locus(locus.chr(),min(overlapCoords),max(overlapCoords),locus.sense(),locus.ID())
439 |                     overlappingLoci = oldCollection.getOverlap(Locus(locus.chr(),locus.start()-stitchWindow,locus.end()+stitchWindow,locus.sense()),sense)
440 |                 locus._ID = '%s_%s_lociStitched' % (stitchTicker,locus.ID())
441 |            
442 |                 stitchedCollection.append(locus)
443 |                 
444 |             else:
445 |                 continue
446 |         return stitchedCollection
447 | 
448 | 
449 | 
450 | #==================================================================
451 | #========================LOCUS FUNCTIONS===========================
452 | #==================================================================
453 | #06/11/09
454 | #turns a locusCollection into a gff
455 | #does not write to disk though
456 | def locusCollectionToGFF(locusCollection):
457 |     lociList = locusCollection.getLoci()
458 |     gff = []
459 |     for locus in lociList:
460 |         newLine = [locus.chr(),locus.ID(),'',locus.coords()[0],locus.coords()[1],'',locus.sense(),'',locus.ID()]
461 |         gff.append(newLine)
462 |     return gff
463 | 
464 | 
465 | def gffToLocusCollection(gff,window =500):
466 | 
467 |     '''
468 |     opens up a gff file and turns it into a LocusCollection instance
469 |     '''
470 | 
471 |     lociList = []
472 |     if type(gff) == str:
473 |         gff = parseTable(gff,'\t')
474 | 
475 |     for line in gff:
476 |         #USE line[2] as the locus ID.  If that is empty use line[8]
477 |         if len(line[2]) > 0:
478 |             name = line[2]
479 |         elif len(line[8]) >0:
480 |             name = line[8]
481 |         else:
482 |             name = '%s:%s:%s-%s' % (line[0],line[6],line[3],line[4])
483 | 
484 |         lociList.append(Locus(line[0],line[3],line[4],line[6],name))
485 |     return LocusCollection(lociList,window)
486 | 
487 | 
488 | #maketranscriptCollection
489 | #04/07/09
490 | #makes a LocusCollection w/ each transcript as a locus
491 | #bob = makeTranscriptCollection('/Users/chazlin/genomes/mm8/mm8refseq.txt')
492 | def makeTranscriptCollection(annotFile,upSearch,downSearch,window = 500,geneList = []):
493 |     '''
494 |     makes a LocusCollection w/ each transcript as a locus
495 |     takes in a refseqfile
496 |     '''
497 | 
498 |     if annotFile.upper().count('REFSEQ') >= 0:
499 |         refseqTable,refseqDict = importRefseq(annotFile)
500 |         locusList = []
501 |         ticker = 0
502 |         if len(geneList) == 0:
503 |             geneList =list(refseqDict.keys())
504 |         for line in refseqTable[1:]:
505 |             if geneList.count(line[1]) > 0:
506 |                 if line[3] == '-':
507 |                     locus = Locus(line[2],int(line[4])-downSearch,int(line[5])+upSearch,line[3],line[1])
508 |                 else:
509 |                     locus = Locus(line[2],int(line[4])-upSearch,int(line[5])+downSearch,line[3],line[1])
510 |                 locusList.append(locus)
511 |                 ticker = ticker + 1
512 |                 if ticker%1000 == 0:
513 |                     print(ticker)
514 |             
515 | 
516 |     transCollection = LocusCollection(locusList,window)
517 | 
518 |     return transCollection
519 | 
520 | 
521 | 
522 | def makeTSSLocus(gene,startDict,upstream,downstream):
523 |     '''
524 |     given a startDict, make a locus for any gene's TSS w/ upstream and downstream windows
525 |     '''
526 |     
527 |     start = startDict[gene]['start'][0]
528 |     if startDict[gene]['sense'] =='-':
529 |         return Locus(startDict[gene]['chr'],start-downstream,start+upstream,'-',gene)
530 |     else:
531 |         return Locus(startDict[gene]['chr'],start-upstream,start+downstream,'+',gene)
532 | 
533 | 
534 | #06/11/09
535 | #takes a locus and expands it by a fixed upstream/downstream amount. spits out the new larger locus
536 | def makeSearchLocus(locus,upSearch,downSearch):
537 |     if locus.sense() == '-':
538 |         searchLocus = Locus(locus.chr(),locus.start()-downSearch,locus.end()+upSearch,locus.sense(),locus.ID())
539 |     else:
540 |         searchLocus = Locus(locus.chr(),locus.start()-upSearch,locus.end()+downSearch,locus.sense(),locus.ID())
541 |     return searchLocus
542 | 
543 | 
544 | #==================================================================
545 | #==========================BAM CLASS===============================
546 | #==================================================================
547 | 
548 | #11/11/10
549 | #makes a new class Bam for dealing with bam files and integrating them into the SolexaRun class
550 | 
551 | 
552 | def checkChrStatus(bamFile):
553 |     command = 'samtools view %s | head -n 1' % (bamFile)
554 |     #print "TESTING"
555 |     #print command
556 |     stats = subprocess.Popen(command,stdin = subprocess.PIPE,stderr = subprocess.PIPE,stdout = subprocess.PIPE,shell = True)
557 |     statLines = stats.stdout.readlines()
558 |     stats.stdout.close()
559 |     chrPattern = re.compile('chr')
560 |     for line in statLines:
561 |         #print line
562 |         line = line.decode("utf-8")
563 |         sline = line.split("\t")
564 |         #print sline[2]
565 |         if re.search(chrPattern, sline[2]):
566 |             return 1
567 |         else:
568 |             return 0
569 | 
570 | def convertBitwiseFlag(flag):
571 |     if int(flag) & 16:
572 |         return "-"
573 |     else:
574 |         return "+"
575 |            
576 | class Bam:
577 |     '''A class for a sorted and indexed bam file that allows easy analysis of reads'''
578 |     def __init__(self,bamFile):
579 |         self._bam = bamFile
580 | 
581 |     def getTotalReads(self,readType = 'mapped'):
582 |         command = 'samtools flagstat %s' % (self._bam)
583 |         stats = subprocess.Popen(command,stdin = subprocess.PIPE,stderr = subprocess.PIPE,stdout = subprocess.PIPE,shell = True)
584 |         statLines = stats.stdout.readlines()
585 |         stats.stdout.close()
586 |         if readType == 'mapped':
587 |             for line in statLines:
588 |                 line = line.decode("utf-8")
589 |                 if line.count('mapped (') == 1:
590 |                     
591 |                     return int(line.split(' ')[0])
592 |         if readType == 'total':
593 |             return int(statLines[0].split(' ')[0])
594 |     
595 |     def convertBitwiseFlag(self,flag):
596 |         if flag & 16:
597 |             return "-"
598 |         else:
599 |             return "+"
600 | 
601 |     def getRawReads(self,locus,sense,unique = False,includeJxnReads = False,printCommand = False):
602 |         '''
603 |         gets raw reads from the bam using samtools view.
604 |         can enforce uniqueness and strandedness
605 |         '''
606 |         locusLine = locus.chr()+':'+str(locus.start())+'-'+str(locus.end())
607 |         
608 |         command = 'samtools view %s %s' % (self._bam,locusLine)
609 |         if printCommand:
610 |             print(command)
611 |         getReads = subprocess.Popen(command,stdin = subprocess.PIPE,stderr = subprocess.PIPE,stdout = subprocess.PIPE,shell = True)
612 |         reads = getReads.communicate()
613 |         reads = reads[0].decode("utf-8")
614 |         reads = reads.split('\n')[:-1]
615 |         reads = [read.split('\t') for read in reads]
616 |         if includeJxnReads == False:
617 |             reads = [x for x in reads if x[5].count('N') < 1]
618 | 
619 |         #convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-'}
620 |         convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-','256':'+','272':'-','99':'+','147':'-'}
621 |         
622 |         
623 |         #BJA added 256 and 272, which correspond to 0 and 16 for multi-mapped reads respectively:
624 |         #http://onetipperday.blogspot.com/2012/04/understand-flag-code-of-sam-format.html
625 |         #convert = string.maketrans('160','--+')
626 |         keptReads = []
627 |         seqDict = defaultdict(int)
628 |         if sense == '-':
629 |           strand = ['+','-']
630 |           strand.remove(locus.sense())
631 |           strand = strand[0]
632 |         else:
633 |             strand = locus.sense()
634 |         for read in reads:
635 |             #readStrand = read[1].translate(convert)[0]
636 |             #print read[1], read[0]
637 |             #readStrand = convertDict[read[1]]
638 |             readStrand = convertBitwiseFlag(read[1])
639 | 
640 |             if sense == 'both' or sense == '.' or readStrand == strand:
641 | 
642 |                 if unique and seqDict[read[9]] == 0:
643 |                     keptReads.append(read)
644 |                 elif not unique:
645 |                     keptReads.append(read)
646 |             seqDict[read[9]]+=1
647 |             
648 |         return keptReads
649 | 
650 |     def readsToLoci(self,reads,IDtag = 'sequence,seqID,none'):
651 |         '''
652 |         takes raw read lines from the bam and converts them into loci
653 |         '''
654 |         loci = []
655 |         ID = ''
656 |         if IDtag == 'sequence,seqID,none':
657 |             print('please specify one of the three options: sequence, seqID, none')
658 |             return
659 |         #convert = string.maketrans('160','--+')
660 |         #convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-'}
661 |         #convertDict = {'16':'-','0':'+','64':'+','65':'+','80':'-','81':'-','129':'+','145':'-','256':'+','272':'-'}
662 |         
663 |         #BJA added 256 and 272, which correspond to 0 and 16 for multi-mapped reads respectively:
664 |         #http://onetipperday.blogspot.com/2012/04/understand-flag-code-of-sam-format.html
665 |         #convert = string.maketrans('160','--+')
666 |         numPattern = re.compile('\d*')
667 |         for read in reads:
668 |             chrom = read[2]
669 |             #strand = read[1].translate(convert)[0]
670 |             #strand = convertDict[read[1]]
671 |             strand = convertBitwiseFlag(read[1])
672 |             if IDtag == 'sequence':
673 |                 ID = read[9]
674 |             elif IDtag == 'seqID':
675 |                 ID = read[0]
676 |             else:
677 |                 ID = ''
678 |                 
679 |             length = len(read[9])
680 |             start = int(read[3])
681 |             if read[5].count('N') == 1:
682 |                 #this awful oneliner first finds all of the numbers in the read string
683 |                 #then it filters out the '' and converts them to integers
684 |                 #only works for reads that span one junction
685 |                 
686 |                 [first,gap,second] = [int(x) for x in [x for x in re.findall(numPattern,read[5]) if len(x) > 0]][0:3]
687 |                 if IDtag == 'sequence':
688 |                     loci.append(Locus(chrom,start,start+first,strand,ID[0:first]))
689 |                     loci.append(Locus(chrom,start+first+gap,start+first+gap+second,strand,ID[first:]))
690 |                 else:
691 |                     loci.append(Locus(chrom,start,start+first,strand,ID))
692 |                     loci.append(Locus(chrom,start+first+gap,start+first+gap+second,strand,ID))
693 |             elif read[5].count('N') > 1:
694 |                 continue
695 |             else:
696 |                 loci.append(Locus(chrom,start,start+length,strand,ID))
697 |         return loci
698 | 
699 |     def getReadsLocus(self,locus,sense = 'both',unique = True,IDtag = 'sequence,seqID,none',includeJxnReads = False):
700 |         '''
701 |         gets all of the reads for a given locus
702 |         '''
703 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
704 |         
705 |         loci = self.readsToLoci(reads,IDtag)
706 |             
707 |         return loci
708 | 
709 |     def getReadSequences(self,locus,sense = 'both',unique = True,includeJxnReads = False):
710 | 
711 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
712 | 
713 |         return [read[9] for read in reads]
714 |     
715 |     def getReadStarts(self,locus,sense = 'both',unique = False,includeJxnReads = False):
716 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
717 | 
718 |         return [int(read[3]) for read in reads]
719 |         
720 | 
721 |     def getReadCount(self,locus,sense = 'both',unique = True,includeJxnReads = False):
722 |         reads = self.getRawReads(locus,sense,unique,includeJxnReads)
723 | 
724 |         return len(reads)
725 |                                     
726 | 
727 | 
728 | #==================================================================
729 | #========================MISC FUNCTIONS============================
730 | #==================================================================
731 | 
732 | 
733 | 
734 | 
735 | #uniquify function
736 | #by Peter Bengtsson
737 | #Used under a creative commons license
738 | #sourced from  here: http://www.peterbe.com/plog/uniqifiers-benchmark
739 | 
740 | def uniquify(seq, idfun=None): 
741 |     # order preserving
742 |     if idfun is None:
743 |         def idfun(x): return x
744 |     seen = {}
745 |     result = []
746 |     for item in seq:
747 |         marker = idfun(item)
748 |         # in old Python versions:
749 |         # if seen.has_key(marker)
750 |         # but in new ones:
751 |         if marker in seen: continue
752 |         seen[marker] = 1
753 |         result.append(item)
754 |     return result
755 | 
756 | 
757 | #082009
758 | #taken from http://code.activestate.com/recipes/491268/
759 | 
760 | def order(x, NoneIsLast = True, decreasing = False):
761 |     """
762 |     Returns the ordering of the elements of x. The list
763 |     [ x[j] for j in order(x) ] is a sorted version of x.
764 | 
765 |     Missing values in x are indicated by None. If NoneIsLast is true,
766 |     then missing values are ordered to be at the end.
767 |     Otherwise, they are ordered at the beginning.
768 |     """
769 |     omitNone = False
770 |     if NoneIsLast == None:
771 |         NoneIsLast = True
772 |         omitNone = True
773 |         
774 |     n  = len(x)
775 |     ix = list(range(n))
776 |     if None not in x:
777 |         ix.sort(reverse = decreasing, key = lambda j : x[j])
778 |     else:
779 |         # Handle None values properly.
780 |         def key(i, x = x):
781 |             elem = x[i]
782 |             # Valid values are True or False only.
783 |             if decreasing == NoneIsLast:
784 |                 return not(elem is None), elem
785 |             else:
786 |                 return elem is None, elem
787 |         ix = list(range(n))
788 |         ix.sort(key=key, reverse=decreasing)
789 |             
790 |     if omitNone:
791 |         n = len(x)
792 |         for i in range(n-1, -1, -1):
793 |             if x[ix[i]] == None:
794 |                 n -= 1
795 |         return ix[:n]
796 |     return ix
797 | 


--------------------------------------------------------------------------------