├── example └── chrom_hg19.sizes ├── current_release.log.txt ├── ReadMe.md └── hictrans.v3.R /example/chrom_hg19.sizes: -------------------------------------------------------------------------------- 1 | chr1 249250621 2 | chr2 243199373 3 | chr3 198022430 4 | chr4 191154276 5 | chr5 180915260 6 | chr6 171115067 7 | chr7 159138663 8 | chr8 146364022 9 | chr9 141213431 10 | chr10 135534747 11 | chr11 135006516 12 | chr12 133851895 13 | chr13 115169878 14 | chr14 107349540 15 | chr15 102531392 16 | chr16 90354753 17 | chr17 81195210 18 | chr18 78077248 19 | chr19 59128983 20 | chr20 63025520 21 | chr21 48129895 22 | chr22 51304566 23 | chrX 155270560 24 | chrY 59373566 25 | chrM 16571 26 | -------------------------------------------------------------------------------- /current_release.log.txt: -------------------------------------------------------------------------------- 1 | #08.25.2017 2 | First release 3 | 4 | #09.01.2017 5 | Modified the "run_HiCtrans.pl" script to generate all the chromosome combination translocation result. 6 | Script to generate the genomic feature file is added. 7 | 8 | #04.02.2018 9 | Added "noiseReduction.r" script. This script will calculate the extropy of count 10 | distribution inside the translocation boxes predicted by HiCtrans (*.Translocation.results) 11 | as compared to random boxes. If there is a true translocation than count distribution 12 | will be more random and should have high normalized entropy value as compared to similar 13 | sized random boxes. 14 | 15 | #06.19.2018 16 | Added "juiceboxToHiCtrans_MatrixFormat.r" script which converts juiceboxbox output files into 17 | HiCtrans input files (*.bed and *.matrix files) 18 | 19 | #12/12/2018 20 | Added HiCtransScript.pl which will call and run HiCtrans pipeline. No input.txt file required. 21 | Added translocationREBreakPoint.r script, which will call translocation breakpoints at RE site 22 | resolution. 23 | 24 | #08/23/19 25 | Version 3 released 26 | -------------------------------------------------------------------------------- /ReadMe.md: -------------------------------------------------------------------------------- 1 | # HiCtrans 2 | 3 | This is an updated version of HiCtrans program. HiCtrans can scan inter-chromosomal Hi-C matrix and report translocations, their breakpoints at restriction site or at any lower resolution. 4 | Check the paper https://doi.org/10.1093/bioinformatics/btx664 for details about the method. 5 | 6 | Changes made from previous version: 7 | 8 | 1. No requirement of genome feature file. 9 | 2. Given a starting Hi-C data, HiCtrans now can scan for translocations at different resolutions and report a translocation observed at multiple resolutions. 10 | 3. Users can also check potential translocations at different resolutions. 11 | 4. No requirement of perl and its associted libraries. 12 | 5. Error handling. 13 | 6. Faster. 14 | 15 | Result description: 16 | 17 | A successfull HiCtrans run will generate the following result files and folders 18 | 19 | ```bash 20 | 21 | _hictrans 22 | . 23 | \_hictrans___ 24 | . 25 | \Lower_Resolution_HiC_Data 26 | _._.matrix 27 | _.__abs.bed 28 | .... 29 | . 30 | \Translocations 31 | . 32 | \Details 33 | _._.Details.txt 34 | . .... 35 | \juicebox_files 36 | _._.Translocations_jcbx.txt 37 | .... 38 | _._.preCluster.txt 39 | . 40 | \MultiResolution_supported_Translocations 41 | _._.MultiResolution_Filtered.Translocation.txt 42 | 43 | .matrix 44 | _abs.bed 45 | ._.mat.txt 46 | ._.log.txt 47 | ``` 48 | 49 | NOTE: MultiResolution_supported_Translocations folder is only created when there are such cases. 50 | 51 | Once finished check the _._.MultiResolution_Filtered.Translocation.txt file for possible translocations. 52 | \\_\.\\_\.MultiResolution_Filtered.Translocation.txt provides strong support for the translocation with any anomaly in the inter Hi-C data. 53 | If there is no multi-resolution supported translocations, users can check the \\_\.\\_\.preCluster.txt file 'Translocations' folder. 54 | This file will have all the translocations (BreakPoints and Translocation boxes) found in the chromosomal pair data at different resolutions. 55 | Users can check the highest resolution in the 'resolution column' (lower the value higher the resolution) for further investigation. 56 | The zscore column repersents the enrichment of counts within the box associated with the translocation. The 'count' column is simply the hic count of the 57 | breakpoint detected within the enriched box. Users can ignore the 'id' column. 58 | 59 | For detailed help use the following 60 | 61 | ```bash 62 | 63 | Rscript hictrans.v3.R --help 64 | 65 | Usage: hictrans.v3.R [options] 66 | 67 | Options: 68 | --mat=MAT 69 | An upper triangular Hi-C sparse matrix 70 | It should have the following columns 71 | 72 | 73 | 74 | 1 1 300 75 | 1 2 30 76 | 1 3 10 77 | 2 2 200 78 | 2 3 20 79 | 3 3 200 80 | .... 81 | 82 | 83 | --bed=BED 84 | Bed file with index information 85 | It should have the following columns 86 | 87 | 88 | 89 | 90 | chr1 1 40000 1 91 | chr1 40000 80000 2 92 | chr1 80000 120000 3 93 | .... 94 | 95 | 96 | --chrA=CHRA 97 | Chromosome A name. It will represent the rows in the inter-chromosomal matrix. It should be the chromosome. 98 | 99 | 100 | --chrB=CHRB 101 | Chromosome B name. It will represent the columns in the inter-chromosomal matrix. It should be the chromosome. 102 | 103 | 104 | --prefix=PREFIX 105 | Prefix of the output file . All the output files and folders will be generated with this prefix. 106 | 107 | 108 | --covq=COVQ 109 | Quantile value to be subtracted from one dimensional trans-coverage profile [trans.coverage - quantile(trans.coverage, covq)] [default 0.10]. 110 | 111 | Bins with very low coverage values are removed with this filter. 112 | Increasing value will keep only the most stringent bins in the two chromosome. 113 | 114 | 115 | --minzscore=MINZSCORE 116 | Minimum Zscore of a possible translocation box to be retained [default is 1]. 117 | 118 | HiCtrans will find enriched boxes within the inter-chromosomal matrix as potential translocation box. 119 | The enrichment is calculated as Z-score against a background with all possible similar sized boxes in the inter-chromosomal matrix. 120 | Increasing value will keep the most enriched trans interacting boxes. 121 | 122 | 123 | --minboxsize=MINBOXSIZE 124 | Minimum size of a possible translocation box relative to its Hi-C resolution [default is 0 i.e. no filtering. If set to non-zero value, then (Breakpoint.start - Breakpoint.end)/HiC.resolution > minboxsize filtering will be applied]. 125 | 126 | HiCtrans will find enriched boxes within the inter-chromosomal matrix as potential translocation box. 127 | The minimum box size threshold will filter out small false positive multi-resolution supported potential translocation boxes. 128 | 129 | Increasing value will keep the most enriched and larger trans interacting boxes. 130 | 131 | 132 | --boxzscore=BOXZSCORE 133 | Minimum Zscore of a possible translocation box to be retained [default is 1]. 134 | 135 | HiCtrans will keep boxes enriched above boxzscore threshold to find translocations among them. 136 | Increasing value will keep the most enriched trans interacting boxes. 137 | 138 | 139 | --locq=LOCQ 140 | Top percentile to be reported as possible breakpoints within a translocation box [default top 0.1%] 141 | 142 | For each enriched translocated boxes, HiCtrans will report top % interacting pairs (Weighted by the frequency of total interaction). 143 | Decreasing will reduce the number of reported breakpoints within an enriched trans interacting box. A value of 0 will report only the top interacting pair 144 | 145 | 146 | --mincount=MINCOUNT 147 | Minimum count of a possible breakpoint to be retained when compared to all possible chrA-chrB interaction [default cutoff is 10, This is based on 10Kb HiC matrix]. 148 | 149 | This is an absolute minimum count cutoff to filter out breakpoints detected at any resolution. 150 | Increasing value will keep only the most stringent interacting pair. 151 | 152 | 153 | --glbq=GLBQ 154 | Percentile value for minimum count cutoff at each resolution [default cutoff is at top 0.1% of the count distribution]. 155 | 156 | This is a relative count cutoff based on the inter-chromosomal count distribution determined for each resolution independently. 157 | Increasing value keep only the most stringent interacting pair. 158 | 159 | 160 | --resolutions=RESOLUTIONS 161 | Comma separated list of integers to be multiplied with the starting Hi-C resolution to get the lower resolutions [default 2,4,5,10]. 162 | 163 | HiCtrans will search for enriched trans-interacting boxes and breakpoint finding at different resolutions. 164 | Provide only integer values in a comma separated list. 165 | 166 | 167 | --multires=MULTIRES 168 | Number of Hi-C resolutions at which the breakpoint should be supported with [default is at least 2 different resolutions]. 169 | 170 | HiCtrans wiil find enriched trans-interacting boxes and subsequent breakpoints at different resolutions. The ultimate goal is to find a true translocation supported by multiple resolutions. 171 | Increasing value will keep only the enriched boxes and breakpoints supported by at least number of different resolution. 172 | 173 | 174 | --maxres=MAXRES 175 | Maximum resolution upto which the breakpoint is kept after multi-resolution filtering [default is 3 X user provided Hi-C resolution]. 176 | 177 | HiCtrans wiil find enriched trans-interacting boxes and subsequent breakpoints at different resolutions. The ultimate goal is to find a true translocation supported by the highest resolutions. 178 | Increasing value will keep only the enriched boxes and breakpoints supported by upto X starting Hi-C resolutions. 179 | 180 | 181 | --relevel=RELEVEL 182 | Should the breakpoints be refined upto restriction-level resolution [default is 'No'; If 'Yes', the following parameters are MUST] 183 | 184 | 185 | --fragsFile=FRAGSFILE 186 | Restriction Fragment file [MUST]. 187 | 188 | chr1 0 16007 HIC_chr1_1 0 + 189 | chr1 16007 24571 HIC_chr1_2 0 + 190 | chr1 24571 27981 HIC_chr1_3 0 + 191 | ...... 192 | 193 | 194 | --chromsize=CHROMSIZE 195 | Chromosome size file [MUST]. 196 | 197 | chr1 249250621 198 | chr2 243199373 199 | chr3 198022430 200 | chr4 191154276 201 | ..... 202 | 203 | 204 | --validpair=VALIDPAIR 205 | Valid pair file of the HiC data [MUST]. 206 | 207 | SRR6213722.1 chr11 124331538 - chr11 124345246 - 208 | SRR6213722.2 chr1 198436365 - chr1 199923196 + 209 | ..... 210 | 211 | 212 | --clusdist=CLUSDIST 213 | Distance threshold in basepairs to cluster the nearby breakpoints obtained from multi-resolution filtered (MultiResolution_Filtered.Translocation.txt) or individual Translocations_jcbx.txt files [Default 1Mb] 214 | 215 | --ssA=SSA 216 | Extend -(ve) bp of the 5' HMM segment border of chromosome A for breakpoint identification. Default 100Kb. 217 | 218 | --seA=SEA 219 | Extend +(ve) bp of the 3' HMM segment border of chromosome A for breakpoint identification. Default 100Kb. 220 | 221 | --ssB=SSB 222 | Extend -(ve) bp of the 5' HMM segment border of chromosome B for breakpoint identification. Default 100Kb. 223 | 224 | --seB=SEB 225 | Extend +(ve) bp of the 3' HMM segment border of chromosome B for breakpoint identification. Default 100Kb. 226 | 227 | 228 | --precheck=PRECHECK 229 | Precheck option will help to restrict HiCtrans search only to chromosome combinations with significant max interaction compared to mean non-zero count value [Default 1. Lower value will increase stringency] 230 | 231 | -h, --help 232 | Show this help message and exit 233 | 234 | ``` 235 | 236 | Users need to run each chromosome pair independently. This is a helper function to generate all the combination of chromosomal pairs and run hictrans.R 237 | HiCtrans can be run with or without the restriction level validpair information file. 238 | 239 | If you don't have the validpair file, please use the following command 240 | 241 | ```bash 242 | perl -e '@F=`cat $ARGV[0]`; for($i=0; $i<$#F; $i++){chomp $F[$i]; for($j=$i+1; $j<=$#F; $j++){chomp $F[$j]; print "Rscript hictrans.v3.R --mat $ARGV[1] --bed $ARGV[2] --chrA $F[$i] --chrB $F[$j] --prefix $ARGV[3] --resolutions 2,3,4,5,6,8,10 --covq 0.1 --chromsize chrom_hg19.sizes\n";}}' 243 | ``` 244 | 245 | Example 246 | 247 | ```bash 248 | Rscript ../hictrans.v3.R --mat T47D_20Kb_chr10_chr20.matrix --bed T47D_20Kb_chr10_chr20_abs.bed --chrA chr10 --chrB chr20 --prefix T47D_20Kb_chr10_chr20 --resolutions 2,3,4,5,6,8,10 --covq 0.1 --chromsize chrom_hg19.sizes 249 | ``` 250 | 251 | If you have the validpair file, please use the following 252 | 253 | ```bash 254 | perl -e '@F=`cat $ARGV[0]`; for($i=0; $i<$#F; $i++){chomp $F[$i]; for($j=$i+1; $j<=$#F; $j++){chomp $F[$j]; print "Rscript ../HiCtrans/hictrans.v3.R --mat $ARGV[1] --bed $ARGV[2] --chrA $F[$i] --chrB $F[$j] --prefix $ARGV[3] --resolutions 2,3,4,5,6,8,10 --covq 0.1 --relevel YES --fragsFile Resfrag_hg19.bed --validpair $ARGV[4] --chromsize chrom_hg19.sizes --precheck 1e-5\n";}}' 255 | ``` 256 | 257 | Example (Validpair files are not provided in the example folder, but we assume the data by default is processed with HiC-pro and validpair file is available) 258 | 259 | ```bash 260 | Rscript ../hictrans.v3.R --mat T47D_20Kb_chr10_chr20.matrix --bed T47D_20Kb_chr10_chr20_abs.bed --chrA chr10 --chrB chr20 --prefix T47D_20Kb_chr10_chr20 --resolutions 2,3,4,5,6,8,10 --covq 0.1 --relevel YES --fragsFile HindIII_resfrag_hg19.bed --validpair T47D_validpair.txt --chromsize chrom_hg19.sizes --precheck 1e-5 261 | ``` 262 | 263 | 264 | Here, chrom_name.file is a signle column file with chromsome names; matrix and bed files are names of the Hi-C sparse matrix and the associated bed files. 265 | To generate the sparse matrix use the 'build_matrix.cpp' file (compile this program by running 'g++ build_matrix.cpp -o build_matrix' in your command prompt). 266 | For details of the program check the https://github.com/nservant/HiC-Pro repository. 267 | The input to the build_matrix program is a validpair file described in the help section. 268 | 269 | If you are staring with HiCUP, then use hicup_filter to create valid Hi-C read pairs (generally ends with a name filt.bam or filt.sam). 270 | Then use the following command to generate a validpair file from the filt.bam file 271 | 272 | ```bash 273 | samtools view filt.bam| awk -v OFS='\t' '{print $1,$3,$4,"+"}' |paste - - |awk -v OFS='\t' '{print $1,$2,$3,$4,$6,$7,$8}' > hictrans.validpair 274 | ``` 275 | 276 | ### Note 277 | 1. HiCtrans requires 10Kb resolution Hi-C matrix or multiple of 10Kb matrix to begin with 278 | 2. The bam file should be sorted based on read name. 279 | 280 | # Filtering the result 281 | HiCtrans recommends filtering the black listed regions (+/- 100Kb) from the output result. 282 | These regions tends to produce higher interactions and can artificially appear as translocations. 283 | Cheack the black list here for different genomes 284 | https://sites.google.com/site/anshulkundaje/projects/blacklists 285 | 286 | # R library requirements: 287 | data.table, hashmap, changepoint, hashmap, 288 | optparse, Rcpp, caTools, depmixS4 DEoptimR 289 | 290 | # For troubleshoot: 291 | Abhijit Chakraborty (abhijit@lji.org) 292 | -------------------------------------------------------------------------------- /hictrans.v3.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | library(hashmap) 3 | suppressMessages(library(changepoint)) 4 | library(hashmap) 5 | library(optparse) 6 | library(Rcpp) 7 | library(caTools) 8 | library(depmixS4) 9 | library(DEoptimR) 10 | 11 | 12 | ## Change the path of build_matrix program when running from different folder## 13 | buildMatrixPath <- "/home/abhijit/overflow/proj_overflow/HiCProcessingPipeLine/Cancer_Cell_Lines/HiCtrans_Calls/HiCtrans/build_matrix" 14 | buildMatrixPath <- normalizePath(buildMatrixPath) 15 | 16 | ############### Get breakpoint calls at restriction level resolution ############# 17 | 18 | ## Cluster the breakpoints 19 | RE_HClust <- function(box.df,cl.A,cl.B) { 20 | 21 | box.df <- box.df[,c(1:6)] 22 | colnames(box.df) <- c("chrA","BoundaryAS","BoundaryAE","chrB","BoundaryBS","BoundaryBE") 23 | chrA.dist <- matrix(ncol=nrow(box.df),nrow=nrow(box.df)) 24 | chrB.dist <- matrix(ncol=nrow(box.df),nrow=nrow(box.df)) 25 | 26 | i <- 1 27 | while (i <= nrow(box.df)) { 28 | chrA.AS_i <- box.df$BoundaryAS[i] 29 | chrA.AE_i <- box.df$BoundaryAE[i] 30 | chrB.BS_i <- box.df$BoundaryBS[i] 31 | chrB.BE_i <- box.df$BoundaryBE[i] 32 | j <- 1 33 | while (j <= nrow(box.df)) { 34 | chrA.AS_j <- box.df$BoundaryAS[j] 35 | chrA.AE_j <- box.df$BoundaryAE[j] 36 | chrB.BS_j <- box.df$BoundaryBS[j] 37 | chrB.BE_j <- box.df$BoundaryBE[j] 38 | chrA.AS_AS.ij <- abs(chrA.AS_i - chrA.AS_j) 39 | chrA.AE_AS.ij <- abs(chrA.AE_i - chrA.AS_j) 40 | chrA.AS_AE.ij <- abs(chrA.AS_i - chrA.AE_j) 41 | chrA.AE_AE.ij <- abs(chrA.AE_i - chrA.AE_j) 42 | chrB.BS_BS.ij <- abs(chrB.BS_i - chrB.BS_j) 43 | chrB.BE_BS.ij <- abs(chrB.BE_i - chrB.BS_j) 44 | chrB.BS_BE.ij <- abs(chrB.BS_i - chrB.BE_j) 45 | chrB.BE_BE.ij <- abs(chrB.BE_i - chrB.BE_j) 46 | chrA.dist[i,j] <- min(chrA.AS_AS.ij,chrA.AE_AS.ij,chrA.AS_AE.ij,chrA.AE_AE.ij) 47 | chrB.dist[i,j] <- min(chrB.BS_BS.ij,chrB.BE_BS.ij,chrB.BS_BE.ij,chrB.BE_BE.ij) 48 | j <- j + 1 49 | } 50 | i <- i + 1 51 | } 52 | 53 | if (nrow(chrA.dist) > 1 & nrow(chrB.dist) > 1) { 54 | chrA.hclust <- hclust(as.dist(chrA.dist),method="single") 55 | chrB.hclust <- hclust(as.dist(chrB.dist),method="single") 56 | chrA.clust <- cutree(chrA.hclust,h=cl.A) 57 | chrB.clust <- cutree(chrB.hclust,h=cl.B) 58 | box.df[,"chrA.clus"] <- chrA.clust 59 | box.df[,"chrB.clus"] <- chrB.clust 60 | } else { 61 | chrA.clust <- 1 62 | chrB.clust <- 1 63 | box.df[,"chrA.clus"] <- chrA.clust 64 | box.df[,"chrB.clus"] <- chrB.clust 65 | } 66 | clus <- unique(data.frame(chrA.clust,chrB.clust)) 67 | box.cl <- list() 68 | i <- 1 69 | while (i <= nrow(clus)) { 70 | d <- box.df[box.df$chrA.clus==clus$chrA.clust[i] & box.df$chrB.clus==clus$chrB.clust[i],] 71 | box.cl[[i]] <- data.frame(chrA=d$chrA[1],BoundaryAS=min(d$BoundaryAS),BoundaryAE=max(d$BoundaryAE), 72 | chrB=d$chrB[1],BoundaryBS=min(d$BoundaryBS),BoundaryBE=max(d$BoundaryBE)) 73 | i <- i + 1 74 | } 75 | box.cl <- as.data.frame(do.call(rbind,box.cl)) 76 | colnames(box.cl) <- c("chrA","BoundaryAS","BoundaryAE","chrB","BoundaryBS","BoundaryBE") 77 | return(box.cl) 78 | } 79 | 80 | #Process the valid pair file 81 | mapVPs_on_REs <- function(re.bed,chrom.size,vp.file,prefix,chromA,chromB,id){ 82 | 83 | system(paste0("grep -w -e ",chromA," -e ",chromB," ",re.bed," > ",prefix,"_",chromA,"_",chromB,"_REfrags.",id,".bed"),wait=T) 84 | system(paste0("grep -w -e ",chromA," -e ",chromB," ",chrom.size," > ",prefix,"_",chromA,"_",chromB,"_chromSizes.",id,".txt"),wait=T) 85 | cmd <- paste0(buildMatrixPath," --binfile ",prefix,"_",chromA,"_",chromB,"_REfrags.",id,".bed --chrsizes ",prefix,"_",chromA,"_",chromB,"_chromSizes.",id,".txt --ifile ",vp.file," --oprefix ",prefix,"_",chromA,"_",chromB,"_",id," --matrix-format complete") 86 | system(cmd,wait=T) 87 | bed <- as.data.frame(fread(as.character(paste0(prefix,"_",chromA,"_",chromB,"_",id,"_abs.bed")),header=F)) 88 | mat <- as.data.frame(fread(as.character(paste0(prefix,"_",chromA,"_",chromB,"_",id,".matrix")),header=F)) 89 | colnames(bed) <- c("chrom","start","end","index") 90 | colnames(mat) <- c("indexA","indexB","count") 91 | bed_chrom.hash <- hashmap(bed$index,as.character(bed$chrom)) 92 | bed_start.hash <- hashmap(bed$index,bed$start) 93 | bed_end.hash <- hashmap(bed$index,bed$end) 94 | mat.df <- data.frame( 95 | chromA = bed_chrom.hash[[mat$indexA]], 96 | startA = bed_start.hash[[mat$indexA]], 97 | endA = bed_end.hash[[mat$indexA]], 98 | chromB = bed_chrom.hash[[mat$indexB]], 99 | startB = bed_start.hash[[mat$indexB]], 100 | endB = bed_end.hash[[mat$indexB]], 101 | count = mat$count 102 | ) 103 | write.table(mat.df,file=paste0(prefix,"_",chromA,"_",chromB,"_",id,".txt"),row.names=F,quote=F,sep="\t") 104 | return(list(mat.df=mat.df,bed=bed,chromA=chromA,chromB=chromB)) 105 | } 106 | 107 | #Calculate the Cis and trans coverage values 108 | cis_trans_coverage <- function(files,prefix,chrA,chrB,id){ 109 | 110 | df <- files$mat.df 111 | df_cis_chromA <- df[with(df,chromA==as.character(chrA) & chromB==as.character(chrA)),] 112 | df_cis_chromB <- df[with(df,chromA==as.character(chrB) & chromB==as.character(chrB)),] 113 | df_trans_chromA <- df[with(df,chromA==as.character(chrA) & chromB==as.character(chrB)),] 114 | df_trans_chromB <- df[with(df,chromA==as.character(chrB) & chromB==as.character(chrA)),] 115 | df_cis_chromA.up <- df_cis_chromA[with(df_cis_chromA,startA > endB),] 116 | df_cis_chromB.up <- df_cis_chromB[with(df_cis_chromB,startA > endB),] 117 | df_cis_chromA.dw <- df_cis_chromA[with(df_cis_chromA,endA < startB),] 118 | df_cis_chromB.dw <- df_cis_chromB[with(df_cis_chromB,endA < startB),] 119 | 120 | cis_chromA.up <- aggregate(with(df_cis_chromA.up,count ~ startA),FUN = sum) 121 | cis_chromA.dw <- aggregate(with(df_cis_chromA.dw,count ~ startA),FUN = sum) 122 | cis_chromB.up <- aggregate(with(df_cis_chromB.up,count ~ startA),FUN = sum) 123 | cis_chromB.dw <- aggregate(with(df_cis_chromB.dw,count ~ startA),FUN = sum) 124 | trans_chromA <- aggregate(with(df_trans_chromA,count ~ startA),FUN = sum) 125 | trans_chromB <- aggregate(with(df_trans_chromB,count ~ startA),FUN = sum) 126 | 127 | cis_chromA.up.hash <- hashmap(cis_chromA.up$startA,cis_chromA.up$count) 128 | cis_chromB.up.hash <- hashmap(cis_chromB.up$startA,cis_chromB.up$count) 129 | cis_chromA.dw.hash <- hashmap(cis_chromA.dw$startA,cis_chromA.dw$count) 130 | cis_chromB.dw.hash <- hashmap(cis_chromB.dw$startA,cis_chromB.dw$count) 131 | trans_chromA.hash <- hashmap(trans_chromA$startA,trans_chromA$count) 132 | trans_chromB.hash <- hashmap(trans_chromB$startA,trans_chromB$count) 133 | 134 | df <- files$bed 135 | df_chromA <- df[with(df,chrom==as.character(chrA)),] 136 | df_chromB <- df[with(df,chrom==as.character(chrB)),] 137 | df_chromA[,"cis.up"] <- cis_chromA.up.hash[[df_chromA$start]] 138 | df_chromA[,"cis.dw"] <- cis_chromA.dw.hash[[df_chromA$start]] 139 | df_chromB[,"cis.up"] <- cis_chromB.up.hash[[df_chromB$start]] 140 | df_chromB[,"cis.dw"] <- cis_chromB.dw.hash[[df_chromB$start]] 141 | df_chromA[,"trans"] <- trans_chromA.hash[[df_chromA$start]] 142 | df_chromB[,"trans"] <- trans_chromB.hash[[df_chromB$start]] 143 | 144 | df_chromA$cis.up[is.na(df_chromA$cis.up)] <- 0 145 | df_chromA$cis.dw[is.na(df_chromA$cis.dw)] <- 0 146 | df_chromB$cis.up[is.na(df_chromB$cis.up)] <- 0 147 | df_chromB$cis.dw[is.na(df_chromB$cis.dw)] <- 0 148 | df_chromA$trans[is.na(df_chromA$trans)] <- 0 149 | df_chromB$trans[is.na(df_chromB$trans)] <- 0 150 | 151 | df = rbind(df_chromA,df_chromB) 152 | write.table(df,file=paste0(prefix,"_",chrA,"_",chrB,"_",id,"_Coverage.bed"),col.names=F,row.names=F,quote=F,sep="\t") 153 | } 154 | 155 | 156 | #Calculate cis and trans coverage and then calculate directionality index. Followed by HMM segmentation (2 states) 157 | di_and_hmm <- function(coverage,chromA,chromA.start,chromA.end,chromB,chromB.start,chromB.end,prefix,id){ 158 | 159 | bed <- as.data.frame(fread(as.character(coverage),header=F)) 160 | colnames(bed) <- c("chrom","start","end","index","up","dw","trans") 161 | bed$up[bed$up == 0] <- rbinom(length(bed$up[bed$up == 0]), 3, 0.75) 162 | bed$dw[bed$dw == 0] <- rbinom(length(bed$dw[bed$dw == 0]), 3, 0.75) 163 | bed$trans[bed$trans == 0] <- rbinom(length(bed$trans[bed$trans == 0]), 3, 0.75) 164 | 165 | chromA_span <- chromA.end-chromA.start 166 | chromB_span <- chromB.end-chromB.start 167 | if (chromA_span < 1e6){ 168 | chromA_span <- 1e6-chromA_span 169 | chromA_span <- floor(chromA_span/2) 170 | chromA.start <- chromA.start-chromA_span 171 | chromA.end <- chromA.end+chromA_span 172 | } 173 | if (chromB_span < 1e6){ 174 | chromB_span <- 1e6-chromB_span 175 | chromB_span <- floor(chromB_span/2)+1 176 | chromB.start <- chromB.start-chromB_span 177 | chromB.end <- chromB.end+chromB_span 178 | } 179 | 180 | bed_chromA <- bed[with(bed,chrom==chromA),] 181 | bed_chromB <- bed[with(bed,chrom==chromB),] 182 | bed_chromA_cis_di <- list() 183 | bed_chromA_trans_di <- list() 184 | i <- 1 185 | while (i <= nrow(bed_chromA)){ 186 | up <- bed_chromA$up[i] 187 | dw <- bed_chromA$dw[i] 188 | trans <- bed_chromA$trans[i] 189 | bed_chromA_cis_di[[i]] <- (dw-up)/abs(dw-up)*(((up-mean(c(dw,up)))^2/mean(c(dw,up)))+((dw-mean(c(dw,up)))^2/mean(c(dw,up)))) 190 | bed_chromA_trans_di[[i]] <- ((trans-mean(bed_chromA$trans))/abs(trans-mean(bed_chromA$trans)))*((trans-mean(bed_chromA$trans))^2/mean(bed_chromA$trans)) 191 | i <- i+1 192 | } 193 | 194 | smooth <- 5 195 | bed_chromA_cis_di <- unlist(bed_chromA_cis_di) 196 | bed_chromA_trans_di <- unlist(bed_chromA_trans_di) 197 | bed_chromA_cis_di[is.na(bed_chromA_cis_di)] <- 0 198 | bed_chromA_trans_di[is.na(bed_chromA_trans_di)] <- 0 199 | bed_chromA_cis_di <- runmean(bed_chromA_cis_di,k=smooth,endrule="mean") 200 | bed_chromA_trans_di <- runmean(bed_chromA_trans_di,k=smooth,endrule="mean") 201 | bed_chromA <- as.data.frame(cbind(bed_chromA,bed_chromA_cis_di,bed_chromA_trans_di)) 202 | bed_chromA <- bed_chromA[with(bed_chromA,start >= chromA.start & end <= chromA.end),] 203 | 204 | bed_chromB_cis_di <- list() 205 | bed_chromB_trans_di <- list() 206 | i <- 1 207 | while (i <= nrow(bed_chromB)){ 208 | up <- bed_chromB$up[i] 209 | dw <- bed_chromB$dw[i] 210 | trans <- bed_chromB$trans[i] 211 | bed_chromB_cis_di[[i]] <- (dw-up)/abs(dw-up)*(((up-mean(c(dw,up)))^2/mean(c(dw,up)))+((dw-mean(c(dw,up)))^2/mean(c(dw,up)))) 212 | bed_chromB_trans_di[[i]] <- ((trans-mean(bed_chromB$trans))/abs(trans-mean(bed_chromB$trans)))*((trans-mean(bed_chromB$trans))^2/mean(bed_chromB$trans)) 213 | i <- i+1 214 | } 215 | 216 | bed_chromB_cis_di <- unlist(bed_chromB_cis_di) 217 | bed_chromB_trans_di <- unlist(bed_chromB_trans_di) 218 | bed_chromB_cis_di[is.na(bed_chromB_cis_di)] <- 0 219 | bed_chromB_trans_di[is.na(bed_chromB_trans_di)] <- 0 220 | bed_chromB_cis_di <- runmean(bed_chromB_cis_di,k=smooth,endrule="mean") 221 | bed_chromB_trans_di <- runmean(bed_chromB_trans_di,k=smooth,endrule="mean") 222 | bed_chromB <- as.data.frame(cbind(bed_chromB,bed_chromB_cis_di,bed_chromB_trans_di)) 223 | bed_chromB <- bed_chromB[with(bed_chromB,start >= chromB.start & end <= chromB.end),] 224 | 225 | nstates <- 2 226 | chromA.hmm <- depmix(response=list(bed_chromA_cis_di~1,bed_chromA_trans_di~1),data=bed_chromA, nstates=nstates,family=list(gaussian(),gaussian()),ntimes=nrow(bed_chromA)) 227 | chromA.hmm.fit <- fit(chromA.hmm) 228 | chromA.hmm.states <- posterior(chromA.hmm.fit)$state 229 | x <- summary(chromA.hmm.fit) 230 | chromA.trans.state <- which.max(as.vector(x[1:nstates,3])) 231 | cis.hash <- hashmap(c(1:nstates),as.vector(x[1:nstates,1])) 232 | trans.hash <- hashmap(c(1:nstates),as.vector(x[1:nstates,3])) 233 | chromA_cis_DI <- as.vector(x[chromA.trans.state,1]) 234 | bed_chromA <- cbind(bed_chromA,chromA.hmm.states,cis.mean=cis.hash[[chromA.hmm.states]],trans.mean=trans.hash[[chromA.hmm.states]]) 235 | chrA.hmmBoundary.start <- min(bed_chromA[bed_chromA$chromA.hmm.states==chromA.trans.state,]$start) 236 | chrA.hmmBoundary.end <- max(bed_chromA[bed_chromA$chromA.hmm.states==chromA.trans.state,]$end) 237 | write.table(bed_chromA,file=paste0(prefix,".",chromA,"_",id,".hmm.states.bed"),col.names=T,row.names=F,sep="\t",quote=F) 238 | 239 | chromB.hmm <- depmix(response=list(bed_chromB_cis_di~1,bed_chromB_trans_di~1),data=bed_chromB, nstates=nstates,family=list(gaussian(),gaussian()),ntimes=nrow(bed_chromB)) 240 | chromB.hmm.fit <- fit(chromB.hmm) 241 | chromB.hmm.states <- posterior(chromB.hmm.fit)$state 242 | x <- summary(chromB.hmm.fit) 243 | chromB.trans.state <- which.max(as.vector(x[1:nstates,3])) 244 | cis.hash <- hashmap(c(1:nstates),as.vector(x[1:nstates,1])) 245 | trans.hash <- hashmap(c(1:nstates),as.vector(x[1:nstates,3])) 246 | chromB_cis_DI <- as.vector(x[chromB.trans.state,1]) 247 | bed_chromB <- cbind(bed_chromB,chromB.hmm.states,cis.mean=cis.hash[[chromB.hmm.states]],trans.mean=trans.hash[[chromB.hmm.states]]) 248 | chrB.hmmBoundary.start <- min(bed_chromB[bed_chromB$chromB.hmm.states==chromB.trans.state,]$start) 249 | chrB.hmmBoundary.end <- max(bed_chromB[bed_chromB$chromB.hmm.states==chromB.trans.state,]$end) 250 | write.table(bed_chromB,file=paste0(prefix,".",chromB,"_",id,".hmm.states.bed"),col.names=T,row.names=F,sep="\t",quote=F) 251 | return( 252 | list( 253 | chrA.hmmBoundary.start=chrA.hmmBoundary.start, 254 | chrA.hmmBoundary.end=chrA.hmmBoundary.end, 255 | chrB.hmmBoundary.start=chrB.hmmBoundary.start, 256 | chrB.hmmBoundary.end=chrB.hmmBoundary.end, 257 | chromA_cis_DI=chromA_cis_DI, 258 | chromB_cis_DI=chromB_cis_DI 259 | ) 260 | ) 261 | } 262 | 263 | #Optimize the HMM segment border (+/-bp) to find the breakpoint using DE optimization 264 | optimization_func <- function(boundary,df,as,bs,ae,be,bk,chromA_cis_DI,chromB_cis_DI){ 265 | 266 | chromA_span <- ae-as 267 | chromB_span <- be-bs 268 | chromA_pos <- boundary[1] 269 | chromB_pos <- boundary[2] 270 | chromA_up <- floor(chromA_pos-(chromA_span/2)) 271 | chromA_dw <- floor(chromA_pos+(chromA_span/2)) 272 | chromB_up <- floor(chromB_pos-(chromB_span/2)) 273 | chromB_dw <- floor(chromB_pos+(chromB_span/2)) 274 | chromA_dw.chromB_up <- sum(df[df$startA >= chromA_pos & df$endA <= chromA_dw & df$startB >= chromB_up & df$endB <= chromB_pos,]$count) 275 | chromA_up.chromB_dw <- sum(df[df$startA >= chromA_up & df$endA <= chromA_pos & df$startB >= chromB_pos & df$endB <= chromB_dw,]$count) 276 | chromAB_up <- sum(df[df$startA >= chromA_up & df$endA <= chromA_pos & df$startB >= chromB_up & df$endB <= chromB_pos,]$count) 277 | chromAB_dw <- sum(df[df$startA >= chromA_pos & df$endA <= chromA_dw & df$startB >= chromB_pos & df$endB <= chromB_dw,]$count) 278 | 279 | if ((chromA_cis_DI > 0 & chromB_cis_DI < 0) | (chromA_cis_DI < 0 & chromB_cis_DI > 0)){ 280 | if (chromA_cis_DI > 0 & chromB_cis_DI < 0) { 281 | if (bk == 1){ 282 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_dw 283 | } else if (bk == 2){ 284 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_up 285 | } else if (bk == 3){ 286 | value <- (chromAB_up+chromAB_dw)-chromA_dw.chromB_up 287 | } else if (bk == 4){ 288 | value <- (chromAB_up+chromAB_dw)-chromA_up.chromB_dw 289 | } 290 | } else if (chromA_cis_DI < 0 & chromB_cis_DI > 0){ 291 | if (bk == 1){ 292 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_dw 293 | } else if (bk == 2){ 294 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_up 295 | } else if (bk == 3){ 296 | value <- (chromAB_up+chromAB_dw)-chromA_dw.chromB_up 297 | } else if (bk == 4){ 298 | value <- (chromAB_up+chromAB_dw)-chromA_up.chromB_dw 299 | } 300 | } 301 | } else if ((chromA_cis_DI > 0 & chromB_cis_DI > 0) | (chromA_cis_DI < 0 & chromB_cis_DI < 0)) { 302 | if (chromA_cis_DI > 0 & chromB_cis_DI > 0){ 303 | if (bk == 1){ 304 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_dw 305 | } else if (bk == 2){ 306 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_up 307 | } else if (bk == 3){ 308 | value <- (chromAB_up+chromAB_dw)-chromA_dw.chromB_up 309 | } else if (bk == 4){ 310 | value <- (chromAB_up+chromAB_dw)-chromA_up.chromB_dw 311 | } 312 | } else if (chromA_cis_DI < 0 & chromB_cis_DI < 0){ 313 | if (bk == 1){ 314 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_dw 315 | } else if (bk == 2){ 316 | value <- (chromA_dw.chromB_up+chromA_up.chromB_dw)-chromAB_up 317 | } else if (bk == 3){ 318 | value <- (chromAB_up+chromAB_dw)-chromA_dw.chromB_up 319 | } else if (bk == 4){ 320 | value <- (chromAB_up+chromAB_dw)-chromA_up.chromB_dw 321 | } 322 | } 323 | } 324 | value 325 | } 326 | 327 | bp_optimization <- function(intr.file,boundary,chromA,chromB,ssA,seA,ssB,seB){ 328 | 329 | intr.file <- as.data.frame(fread(as.character(intr.file),header=T)) 330 | intr.file <- intr.file[intr.file$chromA == as.character(chromA) & intr.file$chromB == as.character(chromB),] 331 | 332 | as <- boundary$chrA.hmmBoundary.start-ssA 333 | bs <- boundary$chrB.hmmBoundary.start-ssB 334 | ae <- boundary$chrA.hmmBoundary.end+seA 335 | be <- boundary$chrB.hmmBoundary.end+seB 336 | chromA_span <- ae-as 337 | chromB_span <- be-bs 338 | 339 | upup <- sum(intr.file[with(intr.file,startA >= as & endA <= as+floor(chromA_span/2) & startB >= bs & endB <= bs+floor(chromB_span/2)),]$count) 340 | dwdw <- sum(intr.file[with(intr.file,startA >= as+floor(chromA_span/2) & endA <= ae & startB >= bs+floor(chromB_span/2) & endB <= be),]$count) 341 | updw <- sum(intr.file[with(intr.file,startA >= as & endA <= as+floor(chromA_span/2) & startB >= bs+floor(chromB_span/2) & endB <= be),]$count) 342 | dwup <- sum(intr.file[with(intr.file,startA >= as+floor(chromA_span/2) & endA <= ae & startB >= bs & endB <= bs+floor(chromB_span/2)),]$count) 343 | mx <- which.max(c(upup,dwdw,updw,dwup)) 344 | chromA_span <- ae-as 345 | chromB_span <- be-bs 346 | if (chromA_span < 1e6){ 347 | chromA_span <- 1e6-chromA_span 348 | chromA_span <- floor(chromA_span/2) 349 | as <- as-chromA_span 350 | ae <- ae+chromA_span 351 | } 352 | if (chromB_span < 1e6){ 353 | chromB_span <- 1e6-chromB_span 354 | chromB_span <- floor(chromB_span/2)+1 355 | bs <- bs-chromB_span 356 | be <- be+chromB_span 357 | } 358 | 359 | #Optimization parameters 360 | JDEoptim( 361 | lower=c(as,bs), 362 | upper=c(ae,be), 363 | fn=optimization_func, 364 | df=intr.file, 365 | compare_to="median", 366 | maxiter=1000, 367 | NP=100, 368 | fnscale=1e-5, 369 | tau_pF=0.1, 370 | as=as, 371 | bs=bs, 372 | ae=ae, 373 | be=be, 374 | bk=mx, 375 | chromA_cis_DI = as.numeric(boundary$chromA_cis_DI), 376 | chromB_cis_DI = as.numeric(boundary$chromB_cis_DI), 377 | trace=TRUE 378 | ) 379 | } 380 | 381 | 382 | ############### HiCtrans calling on the binned Hi-C data ######################### 383 | 384 | ## This function is called from HClust to calculate the overlap between the detected segments ## 385 | OverLap <- function(x,y,u,v) { 386 | if (y < u) { 387 | return(1) 388 | } else if (v < x) { 389 | return(1) 390 | } else if (y > u & y <= v) { 391 | return(0) 392 | } else if (x > u & x <= v) { 393 | return(0) 394 | } else if (u > x & v <= y) { 395 | return(0) 396 | } else if (v > x & u <= y) { 397 | return(0) 398 | } else if (x >= u & y < v) { 399 | return(0) 400 | } else if (u >= x & v < y) { 401 | return(0) 402 | } 403 | } 404 | 405 | ## Given distances, this function can cluster a 2D interacting segments ## 406 | HClust <- function(box.df,cl.A,cl.B,colNameA,colNameB,step=0) { 407 | 408 | if (step==0) { 409 | box.df <- box.df[,c(1:6,8,9,10,12)] 410 | colnames(box.df) <- c("chrA","BoundaryAS","BoundaryAE","chrB","BoundaryBS","BoundaryBE","zscore","count","id","resolution") 411 | } 412 | chrA.dist <- matrix(ncol=nrow(box.df),nrow=nrow(box.df)) 413 | chrB.dist <- matrix(ncol=nrow(box.df),nrow=nrow(box.df)) 414 | 415 | i <- 1 416 | while (i <= nrow(box.df)) { 417 | chrA.AS_i <- box.df$BoundaryAS[i] 418 | chrA.AE_i <- box.df$BoundaryAE[i] 419 | chrB.BS_i <- box.df$BoundaryBS[i] 420 | chrB.BE_i <- box.df$BoundaryBE[i] 421 | j <- 1 422 | while (j <= nrow(box.df)) { 423 | chrA.AS_j <- box.df$BoundaryAS[j] 424 | chrA.AE_j <- box.df$BoundaryAE[j] 425 | chrB.BS_j <- box.df$BoundaryBS[j] 426 | chrB.BE_j <- box.df$BoundaryBE[j] 427 | if (step==0 | step==2) { 428 | chrA.AS_AS.ij <- abs(chrA.AS_i - chrA.AS_j) 429 | chrA.AE_AS.ij <- abs(chrA.AE_i - chrA.AS_j) 430 | chrA.AS_AE.ij <- abs(chrA.AS_i - chrA.AE_j) 431 | chrA.AE_AE.ij <- abs(chrA.AE_i - chrA.AE_j) 432 | chrB.BS_BS.ij <- abs(chrB.BS_i - chrB.BS_j) 433 | chrB.BE_BS.ij <- abs(chrB.BE_i - chrB.BS_j) 434 | chrB.BS_BE.ij <- abs(chrB.BS_i - chrB.BE_j) 435 | chrB.BE_BE.ij <- abs(chrB.BE_i - chrB.BE_j) 436 | 437 | chrA.dist[i,j] <- min(chrA.AS_AS.ij,chrA.AE_AS.ij,chrA.AS_AE.ij,chrA.AE_AE.ij) 438 | chrB.dist[i,j] <- min(chrB.BS_BS.ij,chrB.BE_BS.ij,chrB.BS_BE.ij,chrB.BE_BE.ij) 439 | 440 | } else { 441 | 442 | chrA.dist[i,j] <- OverLap(chrA.AS_i,chrA.AE_i,chrA.AS_j,chrA.AE_j) 443 | chrB.dist[i,j] <- OverLap(chrB.BS_i,chrB.BE_i,chrB.BS_j,chrB.BE_j) 444 | 445 | } 446 | j <- j + 1 447 | } 448 | i <- i + 1 449 | } 450 | 451 | if (nrow(chrA.dist) > 1 & nrow(chrB.dist) > 1) { 452 | chrA.hclust <- hclust(as.dist(chrA.dist),method="single") 453 | chrB.hclust <- hclust(as.dist(chrB.dist),method="single") 454 | chrA.clust <- cutree(chrA.hclust,h=cl.A) 455 | chrB.clust <- cutree(chrB.hclust,h=cl.B) 456 | box.df[,"chrA.clus"] <- chrA.clust 457 | box.df[,"chrB.clus"] <- chrB.clust 458 | } else { 459 | chrA.clust <- 1 460 | chrB.clust <- 1 461 | box.df[,"chrA.clus"] <- chrA.clust 462 | box.df[,"chrB.clus"] <- chrB.clust 463 | } 464 | clus <- unique(data.frame(chrA.clust,chrB.clust)) 465 | if (step==1) { 466 | return(box.df) 467 | } 468 | if (step==0) { 469 | box.cl <- list() 470 | i <- 1 471 | while (i <= nrow(clus)) { 472 | d <- box.df[box.df$chrA.clus==clus$chrA.clust[i] & box.df$chrB.clus==clus$chrB.clust[i],] 473 | box.cl[[i]] <- data.frame(chrA=d$chrA[1],BoundaryAS=min(d$BoundaryAS),BoundaryAE=max(d$BoundaryAE), 474 | chrB=d$chrB[1],BoundaryBS=min(d$BoundaryBS),BoundaryBE=max(d$BoundaryBE), 475 | zscore=mean(d$zscore),count=mean(d$count),resolution=d$resolution[1], 476 | id=paste0(d$id,collapse=","),A=clus$chrA.clust[i],B=clus$chrB.clust[i]) 477 | i <- i + 1 478 | } 479 | box.cl <- as.data.frame(do.call(rbind,box.cl)) 480 | colnames(box.cl) <- c("chrA","BoundaryAS","BoundaryAE","chrB","BoundaryBS","BoundaryBE","zscore","count","resolution","id",as.character(colNameA),as.character(colNameB)) 481 | return(box.cl) 482 | } 483 | } 484 | 485 | ## Finding out the common multi-resolution translocation boxes and breakpoints ## 486 | ZoomIn <- function(d, f, r, l) { 487 | 488 | g <- 1 489 | clus <- unique(data.frame(d[,c("chrA.clus","chrB.clus")])) 490 | w <- list() 491 | i <- 1 492 | while (i <= nrow(clus)) { 493 | v <- d[d$chrA.clus==clus$chrA.clus[i] & d$chrB.clus==clus$chrB.clus[i],] 494 | b <- list() 495 | j <- 1 496 | while (j <= length(r)) { 497 | b[[j]] <- v[v$resolution==r[j],] 498 | if (nrow(b[[j]]) > 0) { 499 | b[[j]][,"Level"] <- paste0("L_",j) 500 | } 501 | j <- j + 1 502 | } 503 | b <- do.call(rbind,b) 504 | if (nrow(b) >= l) { 505 | w[[i]] <- b 506 | w[[i]][,"group"] <- g 507 | g <- g + 1 508 | } 509 | i <- i + 1 510 | } 511 | w <- as.data.frame(do.call(rbind, w)) 512 | g <- unique(w$g) 513 | b <- list() 514 | h <- list() 515 | n <- 1 516 | i <- 1 517 | while (i <= length(g)) { 518 | d <- w[w$g==g[i],] 519 | d <- d[d$Level==d[nrow(d),]$Level,] 520 | j <- 1 521 | while (j <= nrow(d)) { 522 | k <- as.integer(strsplit(as.character(d$id[j]),",")[[1]]) 523 | r <- d$resolution[j] 524 | b[[n]] <- f[f$resolution==r & f$id %in% k & f$type=="BreakPoint",] 525 | j <- j + 1 526 | n <- n + 1 527 | } 528 | h[[i]] <- w[w$g==g[i],] 529 | h[[i]] <- h[[i]][h[[i]]$Level==h[[i]][1,]$Level,] 530 | i <- i + 1 531 | } 532 | b <- as.data.frame(do.call(rbind,b)) 533 | if (nrow(b) > 0) { 534 | b <- HClust(box.df=b,cl.A=min(b$resolution),cl.B=min(b$resolution),colNameA="resA.cl",colNameB="resB.cl",step=0) 535 | h <- as.data.frame(do.call(rbind,h)) 536 | b <- b[,c(1:9)] 537 | h <- h[,c(1:9)] 538 | b[,"class"] <- "BreakPoint" 539 | h[,"class"] <- "TranslocationBox" 540 | df <- as.data.frame(rbind(h,b)) 541 | } else { 542 | df <- data.frame(chrA=c(),BoundaryAS=c(),BoundaryAE=c(),chrB=c(),BoundaryBS=c(),BoundaryBE=c(),zscore=c(),count=c(),resolution=c(),id=c(),A=c(),B=c(),class=c()) 543 | } 544 | return(df) 545 | } 546 | 547 | ## From ijk type data, this function will create a matrix object ## 548 | CreateMatrix <- function(ijk.file, bed.file, chromA, chromB, prefix) { 549 | 550 | ijk <- as.data.frame(fread(ijk.file,h=F)) 551 | bed <- as.data.frame(fread(bed.file,h=F)) 552 | colnames(ijk) <- c("A","B","C") 553 | colnames(bed) <- c("chr","start","end","index") 554 | bed <- bed[bed$chr %in% c(chromA,chromB),] 555 | indexA.conversion <- hashmap(bed[bed$chr==chromA,]$index,c(1:nrow(bed[bed$chr==chromA,]))) 556 | indexB.conversion <- hashmap(bed[bed$chr==chromB,]$index,c(1:nrow(bed[bed$chr==chromB,]))) 557 | ijk <- ijk[ijk$A %in% bed[bed$chr==chromA,]$index,] 558 | ijk <- ijk[ijk$B %in% bed[bed$chr==chromB,]$index,] 559 | ijk$A <- indexA.conversion[[ijk$A]] 560 | ijk$B <- indexB.conversion[[ijk$B]] 561 | bed[bed$chr==chromA,]$index <- 1:nrow(bed[bed$chr==chromA,]) 562 | bed[bed$chr==chromB,]$index <- 1:nrow(bed[bed$chr==chromB,]) 563 | mat <- matrix(0,nrow=nrow(bed[bed$chr==chromA,]),ncol=nrow(bed[bed$chr==chromB,])) 564 | print (dim(mat)) 565 | i <- 1 566 | while (i <= nrow(ijk)) { 567 | mat[ijk$A[i],ijk$B[i]] <- ijk$C[i] 568 | i <- i + 1 569 | } 570 | write.table(mat,file=paste0(prefix,".",chromA,"_",chromB,".mat.txt"),row.names=F,col.names=F,sep="\t",quote=F) 571 | return(list(mat=as.matrix(mat),bedA=as.data.frame(bed[bed$chr==chromA,]),bedB=as.data.frame(bed[bed$chr==chromB,]))) 572 | } 573 | 574 | ## This function scans the inter-chromosomal interaction matrix for mean values with a given box size ## 575 | cppFunction('NumericVector lmat(NumericMatrix mat, int x, int y) { 576 | int n = mat.nrow(); 577 | int m = mat.ncol(); 578 | int k = 0; 579 | int s1 = round(n/x); 580 | int s2 = round(m/y); 581 | int s3 = (s1 * s2); 582 | Rcpp::NumericVector M(s3); 583 | for(int i = 0; i < n; i = i + x) { 584 | int a = i; 585 | int b = i + (x-1); 586 | for (int j = 0; j < m; j = j + y){ 587 | int u = j; 588 | int v = j + (y-1); 589 | if (b < mat.nrow()) { 590 | if (v < mat.ncol()) { 591 | NumericMatrix F = mat(Range(a,b),Range(u,v)); 592 | M[k] = findMean(F); 593 | k = k + 1; 594 | } 595 | } 596 | } 597 | } 598 | return(M); 599 | }', includes='double findMean(NumericMatrix F) { 600 | int sum = 0; 601 | int n = F.nrow(); 602 | int m = F.ncol(); 603 | for (int i=0; i 0] <- 1 650 | x.val <- rowSums(m)/sum(rowSums(m)) 651 | y.val <- colSums(m)/sum(colSums(m)) 652 | val.mat <- outer(x.val,y.val) 653 | val.mat <- mat[s.1:e.1,s.2:e.2] * val.mat 654 | p <- which(val.mat >= quantile(val.mat, locq), arr.ind = TRUE) 655 | x.pos <- as.integer(rownames(m)[p[,1]]) 656 | y.pos <- as.integer(colnames(m)[p[,2]]) 657 | j <- 1 658 | while (j <= nrow(p)) { 659 | if (x.pos[j] >= 2 & x.pos[j] < nrow(mat) & y.pos[j] >= 2 & y.pos[j] < ncol(mat)) { 660 | max.count <- max(mat[c(x.pos[j]-1):c(x.pos[j]+1),c(y.pos[j]-1):c(y.pos[j]+1)]) 661 | b[[k]] <- cbind(box[i,],pos.1=x.pos[j],pos.2=y.pos[j],count=max.count) 662 | k <- k + 1 663 | } 664 | j <- j + 1 665 | } 666 | i <- i + 1 667 | } 668 | #b <- as.data.frame(do.call(rbind,b)) 669 | b <- as.data.frame(data.table::rbindlist(b)) 670 | return(b) 671 | } 672 | 673 | ## This function calls the lmat function described above ## 674 | BackgroundEnrichment <- function(i, mat, f) { 675 | 676 | print (f[i,]) 677 | x.s <- f$d1[i] 678 | y.s <- f$d2[i] 679 | val <- lmat(mat,x.s,y.s) 680 | return(data.frame(MEAN=mean(val),SD=sd(val))) 681 | 682 | } 683 | 684 | ## Check for all combination of 1D segments from both the chromosomes, the enrichment of their interaction ## 685 | BoxEnrichment <- function(m, x, y, core, boxzscore, cutoff=0.05) { 686 | 687 | n <- 0 688 | repeat { 689 | print (paste0("Estimating box enrichment: Trimmining ",cutoff*1e2,"% of values")) 690 | d <- list() 691 | k <- 1 692 | i <- 1 693 | while (i <= nrow(x)) { 694 | a <- x$start[i] 695 | b <- x$end[i] 696 | j <- 1 697 | while (j <= nrow(y)) { 698 | u <- y$start[j] 699 | v <- y$end[j] 700 | d[[k]] <- data.frame(start1=x$start[i],end1=x$end[i],d1=(x$end[i]-x$start[i]), 701 | start2=y$start[j],end2=y$end[j],d2=(y$end[j]-y$start[j]), 702 | val=mean(m[a:b,u:v],trim=cutoff)) 703 | k <- k + 1 704 | j <- j + 1 705 | } 706 | i <- i + 1 707 | } 708 | d <- as.data.frame(do.call(rbind,d)) 709 | d[,"val.center"] <- d$val - mean(d$val) 710 | d <- d[d$val.center > 0 & d$d1 > 2 & d$d2 > 2,] 711 | cutoff <- cutoff - 0.04 712 | if (nrow(d) > 0) { 713 | break 714 | } 715 | if (cutoff < 0) { 716 | n <- n + 1 717 | if (n == 1) { 718 | cutoff <- 0 719 | } else if (n > 1) { 720 | print ("No enrichemnt of changepoint boxes detected! Try to run with Lower resolution") 721 | break 722 | } 723 | } 724 | print ("Lowering trimming threshold") 725 | } 726 | if (nrow(d) > 0) { 727 | v <- lapply(c(1:nrow(d)), BackgroundEnrichment, m, d) 728 | v <- as.data.frame(do.call(rbind,v)) 729 | d[,"val.bg.mean"] <- v$MEAN 730 | d[,"val.bg.sd"] <- v$SD 731 | d[,"zscore"] <- ifelse(is.na(v$SD), NA, (d$val - d$val.bg.mean)/d$val.bg.sd) 732 | d <- na.omit(d) 733 | d <- d[d$zscore >= boxzscore,] 734 | print (d) 735 | if (nrow(d) > 0) { 736 | return(d) 737 | } else { 738 | print (paste0("No enriched segment found")) 739 | return(data.frame()) 740 | } 741 | } else { 742 | print ("No enrichemnt of changepoint boxes detected! Try to run with Lower resolution") 743 | return(data.frame()) 744 | } 745 | } 746 | 747 | ## Given a change-point vector, convert the vector to data.frame like object ## 748 | CptsInBedFormat <- function(cpt.obj) { 749 | 750 | start <- list() 751 | end <- list() 752 | val <- list() 753 | start[[1]] <- 1 754 | end[[1]] <- cpt.obj@cpts[1] 755 | val[[1]] <- cpt.obj@param.est$mean[1] 756 | i <- 2 757 | while (i <= length(cpt.obj@cpts)) { 758 | start[[i]] <- cpt.obj@cpts[i-1] 759 | end[[i]] <- cpt.obj@cpts[i] 760 | val[[i]] <- cpt.obj@param.est$mean[i] 761 | i <- i + 1 762 | } 763 | start <- unlist(start) 764 | end <- unlist(end) 765 | val <- unlist(val) 766 | return(data.frame(start,end,val)) 767 | } 768 | 769 | ## This is the main function which call the binary-segmentation algorithm and detects 1D segmentations ## 770 | GetSegments <- function(mat, bed, chromA, chromB, prefix, covq=0.75, locq, mincount, minzscore, resolution, glbq, boxzscore) { 771 | 772 | r.sum <- rowSums(mat) 773 | c.sum <- rowSums(t(mat)) 774 | r.center <- r.sum - quantile(r.sum, covq) 775 | c.center <- c.sum - quantile(c.sum, covq) 776 | 777 | ##Modified on 09/27/2021 778 | ##Commented 779 | #seg <- ifelse(resolution > 250000, 50, 100) 780 | #r.cpt <- suppressWarnings(cpt.meanvar(r.center,Q=seg,method="BinSeg")) 781 | #c.cpt <- suppressWarnings(cpt.meanvar(c.center,Q=seg,method="BinSeg")) 782 | ##Updated 783 | r.cpt <- suppressWarnings(cpt.meanvar(r.center,Q=ifelse((length(r.center)/2) < 100, round(((length(r.center)/2) - 1), 0), 100),method="BinSeg")) 784 | c.cpt <- suppressWarnings(cpt.meanvar(c.center,Q=ifelse((length(c.center)/2) < 100, round(((length(c.center)/2) - 1), 0), 100),method="BinSeg")) 785 | 786 | ## Flush changepoint summary ## 787 | print ("Row and Column changepoint summary") 788 | print (r.cpt) 789 | print (c.cpt) 790 | cat ("\n") 791 | ############################### 792 | 793 | r.df <- CptsInBedFormat(r.cpt) 794 | c.df <- CptsInBedFormat(c.cpt) 795 | r.df <- r.df[r.df$val > 0,] 796 | c.df <- c.df[c.df$val > 0,] 797 | 798 | if (nrow(r.df) > 0 & nrow(c.df) > 0) { 799 | ## Flush out the changepoints and their mean values ## 800 | print ("Row segments with positive values") 801 | print (r.df) 802 | print ("Column segments with positive values") 803 | print (c.df) 804 | ###################################################### 805 | 806 | box <- BoxEnrichment(mat,r.df,c.df,core,boxzscore) 807 | if (nrow(box) > 0) { 808 | box <- EstimateMaxRegion(box, mat, locq) 809 | if (nrow(box) > 0) { 810 | chrA.start.hash <- hashmap(bed[bed$chr==chromA,]$index,bed[bed$chr==chromA,]$start) 811 | chrB.start.hash <- hashmap(bed[bed$chr==chromB,]$index,bed[bed$chr==chromB,]$start) 812 | chrA.end.hash <- hashmap(bed[bed$chr==chromA,]$index,bed[bed$chr==chromA,]$end) 813 | chrB.end.hash <- hashmap(bed[bed$chr==chromB,]$index,bed[bed$chr==chromB,]$end) 814 | box[,"chr1"] <- chromA 815 | box[,"x1"] <- chrA.start.hash[[box$start1]] 816 | box[,"y1"] <- chrA.start.hash[[box$end1]] 817 | box[,"chr2"] <- chromB 818 | box[,"x2"] <- chrB.start.hash[[box$start2]] 819 | box[,"y2"] <- chrB.start.hash[[box$end2]] 820 | box[,"chrA"] <- chromA 821 | box[,"chrA.Brk.pos1"] <- chrA.start.hash[[box$pos.1]] 822 | box[,"chrA.Brk.pos2"] <- chrA.end.hash[[box$pos.1]] 823 | box[,"chrB"] <- chromB 824 | box[,"chrB.Brk.pos1"] <- chrB.start.hash[[box$pos.2]] 825 | box[,"chrB.Brk.pos2"] <- chrB.end.hash[[box$pos.2]] 826 | write.table(box,file=paste0(prefix,".",chromA,"_",chromB,".Details.txt"),row.names=F,sep="\t",quote=F) 827 | 828 | wt.threshold <- data.frame(count=as.integer(names(table(as.vector(mat)))),freq=as.vector(table(as.vector(mat)))) 829 | wt.threshold <- wt.threshold[-c(1),] 830 | wt.threshold[,"csum"] <- cumsum(wt.threshold$freq)/sum(wt.threshold$freq) 831 | wt.threshold <- min(wt.threshold[wt.threshold$csum >= glbq,]$count) 832 | wt.threshold <- ifelse(wt.threshold < mincount, mincount, wt.threshold) 833 | print (wt.threshold) 834 | box[,"color"] <- "0,0,255" 835 | box <- box[,c("chr1","x1","y1","chr2","x2","y2","color","zscore","chrA","chrA.Brk.pos1","chrA.Brk.pos2","chrB","chrB.Brk.pos1","chrB.Brk.pos2","count")] 836 | box <- box[box$count >= wt.threshold & box$zscore >= minzscore,] 837 | if (nrow(box) > 0) { 838 | box[,"id"] <- 1:nrow(box) 839 | boxA <- box[,c("chr1","x1","y1","chr2","x2","y2","color","zscore","count","id")] 840 | boxB <- data.frame(chr1=box$chrA,x1=box$chrA.Brk.pos1,y1=box$chrA.Brk.pos2, 841 | chr2=box$chrB,x2=box$chrB.Brk.pos1,y2=box$chrB.Brk.pos2, 842 | color=box$color,zscore=box$zscore,count=box$count,id=box$id) 843 | boxB[,"color"] <- "0,0,0" 844 | boxA[,"type"] <- "TranslocationBox" 845 | boxB[,"type"] <- "BreakPoint" 846 | box <- rbind(boxA,boxB) 847 | box[,"resolution"] <- resolution 848 | write.table(box,file=paste0(prefix,".",chromA,"_",chromB,".Translocations_jcbx.txt"),row.names=F,sep="\t",quote=F) 849 | return(box) 850 | } else { 851 | print ("Minimum count filter not passed. Try to decrease the minimum count to get a breakpoint at this resolution") 852 | return(data.frame()) 853 | } 854 | } else { 855 | print ("Minimum count percentile not passed. Try to decrease the minimum count percentile to get a breakpoint at this resolution") 856 | return(data.frame()) 857 | } 858 | } else { 859 | print ("Minimum count filter not passed. Try to decrease the minimum count to get a breakpoint at this resolution") 860 | return(data.frame()) 861 | } 862 | } else { 863 | if (covq > 0.5) { 864 | print (paste0("Descresing the quantile value to ",(covq-0.25)," to get positive segments")) 865 | GetSegments(mat, bed, chromA, chromB, prefix, covq=(covq-0.25), locq, mincount, minzscore, resolution, glbq, boxzscore) 866 | } else { 867 | print ("No positive segments from changepoint analysis! Try a lower resolution hic data for this chromosome") 868 | return(data.frame()) 869 | } 870 | } 871 | } 872 | 873 | 874 | CoordinateShrunkage <- function(v,len) { 875 | 876 | d <- list() 877 | j <- 1 878 | while(j < length(v)) { 879 | d[[j]] <- data.frame(p=j,s=v[j],e=(v[j+1]-1)) 880 | j <- j + 1 881 | } 882 | if (v[j]==len) { 883 | d[[j]] <- data.frame(p=j,s=v[j],e=v[j]) 884 | } else { 885 | d[[j]] <- data.frame(p=j,s=v[j],e=len) 886 | } 887 | d <- as.data.frame(do.call(rbind,d)) 888 | return(d) 889 | 890 | } 891 | 892 | ShrinkMatrix <- function(mat.high,x.coord,y.coord) { 893 | 894 | cat ("Shrinking Matrix to lower resolution\n") 895 | x.coord <- as.matrix(x.coord) 896 | y.coord <- as.matrix(y.coord) 897 | colnames(x.coord) <- NULL 898 | colnames(y.coord) <- NULL 899 | mat.low <- ShrinkMatrixCPP(mat.high,x.coord,y.coord) 900 | return(mat.low) 901 | } 902 | 903 | ## This function provide a Lower-resolution Hi-C matrix from a Higher-resolution Hi-C matrix ## 904 | HighToLowResolutionCoversion <- function(mat, high.resolution, low.resolution, bed.x, bed.y, bed, prefix) { 905 | 906 | mat_list <- list() 907 | bed_list <- list() 908 | x.coord_list <- list() 909 | y.coord_list <- list() 910 | mat_list[[1]] <- mat 911 | bed_list[[1]] <- bed 912 | i <- 1 913 | while (i <= length(low.resolution)) { 914 | x.start <- hashmap(c(bed.x$index),c(bed.x$start)) 915 | x.end <- hashmap(c(bed.x$index),c(bed.x$end)) 916 | y.start <- hashmap(c(bed.y$index),c(bed.y$start)) 917 | y.end <- hashmap(c(bed.y$index),c(bed.y$end)) 918 | scale.factor <- low.resolution[i]/high.resolution 919 | x.dim <- nrow(mat) 920 | y.dim <- ncol(mat) 921 | x.divison <- seq(1,x.dim,by=scale.factor) 922 | y.divison <- seq(1,y.dim,by=scale.factor) 923 | x.coord <- CoordinateShrunkage(x.divison,x.dim) 924 | y.coord <- CoordinateShrunkage(y.divison,y.dim) 925 | bedA <- data.frame(chr=rep(as.character(bed.x$chr[1]),nrow(x.coord)),start=x.start[[x.coord$s]],end=x.end[[x.coord$e]],index=c(1:nrow(x.coord))) 926 | bedB <- data.frame(chr=rep(as.character(bed.y$chr[1]),nrow(y.coord)),start=y.start[[y.coord$s]],end=y.end[[y.coord$e]],index=c(1:nrow(y.coord))) 927 | x.coord_list[[i]] <- cbind(bedA,x.coord) 928 | y.coord_list[[i]] <- cbind(bedB,y.coord) 929 | mat_list[[i+1]] <- ShrinkMatrix(mat,x.coord,y.coord) 930 | bed_list[[i+1]] <- data.frame(rbind(bedA,bedB)) 931 | 932 | write.table(mat_list[[i+1]],file=paste0(prefix,"_",as.integer(low.resolution[i]),".",bedA$chr[1],"_",bedB$chr[1],".matrix"),row.names=F,col.names=F,sep="\t",quote=F) 933 | write.table(bed_list[[i+1]],file=paste0(prefix,"_",as.integer(low.resolution[i]),".",bedA$chr[1],"_",bedB$chr[1],"_abs.bed"),row.names=F,col.names=F,sep="\t",quote=F) 934 | 935 | i <- i + 1 936 | } 937 | d <- list(mat=mat_list,bed=bed_list,x.coord=x.coord_list,y.coord=y.coord_list) 938 | return(d) 939 | } 940 | 941 | option_list = list( 942 | make_option(c("--mat"), type="character", help="An upper triangular Hi-C sparse matrix\n\t\tIt should have the following columns\n 943 | \t\t \n 944 | \t\t1 1 300 945 | \t\t1 2 30 946 | \t\t1 3 10 947 | \t\t2 2 200 948 | \t\t2 3 20 949 | \t\t3 3 200 950 | \t\t....\n"), 951 | 952 | make_option(c("--bed"), type="character", help="Bed file with index information\n\t\tIt should have the following columns\n\n 953 | \t\t \n 954 | \t\tchr1 1 40000 1 955 | \t\tchr1 40000 80000 2 956 | \t\tchr1 80000 120000 3 957 | \t\t....\n"), 958 | 959 | make_option(c("--chrA"), type="character", help="Chromosome A name. It will represent the rows in the inter-chromosomal matrix. It should be the chromosome.\n"), 960 | 961 | make_option(c("--chrB"), type="character", help="Chromosome B name. It will represent the columns in the inter-chromosomal matrix. It should be the chromosome.\n"), 962 | 963 | make_option(c("--prefix"), type="character", help="Prefix of the output file . All the output files and folders will be generated with this prefix.\n"), 964 | 965 | make_option(c("--covq"), type="numeric", default=0.1, help="Quantile value to be subtracted from one dimensional trans-coverage profile [trans.coverage - quantile(trans.coverage, covq)] [default 0.10].\n 966 | \t\tBins with very low coverage values are removed with this filter. 967 | \t\tIncreasing value will keep only the most stringent bins in the two chromosome.\n"), 968 | 969 | make_option(c("--minzscore"), type="numeric", default=1, help="Minimum Zscore of a possible translocation box to be retained [default is 1].\n 970 | \t\tHiCtrans will find enriched boxes within the inter-chromosomal matrix as potential translocation box. 971 | \t\tThe enrichment is calculated as Z-score against a background with all possible similar sized boxes in the inter-chromosomal matrix. 972 | \t\tIncreasing value will keep the most enriched trans interacting boxes.\n"), 973 | 974 | make_option(c("--minboxsize"), type="numeric", default=0, help="Minimum size of a possible translocation box relative to its Hi-C resolution [default is 0 i.e. no filtering. If set to non-zero value, then (Breakpoint.start - Breakpoint.end)/HiC.resolution > minboxsize filtering will be applied].\n 975 | \t\tHiCtrans will find enriched boxes within the inter-chromosomal matrix as potential translocation box. 976 | \t\tThe minimum box size threshold will filter out small false positive multi-resolution supported potential translocation boxes.\n 977 | \t\tIncreasing value will keep the most enriched and larger trans interacting boxes.\n"), 978 | 979 | make_option(c("--boxzscore"), type="numeric", default=1, help="Minimum Zscore of a possible translocation box to be retained [default is 1].\n 980 | \t\tHiCtrans will keep boxes enriched above boxzscore threshold to find translocations among them. 981 | \t\tIncreasing value will keep the most enriched trans interacting boxes.\n"), 982 | 983 | make_option(c("--locq"), type="numeric", default=0.1, help="Top percentile to be reported as possible breakpoints within a translocation box [default top 0.1%]\n 984 | \t\tFor each enriched translocated boxes, HiCtrans will report top % interacting pairs (Weighted by the frequency of total interaction). 985 | \t\tDecreasing will reduce the number of reported breakpoints within an enriched trans interacting box. A value of 0 will report only the top interacting pair\n"), 986 | 987 | make_option(c("--mincount"), type="numeric", default=10, help="Minimum count of a possible breakpoint to be retained when compared to all possible chrA-chrB interaction [default cutoff is 10].\n 988 | \t\tThis is an absolute minimum count cutoff to filter out breakpoints detected at any resolution. 989 | \t\tIncreasing value will keep only the most stringent interacting pair.\n"), 990 | 991 | make_option(c("--glbq"), type="numeric", default=0.1, help="Percentile value for minimum count cutoff at each resolution [default cutoff is at top 0.1% of the count distribution].\n 992 | \t\tThis is a relative count cutoff based on the inter-chromosomal count distribution determined for each resolution independently. 993 | \t\tIncreasing value keep only the most stringent interacting pair.\n"), 994 | 995 | make_option(c("--resolutions"), type="character", default="2,4,5,10", help="Comma separated list of integers to be multiplied with the starting Hi-C resolution to get the lower resolutions [default 2,4,5,10].\n 996 | \t\tHiCtrans will search for enriched trans-interacting boxes and breakpoint finding at different resolutions. 997 | \t\tProvide only integer values in a comma separated list.\n"), 998 | 999 | make_option(c("--multires"), type="numeric", default=2, help="Number of Hi-C resolutions at which the breakpoint should be supported with [default is at least 2 different resolutions].\n 1000 | \t\tHiCtrans wiil find enriched trans-interacting boxes and subsequent breakpoints at different resolutions. The ultimate goal is to find a true translocation supported by multiple resolutions. 1001 | \t\tIncreasing value will keep only the enriched boxes and breakpoints supported by at least number of different resolution.\n"), 1002 | 1003 | make_option(c("--maxres"), type="numeric", default=3, help="Maximum resolution upto which the breakpoint is kept after multi-resolution filtering [default is 3 X user provided Hi-C resolution].\n 1004 | \t\tHiCtrans wiil find enriched trans-interacting boxes and subsequent breakpoints at different resolutions. The ultimate goal is to find a true translocation supported by the highest resolutions. 1005 | \t\tIncreasing value will keep only the enriched boxes and breakpoints supported by upto X starting Hi-C resolutions.\n"), 1006 | 1007 | make_option(c("--relevel"), type="character", default="No", help="Should the breakpoints be refined upto restriction-level resolution [default is 'No'; If 'Yes', the following parameters are MUST]\n"), 1008 | 1009 | make_option(c("--fragsFile"), type="character", help="Restriction Fragment file [MUST].\n 1010 | \t\tchr1 0 16007 HIC_chr1_1 0 + 1011 | \t\tchr1 16007 24571 HIC_chr1_2 0 + 1012 | \t\tchr1 24571 27981 HIC_chr1_3 0 + 1013 | \t\t......\n"), 1014 | 1015 | make_option(c("--chromsize"), type="character", help="Chromosome size file [MUST].\n 1016 | \t\tchr1\t249250621 1017 | \t\tchr2\t243199373 1018 | \t\tchr3\t198022430 1019 | \t\tchr4\t191154276 1020 | \t\t.....\n"), 1021 | 1022 | make_option(c("--validpair"), type="character", help="Valid pair file of the HiC data [MUST].\n 1023 | \t\tSRR6213722.1\tchr11\t124331538\t-\tchr11\t124345246\t- 1024 | \t\tSRR6213722.2\tchr1\t198436365\t-\tchr1\t199923196\t+ 1025 | \t\t.....\n"), 1026 | 1027 | make_option(c("--clusdist"), default=1e6,type="integer", help="Distance threshold in basepairs to cluster the nearby breakpoints obtained from multi-resolution filtered (MultiResolution_Filtered.Translocation.txt) or individual Translocations_jcbx.txt files [Default 1Mb]"), 1028 | make_option(c("--ssA"), default=1e5, type="integer", help="Extend -(ve) bp of the 5' HMM segment border of chromosome A for breakpoint identification. Default 100Kb."), 1029 | make_option(c("--seA"), default=1e5, type="integer", help="Extend +(ve) bp of the 3' HMM segment border of chromosome A for breakpoint identification. Default 100Kb."), 1030 | make_option(c("--ssB"), default=1e5, type="integer", help="Extend -(ve) bp of the 5' HMM segment border of chromosome B for breakpoint identification. Default 100Kb."), 1031 | make_option(c("--seB"), default=1e5, type="integer", help="Extend +(ve) bp of the 3' HMM segment border of chromosome B for breakpoint identification. Default 100Kb.\n"), 1032 | make_option(c("--precheck"), default=1, type="numeric", help="Precheck option will help to restrict HiCtrans search only to chromosome combinations with significant max interaction compared to mean non-zero count value. [Default 1. Lower value will increase stringency]") 1033 | ) 1034 | opt = parse_args(OptionParser(option_list=option_list)) 1035 | 1036 | prefix <- as.character(opt$prefix) 1037 | 1038 | bed <- as.character(opt$bed) 1039 | bed <- read.table(bed,h=F) 1040 | colnames(bed) <- c("chr","start","end","index") 1041 | 1042 | high.resolution <- bed$end[1] - bed$start[1] 1043 | low.resolution <- as.integer(strsplit(as.character(opt$resolutions),",")[[1]]) * high.resolution 1044 | high.resolution <- as.integer(high.resolution) 1045 | low.resolution <- as.integer(low.resolution) 1046 | multires <- as.integer(opt$multires) 1047 | maxres <- as.integer(opt$maxres) * high.resolution 1048 | chromA <- as.character(opt$chrA) 1049 | chromB <- as.character(opt$chrB) 1050 | 1051 | if (!file.exists(paste0(prefix,"_hictrans"))) { 1052 | dir.create(paste0(prefix,"_hictrans")) 1053 | } 1054 | if (!file.exists(paste0(prefix,"_hictrans/",prefix,"_hictrans_",chromA,"_",chromB,"_",as.integer(high.resolution)))) { 1055 | dir.create(paste0(prefix,"_hictrans/",prefix,"_hictrans_",chromA,"_",chromB,"_",as.integer(high.resolution))) 1056 | mat.dir.name <- dirname(opt$mat) 1057 | bed.dir.name <- dirname(opt$bed) 1058 | 1059 | if (opt$relevel == "Yes" | opt$relevel == "YES") { 1060 | fragsFile_path <- normalizePath(opt$fragsFile) 1061 | validpair_path <- normalizePath(opt$validpair) 1062 | } 1063 | chromsize_path <- normalizePath(opt$chromsize) 1064 | mat_pat <- normalizePath(opt$mat) 1065 | bed_pat <- normalizePath(opt$bed) 1066 | 1067 | setwd(paste0(prefix,"_hictrans/",prefix,"_hictrans_",chromA,"_",chromB,"_",as.integer(high.resolution))) 1068 | cmd <- paste0("ln -s ",mat_pat) 1069 | system(cmd, wait=T) 1070 | cmd <- paste0("ln -s ",bed_pat) 1071 | system(cmd, wait=T) 1072 | 1073 | if (opt$relevel == "Yes" | opt$relevel == "YES") { 1074 | cmd <- paste0("ln -s ",fragsFile_path) 1075 | system(cmd, wait=T) 1076 | cmd <- paste0("ln -s ",chromsize_path) 1077 | system(cmd, wait=T) 1078 | cmd <- paste0("ln -s ",validpair_path) 1079 | system(cmd, wait=T) 1080 | } 1081 | 1082 | } else { 1083 | 1084 | mat.dir.name <- dirname(opt$mat) 1085 | bed.dir.name <- dirname(opt$bed) 1086 | 1087 | mat_pat <- normalizePath(opt$mat) 1088 | bed_pat <- normalizePath(opt$bed) 1089 | chromsize_path <- normalizePath(opt$chromsize) 1090 | 1091 | if (opt$relevel == "Yes" | opt$relevel == "YES") { 1092 | fragsFile_path <- normalizePath(opt$fragsFile) 1093 | validpair_path <- normalizePath(opt$validpair) 1094 | } 1095 | 1096 | setwd(paste0(prefix,"_hictrans/",prefix,"_hictrans_",chromA,"_",chromB,"_",as.integer(high.resolution))) 1097 | cmd <- paste0("ln -s ",mat_pat) 1098 | system(cmd, wait=T) 1099 | cmd <- paste0("ln -s ",bed_pat) 1100 | system(cmd, wait=T) 1101 | 1102 | if (opt$relevel == "Yes" | opt$relevel == "YES") { 1103 | cmd <- paste0("ln -s ",fragsFile_path) 1104 | system(cmd, wait=T) 1105 | cmd <- paste0("ln -s ",chromsize_path) 1106 | system(cmd, wait=T) 1107 | cmd <- paste0("ln -s ",validpair_path) 1108 | system(cmd, wait=T) 1109 | } 1110 | } 1111 | 1112 | bed <- bed[bed$chr %in% c(chromA,chromB),] 1113 | bed[bed$chr==chromA,]$index <- 1:nrow(bed[bed$chr==chromA,]) 1114 | bed[bed$chr==chromB,]$index <- 1:nrow(bed[bed$chr==chromB,]) 1115 | 1116 | mat <- CreateMatrix(as.character(mat_pat), 1117 | as.character(bed_pat), 1118 | as.character(opt$chrA), 1119 | as.character(opt$chrB), 1120 | as.character(opt$prefix)) 1121 | 1122 | pval <- ppois(max(mat$mat), mean(mat$mat[mat$mat > 0]), lower.tail=F) 1123 | if (pval >= opt$precheck) { 1124 | cat("Max count of highest resolution Hi-C matrix is non-significant. HiCtrans will not search this chromosome combination.\n") 1125 | quit(save="no") 1126 | } else { 1127 | cat ("Max count of highest resolution Hi-C matrix is significant. HiCtrans will go-ahead to search translocation.\n") 1128 | } 1129 | mat <- HighToLowResolutionCoversion(mat$mat,high.resolution,low.resolution, mat$bedA, mat$bedB, bed, prefix) 1130 | 1131 | all.resolution <- sort(c(high.resolution,low.resolution)) 1132 | translocation.boxes <- list() 1133 | 1134 | sink(paste0(prefix,".",chromA,"_",chromB,".log.txt")) 1135 | print(paste0("Started ",Sys.time())) 1136 | for (i in 1:length(all.resolution)) { 1137 | 1138 | covq <- opt$covq 1139 | locq <- opt$locq 1140 | glbq <- opt$glbq 1141 | locq <- 1 - (locq/100) 1142 | glbq <- 1 - (glbq/100) 1143 | mincount <- as.integer(opt$mincount) 1144 | minzscore<- as.integer(opt$minzscore) 1145 | boxzscore<- as.integer(opt$boxzscore) 1146 | prefix_resolution <- paste0(prefix,"_",as.integer(all.resolution[i])) 1147 | print (paste0("Running for ",all.resolution[i]," resolution")) 1148 | translocation.boxes[[i]] <- GetSegments(mat$mat[[i]], mat$bed[[i]], chromA, chromB, prefix_resolution, covq, locq, mincount, minzscore, as.integer(all.resolution[i]), glbq, boxzscore) 1149 | 1150 | } 1151 | translocation.boxes <- as.data.frame(do.call(rbind, translocation.boxes)) 1152 | write.table(translocation.boxes, file=paste0(prefix,"_hictrans.",chromA,"_",chromB,".preCluster.txt"),row.names=F,sep="\t",quote=F) 1153 | print ("Clustering the Translocation boxes at resolution level") 1154 | 1155 | print ("Filtering breakpoint from multi-resolution data") 1156 | if (nrow(translocation.boxes) > 0) { 1157 | resolution <- sort(unique(as.integer(translocation.boxes$resolution)),decreasing=T) 1158 | if (length(resolution) >= opt$multires) { 1159 | data <- list() 1160 | for(i in 1:length(resolution)) { 1161 | box.tmp <- translocation.boxes[translocation.boxes$resolution==resolution[i] & translocation.boxes$type=="TranslocationBox",] 1162 | if (nrow(box.tmp) > 0) { 1163 | data[[i]] <- HClust(box.df=box.tmp,cl.A=0,cl.B=0,colNameA="resA.cl",colNameB="resB.cl",step=0) 1164 | } 1165 | } 1166 | data <- as.data.frame(do.call(rbind,data)) 1167 | data <- HClust(box.df=data,cl.A=0,cl.B=0,colNameA="boxA.cl",colNameB="boxB.cl",step=1) 1168 | data <- ZoomIn(data, translocation.boxes, resolution, multires) 1169 | data <- data[data$resolution <= maxres,] 1170 | data <- data[((data$BoundaryAE - data$BoundaryAS)/data$resolution) > opt$minboxsize | ((data$BoundaryBE - data$BoundaryBS)/data$resolution) > opt$minboxsize,] 1171 | 1172 | if (nrow(data) > 0) { 1173 | write.table(data, file=paste0(prefix,"_hictrans.",chromA,"_",chromB,".MultiResolution_Filtered.Translocation.txt"),row.names=F,sep="\t",quote=F) 1174 | } 1175 | ## Call restriction resolution breakpoint filtering here ## 1176 | if ((opt$relevel == "Yes" | opt$relevel == "YES") & nrow(data) > 0){ 1177 | param <- data 1178 | param <- param[param$class=="BreakPoint",] 1179 | if (nrow(param) == 1) { 1180 | param[2,] <- param[1,] 1181 | } 1182 | 1183 | param <- RE_HClust(param,cl.A=opt$clusdist,cl.B=opt$clusdist) 1184 | print (param) 1185 | chromA_BP_start <- list() 1186 | chromA_BP_end <- list() 1187 | chromB_BP_start <- list() 1188 | chromB_BP_end <- list() 1189 | k <- 1 1190 | while (k <= nrow(param)){ 1191 | opt$chrA <- param$chrA[k] 1192 | opt$startA <- param$BoundaryAS[k] 1193 | opt$endA <- param$BoundaryAE[k] 1194 | opt$chrB <- param$chrB[k] 1195 | opt$startB <- param$BoundaryBS[k] 1196 | opt$endB <- param$BoundaryBE[k] 1197 | 1198 | df <- mapVPs_on_REs( 1199 | re.bed=as.character(fragsFile_path), 1200 | chrom.size=as.character(chromsize_path), 1201 | vp.file=as.character(validpair_path), 1202 | prefix=as.character(opt$prefix), 1203 | chromA=as.character(opt$chrA), 1204 | chromB=as.character(opt$chrB), 1205 | id=k 1206 | ) 1207 | 1208 | cis_trans_coverage(df,as.character(opt$prefix),as.character(opt$chrA),as.character(opt$chrB),id=k) 1209 | boundary <- di_and_hmm( 1210 | coverage=paste0(opt$prefix,"_",opt$chrA,"_",opt$chrB,"_",k,"_Coverage.bed"), 1211 | chromA=as.character(opt$chrA), 1212 | chromA.start=as.integer(opt$startA), 1213 | chromA.end=as.integer(opt$endA), 1214 | chromB=as.character(opt$chrB), 1215 | chromB.start=as.integer(opt$startB), 1216 | chromB.end=as.integer(opt$endB), 1217 | prefix=opt$prefix, 1218 | id=k 1219 | ) 1220 | 1221 | bp_optimized <- bp_optimization( 1222 | paste0(opt$prefix,"_",opt$chrA,"_",opt$chrB,"_",k,".txt"), 1223 | boundary, 1224 | as.character(opt$chrA), 1225 | as.character(opt$chrB), 1226 | ssA=as.integer(opt$ssA), 1227 | ssB=as.integer(opt$ssB), 1228 | seA=as.integer(opt$seA), 1229 | seB=as.integer(opt$seB) 1230 | ) 1231 | chromA_BP_start[k] <- head(df$bed[df$bed$chrom==as.character(opt$chrA) & df$bed$start >= floor(bp_optimized$par[1]),]$start,1) 1232 | chromA_BP_end[k] <- head(df$bed[df$bed$chrom==as.character(opt$chrA) & df$bed$start >= floor(bp_optimized$par[1]),]$end,1) 1233 | chromB_BP_start[k] <- head(df$bed[df$bed$chrom==as.character(opt$chrB) & df$bed$start >= floor(bp_optimized$par[2]),]$start,1) 1234 | chromB_BP_end[k] <- head(df$bed[df$bed$chrom==as.character(opt$chrB) & df$bed$start >= floor(bp_optimized$par[2]),]$end,1) 1235 | 1236 | if (is.null(chromA_BP_start[k][[1]])) { 1237 | chromA_BP_start[k] <- tail(df$bed[df$bed$chrom==as.character(opt$chrA) & df$bed$end <= floor(bp_optimized$par[1]),]$start,1) 1238 | chromA_BP_end[k] <- tail(df$bed[df$bed$chrom==as.character(opt$chrA) & df$bed$end <= floor(bp_optimized$par[1]),]$end,1) 1239 | } 1240 | if (is.null(chromB_BP_start[k][[1]])) { 1241 | chromB_BP_start[k] <- tail(df$bed[df$bed$chrom==as.character(opt$chrB) & df$bed$end <= floor(bp_optimized$par[2]),]$start,1) 1242 | chromB_BP_end[k] <- tail(df$bed[df$bed$chrom==as.character(opt$chrB) & df$bed$end <= floor(bp_optimized$par[2]),]$end,1) 1243 | } 1244 | k = k+1 1245 | } 1246 | chromA_BP_start <- unlist(chromA_BP_start) 1247 | chromA_BP_end <- unlist(chromA_BP_end) 1248 | chromB_BP_start <- unlist(chromB_BP_start) 1249 | chromB_BP_end <- unlist(chromB_BP_end) 1250 | param <- cbind(param,chromA_BP_start,chromA_BP_end,chromB_BP_start,chromB_BP_end) 1251 | write.table(param,file=paste0(opt$prefix,"_",opt$chrA,"_",opt$chrB,".RE_BreakPoints.txt"),col.names=T,row.names=F,sep="\t",quote=F) 1252 | } 1253 | } else { 1254 | print ("Breakpoints have not enough resolution support!") 1255 | } 1256 | } else { 1257 | print ("No enriched translocation box found") 1258 | } 1259 | 1260 | ## Orginizing results ## 1261 | system("mkdir Lower_Resolution_HiC_Data", wait=T) 1262 | files_matrix <- list.files(".", pattern=paste0(chromA,"_",chromB,".matrix")) 1263 | files_bed <- list.files(".", pattern=paste0(chromA,"_",chromB,"_abs.bed")) 1264 | for(i in 1:length(files_matrix)) { 1265 | system(paste0("mv ",files_matrix[i]," Lower_Resolution_HiC_Data/"),wait=T) 1266 | system(paste0("mv ",files_bed[i]," Lower_Resolution_HiC_Data/"),wait=T) 1267 | } 1268 | 1269 | system("mkdir Translocations", wait=T) 1270 | system("mkdir Translocations/juicebox_files", wait=T) 1271 | system("mkdir Translocations/Details", wait=T) 1272 | f <- list.files(".",pattern="preCluster.txt") 1273 | if (file.exists(f[1])) { 1274 | system("mv *.preCluster.txt Translocations/") 1275 | system("mv *.Details.txt Translocations/Details/") 1276 | } 1277 | 1278 | f <- list.files(".",pattern="Translocations_jcbx.txt") 1279 | if (length(f) > 0) { 1280 | system("mv *.Translocations_jcbx.txt Translocations/juicebox_files/") 1281 | } 1282 | 1283 | f <- list.files(".",pattern="MultiResolution_Filtered.Translocation.txt") 1284 | if (file.exists(f[1])) { 1285 | system("mkdir MultiResolution_supported_Translocations") 1286 | system("mv *.MultiResolution_Filtered.Translocation.txt MultiResolution_supported_Translocations/") 1287 | } 1288 | 1289 | f <- list.files(".",pattern="_REfrags") 1290 | if (length(f) > 0) { 1291 | system("mkdir RE_Level_Translocation") 1292 | system(paste0("mv ",prefix,"_",chromA,"_",chromB,"* RE_Level_Translocation/")) 1293 | system(paste0("mv *.hmm.states.bed RE_Level_Translocation/")) 1294 | 1295 | } 1296 | 1297 | print(paste0("Ended ",Sys.time())) 1298 | 1299 | sink() 1300 | --------------------------------------------------------------------------------