├── LICENSE.md ├── MuSiC2-0.2.tar.gz ├── README.md ├── bin ├── bedtools ├── calcRoiCovg ├── dendrix ├── joinx ├── music2 ├── permutationTestDendrix └── samtools ├── dist.ini ├── example ├── dendrix │ ├── GBM_mutationsAndCN │ ├── Lung_mutations │ ├── analyzed_genes │ └── mutation_matrix └── smg │ ├── example.bam_list │ ├── example.input.maf │ ├── example.roi_file │ ├── example.run-coverage-command │ └── roi_file ├── lib └── TGI │ └── MuSiC2 │ ├── Bmr.pm │ ├── CalcBmr.pm │ ├── CalcBmrModifier.pm │ ├── CalcBmr_music2.pm │ ├── CalcBmr_prag.pm │ ├── CalcCovg.pm │ ├── CalcCovgHelper.pm │ ├── CalcWigCovg.pm │ ├── CalcWindowMaf.pm │ ├── CalcWindowRoi.pm │ ├── ClinicalCorrelation.pm │ ├── ClinicalCorrelation.pm.R │ ├── Complicated.pm │ ├── Cosmic.pm │ ├── Dendrix.pm │ ├── DendrixPermutation.pm │ ├── LongGeneFilter.pm │ ├── Pfam.pm │ ├── Proximity.pm │ ├── ProximityWindow.pm │ ├── Smg.pm │ ├── Smg.pm.R │ ├── Smg.pm.qqplot.R │ ├── Smg.pm.qqplot.correct.R │ ├── Survival.pm │ ├── Survival.pm.R │ └── correlation.pl └── t ├── TESTING.md └── foo.t /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2013-2017 MuSiC2 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MuSiC2-0.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ding-lab/MuSiC2/ccc623c200cafbcc0cca9459c145e7c1802285b2/MuSiC2-0.2.tar.gz -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | MuSiC2 2 | =========== 3 | Mutational Significance in Cancer (Cancer Mutation Analysis) version 2. 4 | 5 | Usage 6 | ----- 7 | 8 | Program: music2 - Mutational Significance in Cancer (Cancer Mutation Analysis) version 2. 9 | Version: V0.2 10 | Author: Beifang Niu && Matthew Wyczalkowski 11 | 12 | Usage: music2 [options] 13 | 14 | Key commands: 15 | 16 | bmr ... Calculate gene coverages and background mutation rates. 17 | smg Identify significantly mutated genes. 18 | long-gene-filter Find conditions for which significance status is no longer related to gene size. 19 | survival Create survival plots and P-values for clinical and mutational phenotypes. 20 | clinical-correlation Correlate phenotypic traits against mutated genes, or against individual variants. 21 | cosmic Match a list of variants to those in COSMIC, and highlight druggable targets. 22 | cosmic-omim Compare the amino acid changes of supplied mutations to COSMIC and OMIM databases. 
23 | dendrix Discovery of mutated driver pathways in cancer using only mutation data. 24 | dendrix-permutation ... Run the permutation test for Dendrix. 25 | mutation-relation Identify relationships of mutation concurrency or mutual exclusivity in genes across cases. 26 | path-scan Find significantly mutated pathways in a cohort given a list of somatic mutations. 27 | pfam Add Pfam annotation to a MAF file. 28 | proximity Perform a proximity analysis on a list of mutations. 29 | proximity-window Perform a sliding window proximity analysis on a list of mutations. 30 | 31 | help this message 32 | 33 | 34 | Install (Ubuntu & CentOS) 35 | ------- 36 | Note: We provide binaries for joinx, samtools, calcRoiCovg and bedtools in the /bin directory; they were compiled on CentOS and tested on CentOS and Ubuntu. 37 | 38 | Prerequisites for Ubuntu: 39 | 40 | sudo apt-get install build-essential \ 41 | git \ 42 | cmake \ 43 | curl \ 44 | cpanminus \ 45 | libbz2-dev \ 46 | libgtest-dev \ 47 | libbam-dev \ 48 | zlib1g-dev 49 | 50 | Prerequisites for CentOS: 51 | 52 | sudo yum install yum-utils 53 | sudo yum install curl 54 | sudo yum install git 55 | sudo yum install cmake 56 | sudo yum groupinstall "Development Tools" 57 | sudo yum update -y nss curl libcurl 58 | sudo yum install perl-devel 59 | sudo yum install perl-CPAN 60 | sudo yum install bzip2-libs 61 | sudo yum install zlib-devel 62 | sudo curl -L http://cpanmin.us | perl - --sudo App::cpanminus 63 | 64 | 65 | Switch to a C++11 compiler on CentOS (required for joinx installation) 66 | 67 | Reference 68 | > https://www.softwarecollections.org/en/scls/rhscl/devtoolset-3/ 69 | 70 | 1. Install a package with repository for your system: 71 | On CentOS, install package centos-release-scl available in CentOS repository: 72 | $ sudo yum install centos-release-scl 73 | On RHEL, enable RHSCL repository for your system: 74 | $ sudo yum-config-manager --enable rhel-server-rhscl-7-rpms 75 | 2. 
Install the collection: 76 | $ sudo yum install devtoolset-3 77 | 3. Start using software collections: 78 | $ scl enable devtoolset-3 bash 79 | Set environment variables (optional) 80 | CC=gcc CXX=g++ 81 | 82 | Install samtools (download samtools-0.1.19 from SourceForge: http://sourceforge.net/projects/samtools/files/samtools/0.1.19) 83 | 84 | tar jxf samtools-0.1.19.tar.bz2 85 | cd samtools-0.1.19 86 | make 87 | export SAMTOOLS_DIR=$PWD 88 | sudo mv samtools /usr/local/bin/ 89 | 90 | Install calcRoiCovg 91 | 92 | git clone https://github.com/Beifang/calcRoiCovg.git 93 | cd calcRoiCovg 94 | make 95 | sudo mv calcRoiCovg /usr/local/bin/ 96 | 97 | Install bedtools 98 | 99 | wget https://github.com/arq5x/bedtools2/archive/v2.27.1.tar.gz 100 | tar -zxvf v2.27.1.tar.gz 101 | cd bedtools2-2.27.1/ 102 | make 103 | sudo mv ./bin/* /usr/local/bin/ 104 | 105 | Install joinx 106 | 107 | git clone --recursive https://github.com/genome/joinx.git 108 | cd joinx 109 | mkdir build 110 | cd build 111 | cmake .. 112 | make deps 113 | make 114 | sudo make install 115 | 116 | Fix joinx bugs 117 | 118 | StreamLineSource.cpp 119 | bool StreamLineSource::getline(std::string& line) { 120 | std::getline(_in, line); 121 | return true; 122 | } 123 | 124 | Install Perl modules 125 | 126 | sudo cpanm Test::Most 127 | sudo cpanm Statistics::Descriptive 128 | sudo cpanm Statistics::Distributions 129 | sudo cpanm Bit::Vector 130 | 131 | Install MuSiC2 package 132 | 133 | git clone https://github.com/ding-lab/MuSiC2 134 | cd MuSiC2 135 | sudo cpanm MuSiC2-#.#.tar.gz 136 | 137 | Note: Python must be installed to run music2 dendrix and dendrix-permutation 138 | 139 | 140 | Example 141 | ------- 142 | 143 | 1. 
smg test example: 144 | 145 | Make a directory for the MuSiC2 smg run 146 | 147 | mkdir music2_smg_running 148 | cd music2_smg_running 149 | 150 | Make subdirectories where the runtime logs can be written 151 | 152 | mkdir logs 153 | mkdir logs/calc_covg 154 | 155 | Generate the list of coverage-calculation commands 156 | 157 | music2 bmr calc-covg --roi-file ./example/smg/example.roi_file --reference-sequence /reference_dir/ucsc.hg19.fa --bam-list ./example/smg/example.bam_list --output-dir . --cmd-list-file example.run-coverage-command 158 | 159 | Run the ROI coverage calculation for each sample 160 | 161 | bash example.run-coverage-command 162 | 163 | Run bmr calc-covg again to get gene coverage 164 | 165 | music2 bmr calc-covg --roi-file ./example/smg/example.roi_file --reference-sequence /reference_dir/ucsc.hg19.fa --bam-list ./example/smg/example.bam_list --output-dir . 166 | 167 | Run calc-bmr to measure overall and per-gene mutation rates. This step may need extra memory 168 | 169 | music2 bmr calc-bmr --roi-file ./example/smg/example.roi_file --reference-sequence /reference_dir/ucsc.hg19.fa --bam-list ./example/smg/example.bam_list --maf-file ./example/smg/example.input.maf --output-dir . --show-skipped 170 | 171 | Run the SMG test using an FDR threshold appropriate for these mutation rates 172 | 173 | music2 smg --gene-mr-file gene_mrs --output-file smgs --max-fdr 0.05 --processors 1 174 | 175 | 2. dendrix example: 176 | 177 | Runs the MCMC for 1000000 iterations, sampling sets of size 3 every 1000 178 | iterations. 
Produces two files (since 1 experiment is run): 179 | 180 | music2 dendrix --mutations-file example/dendrix/mutation_matrix --set-size 3 --minimum-freq 1 \ 181 | --number-interations 1000000 --analyzed-genes-file example/dendrix/analyzed_genes \ 182 | --number-experiments 1 --step-length 1000 183 | 184 | If you want to compute the p-value for the second set having weight 47, you can run: 185 | 186 | music2 dendrix-permutation --mutations-file example/dendrix/mutation_matrix --set-size 3 --minimum-freq 1 \ 187 | --number-interations 1000000 --analyzed-genes-file example/dendrix/analyzed_genes \ 188 | --number-permutations 100 --value-tested 47 --rank 2 189 | 190 | SUPPORT 191 | ------- 192 | 193 | If you have any questions, please contact one or more of the following folks: 194 | 195 | Beifang Niu 196 | Li Ding 197 | -------------------------------------------------------------------------------- /bin/bedtools: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ding-lab/MuSiC2/ccc623c200cafbcc0cca9459c145e7c1802285b2/bin/bedtools -------------------------------------------------------------------------------- /bin/calcRoiCovg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ding-lab/MuSiC2/ccc623c200cafbcc0cca9459c145e7c1802285b2/bin/calcRoiCovg -------------------------------------------------------------------------------- /bin/dendrix: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | #Copyright 2010,2011,2012,2013 Brown University, Providence, RI. 
4 | 5 | #All Rights Reserved 6 | 7 | #Permission to use, copy, modify, and distribute this software and its 8 | #documentation for any purpose other than its incorporation into a 9 | #commercial product is hereby granted without fee, provided that the 10 | #above copyright notice appear in all copies and that both that 11 | #copyright notice and this permission notice appear in supporting 12 | #documentation, and that the name of Brown University not be used in 13 | #advertising or publicity pertaining to distribution of the software 14 | #without specific, written prior permission. 15 | 16 | #BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, 17 | #INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY 18 | #PARTICULAR PURPOSE. IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR 19 | #ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES 20 | #WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN 21 | #ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF 22 | #OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
23 | #http://cs.brown.edu/people/braphael/software.html 24 | 25 | import sys 26 | import os 27 | 28 | import random 29 | import math 30 | 31 | def measure(genes_collection1, genes_collection2): 32 | #coverage of genes in genes_collection1 33 | out1 = 0 34 | #total number of mutations in genes_collection1 35 | inside1 = 0 36 | #coverage of genes_collection2 37 | out2 = 0 38 | #total number of mutations in genes_collection2 39 | inside2 = 0 40 | for sampleID in sample_mutatedGenes: 41 | genes_in_sample = sample_mutatedGenes[sampleID] 42 | inside_genes1 = genes_collection1.intersection(genes_in_sample) 43 | if len(inside_genes1)>0: 44 | out1 += 1 45 | num_ig1 = len(inside_genes1) 46 | inside1 += num_ig1 47 | inside_genes2 = genes_collection2.intersection(genes_in_sample) 48 | if len(inside_genes2)>0: 49 | out2 += 1 50 | num_ig2 = len(inside_genes2) 51 | inside2 += num_ig2 52 | c = 0.5 53 | return c*float(2*out1 - inside1 - (2*out2 - inside2)) 54 | 55 | if len(sys.argv)<8: 56 | print "Usage: python Dendrix.py mutations_file K minFreqGene number_iterations analyzed_genes_file num_exper step_length" 57 | print "mutations_file: input file with mutation matrix (see README.txt for description)" 58 | print "K: size of the sets to be sampled" 59 | print "minFreqGene: minimum frequency of mutation for a gene to be considered in the analysis" 60 | print "number_iterations: number of iterations of the MCMC" 61 | print "analyzed_genes_file: file with list of analyzed genes (see README.txt for description)" 62 | print "num_exper: number of times the experiment is going to be run (see README.txt for description)" 63 | print "step_length: number of iterations of the MCMC between two samples" 64 | exit(0) 65 | 66 | genes = list() 67 | 68 | gene_mutatedSamples = dict() 69 | 70 | sample_mut_f = open(sys.argv[1] ,'r') 71 | 72 | K = int(sys.argv[2]) 73 | 74 | mAS_perGene=int(sys.argv[3]) 75 | 76 | num_iterations = int(sys.argv[4]) 77 | 78 | analyzed_genes_file = open(sys.argv[5],'r') 79 
| 80 | all_samples = set() 81 | 82 | sample_mutatedGenes = dict() 83 | 84 | num_exper = int(sys.argv[6]) 85 | 86 | step_length = int(sys.argv[7]) 87 | 88 | print "Load genes..." 89 | for line in analyzed_genes_file: 90 | v =line.split() 91 | genes.append(v[0]) 92 | gene_mutatedSamples[v[0]]=set() 93 | analyzed_genes_file.close() 94 | 95 | print "Loading mutations..." 96 | 97 | for line in sample_mut_f: 98 | v = line.split("\t") 99 | sampleID = v[0] 100 | all_samples.add( sampleID ) 101 | sample_mutatedGenes[sampleID]=set() 102 | for i in range(len(v) - 1): 103 | gene = v[i+1].strip("\n") 104 | if gene in genes: 105 | if gene not in gene_mutatedSamples: 106 | gene_mutatedSamples[gene]=set() 107 | gene_mutatedSamples[gene].add(sampleID) 108 | sample_mutatedGenes[sampleID].add(gene) 109 | 110 | sample_mut_f.close() 111 | 112 | sample_numMut = dict() 113 | for sampleID in sample_mutatedGenes: 114 | tmp_numMut = len(sample_mutatedGenes[sampleID]) 115 | sample_numMut[sampleID] = float(tmp_numMut) 116 | 117 | genes_toRemove = set() 118 | for gene in gene_mutatedSamples: 119 | if len(gene_mutatedSamples[gene]) 0: 160 | expon = 0 161 | 162 | prob = min(1.0, math.exp(expon)) 163 | coin = random.random() 164 | if coin <= prob: 165 | solution = next_solution 166 | 167 | if (itera+1) % step_length ==0: 168 | frozen_tmp = frozenset(solution) 169 | if frozen_tmp not in num_visits: 170 | num_visits[frozen_tmp]=0 171 | num_visits[frozen_tmp]+=1 172 | del frozen_tmp 173 | 174 | to_sort=list() 175 | most_visited_file = open("sets_frequencyOrder_experiment"+str(exp_n)+".txt",'w') 176 | for frozen_tmp in num_visits: 177 | to_sort.append([num_visits[frozen_tmp], frozen_tmp]) 178 | to_sort.sort() 179 | most_visited_file.write("Total visited: "+str(len(to_sort))+"\n") 180 | to_sort_weight = list() 181 | #only the 1000 most sampled sets are reported 182 | for i in range(len(to_sort)): 183 | if i < 1000: 184 | most_visited_file.write(str(to_sort[-(i+1)][0])+"\t") 185 | 
genes_list=list(to_sort[-(i+1)][1]) 186 | genes_list.sort() 187 | tmp_tot = set() 188 | sum = 0 189 | tmp_str = "" 190 | for j in range(len(genes_list)): 191 | tmp_tot.update(gene_mutatedSamples[genes_list[j]]) 192 | sum += len(gene_mutatedSamples[genes_list[j]]) 193 | tmp_str = tmp_str+genes_list[j]+"\t" 194 | if i < 1000: 195 | most_visited_file.write(genes_list[j]+"\t") 196 | tmp_weight = 2 * len(tmp_tot) - sum 197 | to_sort_weight.append([tmp_weight, tmp_str, str(to_sort[-(i+1)][0])]) 198 | if i < 1000: 199 | most_visited_file.write(str(tmp_weight)+"\n") 200 | most_visited_file.close() 201 | 202 | #only the 1000 sets with highest weight are reported 203 | to_sort_weight.sort() 204 | highest_weight_file = open("sets_weightOrder_experiment"+str(exp_n)+".txt",'w') 205 | for i in range(min(len(to_sort_weight),1000)): 206 | highest_weight_file.write(str(to_sort_weight[-(i+1)][0])+"\t"+to_sort_weight[-(i+1)][1]+"\t"+to_sort_weight[-(i+1)][2]+"\n") 207 | highest_weight_file.close() 208 | -------------------------------------------------------------------------------- /bin/joinx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ding-lab/MuSiC2/ccc623c200cafbcc0cca9459c145e7c1802285b2/bin/joinx -------------------------------------------------------------------------------- /bin/music2: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | #---------------------------------- 3 | # $Authors: Beifang Niu $ 4 | # $Date: 2013-08-08 13:22:08 -0500 (Thu Aug 8 13:22:08 CDT 2013) $ 5 | # $Revision: $ 6 | # $URL: $ 7 | #---------------------------------- 8 | use strict; 9 | use warnings; 10 | our $VERSION = 'v0.2'; 11 | 12 | use Getopt::Long; 13 | 14 | use TGI::MuSiC2::Bmr; 15 | use TGI::MuSiC2::Smg; 16 | use TGI::MuSiC2::Dendrix; 17 | use TGI::MuSiC2::DendrixPermutation; 18 | use TGI::MuSiC2::Survival; 19 | use TGI::MuSiC2::ClinicalCorrelation; 20 | use 
TGI::MuSiC2::LongGeneFilter; 21 | 22 | my $sub_cmd = shift; 23 | my %cmds = map{ 24 | ($_, 1) 25 | } qw( bmr 26 | clinical-correlation 27 | cosmic 28 | cosmic-omim 29 | create-visualizations 30 | data 31 | dendrix 32 | dendrix-permutation 33 | long-gene-filter 34 | mutation-relation 35 | path-scan 36 | pfam 37 | play 38 | plot 39 | proximity 40 | proximity-window 41 | smg 42 | survival 43 | help ); 44 | 45 | unless (defined $sub_cmd) { die help_text(); }; 46 | unless (exists $cmds{$sub_cmd}) { 47 | warn ' Please give valid sub command ! ', "\n"; 48 | die help_text(); 49 | } 50 | 51 | SWITCH:{ 52 | $sub_cmd eq 'bmr' && do { my $sub_cmd2 = shift; TGI::MuSiC2::Bmr->new( $sub_cmd2 ); last SWITCH; }; 53 | $sub_cmd eq 'smg' && do { TGI::MuSiC2::Smg->new(); last SWITCH; }; 54 | $sub_cmd eq 'dendrix' && do { TGI::MuSiC2::Dendrix->new(); last SWITCH; }; 55 | $sub_cmd eq 'dendrix-permutation' && do { TGI::MuSiC2::DendrixPermutation->new(); last SWITCH; }; 56 | $sub_cmd eq 'survival' && do { TGI::MuSiC2::Survival->new(); last SWITCH; }; 57 | $sub_cmd eq 'clinical-correlation' && do { TGI::MuSiC2::ClinicalCorrelation->new(); last SWITCH; }; 58 | $sub_cmd eq 'long-gene-filter' && do { TGI::MuSiC2::LongGeneFilter->new(); last SWITCH; }; 59 | 60 | $sub_cmd eq 'help' && do { die help_text(); last SWITCH; }; 61 | } 62 | 63 | sub help_text { 64 | return < [options] 69 | 70 | Key commands: 71 | 72 | bmr ... Calculate gene coverages and background mutation rates. 73 | smg Identify significantly mutated genes. 74 | long-gene-filter Find conditions for which significance status is no longer related to gene size. 75 | survival Create survival plots and P-values for clinical and mutational phenotypes. 76 | clinical-correlation Correlate phenotypic traits against mutated genes, or against individual variants. 77 | cosmic Match a list of variants to those in COSMIC, and highlight druggable targets. 
78 | cosmic-omim Compare the amino acid changes of supplied mutations to COSMIC and OMIM databases. 79 | dendrix Discovery of mutated driver pathways in cancer using only mutation data. 80 | dendrix-permutation ... Run the permutation test for Dendrix. 81 | mutation-relation Identify relationships of mutation concurrency or mutual exclusivity in genes across cases. 82 | path-scan Find significantly mutated pathways in a cohort given a list of somatic mutations. 83 | pfam Add Pfam annotation to a MAF file. 84 | proximity Perform a proximity analysis on a list of mutations. 85 | proximity-window Perform a sliding window proximity analysis on a list of mutations. 86 | 87 | help this message 88 | 89 | SUPPORT 90 | For user support please mail beifang.cn\@gmail.com & ckandoth\@gmail.com & lding\@wustl.edu. 91 | 92 | HELP 93 | } 94 | 95 | 1; 96 | 97 | # Perl POD stuff: https://juerd.nl/site.plp/perlpodtut 98 | 99 | __END__ 100 | 101 | =head1 NAME 102 | 103 | MuSiC2 - Mutational Significance in Cancer (Cancer Mutation Analysis) version 2. 104 | 105 | =head1 SYNOPSIS 106 | 107 | music2 --help; 108 | 109 | =head1 DESCRIPTION 110 | 111 | MuSiC2 - Mutational Significance in Cancer (Cancer Mutation Analysis) version 2. 112 | 113 | =head1 AUTHOR 114 | 115 | Beifang Niu E<lt>beifang.cn@gmail.comE<gt> 116 | 117 | =head1 SEE ALSO 118 | 119 | https://github.com/ding-lab/MuSiC2 120 | 121 | =head1 LICENSE 122 | 123 | This library is free software with MIT licence; you can redistribute it and/or modify 124 | it under the same terms as Perl itself. 125 | 126 | =cut 127 | 128 | -------------------------------------------------------------------------------- /bin/permutationTestDendrix: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | #Copyright 2010,2011,2012,2013 Brown University, Providence, RI. 
4 | 5 | #All Rights Reserved 6 | 7 | #Permission to use, copy, modify, and distribute this software and its 8 | #documentation for any purpose other than its incorporation into a 9 | #commercial product is hereby granted without fee, provided that the 10 | #above copyright notice appear in all copies and that both that 11 | #copyright notice and this permission notice appear in supporting 12 | #documentation, and that the name of Brown University not be used in 13 | #advertising or publicity pertaining to distribution of the software 14 | #without specific, written prior permission. 15 | 16 | #BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, 17 | #INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY 18 | #PARTICULAR PURPOSE. IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR 19 | #ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES 20 | #WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN 21 | #ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF 22 | #OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
23 | #http://cs.brown.edu/people/braphael/software.html 24 | 25 | import sys 26 | import os 27 | 28 | import random 29 | import math 30 | 31 | def measure(genes_collection_1, genes_collection_2): 32 | out_g1 = 0 33 | inside_g1 = 0 34 | out_g2 = 0 35 | inside_g2 = 0 36 | #out = number of covered samples 37 | for sampleID in sample_mutatedGenes: 38 | genes_in_sample = sample_mutatedGenes[sampleID] 39 | inside_genes_1 = genes_collection_1.intersection(genes_in_sample) 40 | if len(inside_genes_1)>0: 41 | out_g1 += 1 42 | num_ig_1 = len(inside_genes_1) 43 | inside_g1 += num_ig_1 44 | inside_genes_2 = genes_collection_2.intersection(genes_in_sample) 45 | if len(inside_genes_2)>0: 46 | out_g2 += 1 47 | num_ig_2 = len(inside_genes_2) 48 | inside_g2 += num_ig_2 49 | c = 0.5 50 | return c*float(2*out_g1 - inside_g1 - (2*out_g2 - inside_g2)) 51 | 52 | def measure(genes_solution): 53 | samples_out = 0 54 | samples_inside = 0 55 | for sampleID in sample_mutatedGenes: 56 | genes_in_sample = sample_mutatedGenes[sampleID] 57 | inside_genes_tmp = genes_solution.intersection(genes_in_sample) 58 | if len(inside_genes_tmp)>0: 59 | samples_out += 1 60 | num_ig_tmp = len(inside_genes_tmp) 61 | samples_inside += num_ig_tmp 62 | return float(2*samples_out - samples_inside) 63 | 64 | 65 | if len(sys.argv)<9: 66 | print "Usage: python permutationTestDendrix.py mutations_file K minFreqGene number_iterations analyzed_genes_file num_permutations value_tested rank" 67 | print "mutations_file: input file with mutation matrix (see README.txt for description)" 68 | print "K: size of the sets to be sampled" 69 | print "minFreqGene: minimum frequency of mutation for a gene to be considered in the analysis" 70 | print "number_iterations: number of iterations of the MCMC" 71 | print "analyzed_genes_file: file with list of analyzed genes (see README.txt for description)" 72 | print "num_permutations: number of permuted datasets to consider in the permutation test" 73 | print "value_tested: 
value of the weight for which the permutation test will be run" 74 | print "rank: rank of the weight tested (value_tested) in the sets_weightOrder_experiment.txt file" 75 | exit(0) 76 | 77 | genes = list() 78 | 79 | gene_mutatedSamples = dict() 80 | 81 | sample_mut_f = open(sys.argv[1] ,'r') 82 | 83 | K = int(sys.argv[2]) 84 | 85 | mAS_perGene=int(sys.argv[3]) 86 | 87 | num_iterations = int(sys.argv[4]) 88 | 89 | analyzed_genes_file = open(sys.argv[5],'r') 90 | 91 | all_samples = set() 92 | 93 | sample_mutatedGenes = dict() 94 | 95 | num_exper = int(sys.argv[6]) 96 | 97 | statistic = float(sys.argv[7]) 98 | 99 | rank = int(sys.argv[8]) 100 | 101 | print "Load genes..." 102 | for line in analyzed_genes_file: 103 | v =line.split() 104 | genes.append(v[0]) 105 | gene_mutatedSamples[v[0]]=set() 106 | analyzed_genes_file.close() 107 | 108 | print "Loading mutations..." 109 | 110 | for line in sample_mut_f: 111 | v = line.split("\t") 112 | sampleID = v[0] 113 | all_samples.add( sampleID ) 114 | sample_mutatedGenes[sampleID]=set() 115 | for i in range(len(v) - 1): 116 | gene = v[i+1].strip("\n") 117 | if gene in genes: 118 | if gene not in gene_mutatedSamples: 119 | gene_mutatedSamples[gene]=set() 120 | gene_mutatedSamples[gene].add(sampleID) 121 | sample_mutatedGenes[sampleID].add(gene) 122 | 123 | sample_mut_f.close() 124 | 125 | genes_toRemove = set() 126 | for gene in gene_mutatedSamples: 127 | if len(gene_mutatedSamples[gene]) 0: 174 | expon = 0 175 | prob = min(1.0, math.exp(expon)) 176 | coin = random.random() 177 | if coin <= prob: 178 | solution = next_solution 179 | curr_measure = measure(solution) 180 | if curr_measure>= statistic: 181 | solution_to_add = frozenset(solution) 182 | if solution_to_add not in seen_sols: 183 | seen_sols.append(solution_to_add) 184 | if len(seen_sols)>= rank: 185 | num_found+=1 186 | pval = float(num_found)/float(num_exper) 187 | pval_f.write(str(pval)+"\n") 188 | pval_f.close() 189 | 
-------------------------------------------------------------------------------- /bin/samtools: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ding-lab/MuSiC2/ccc623c200cafbcc0cca9459c145e7c1802285b2/bin/samtools -------------------------------------------------------------------------------- /dist.ini: -------------------------------------------------------------------------------- 1 | name = MuSiC2 2 | author = Beifang Niu 3 | author = Cyriac Kandoth 4 | author = Adam Scott 5 | author = Matthew Wyczalkowski 6 | version = 0.2 7 | license = Perl_5 8 | copyright_holder = McDonnell Genome Institute at Washington University 9 | copyright_year = 2017 10 | abstract = Identifying mutational significance in cancer genomes. 11 | 12 | [@Basic] 13 | 14 | -------------------------------------------------------------------------------- /example/dendrix/GBM_mutationsAndCN: -------------------------------------------------------------------------------- 1 | TCGA-06-0132 EGFR 2 | TCGA-06-0184 PTEN TP53 RB1 NF1 3 | TCGA-02-0089 ERBB2 CDKN2A TP53 ELAVL2 EGFR IFNA21 PIK3R1 MTAP ROS1 BCL2 4 | TCGA-02-0086 CDKN2B NF1 5 | TCGA-02-0085 CDKN2B CDKN2A ELAVL2 NRAS MDM2 MTAP 6 | TCGA-06-0174 PDGFRA DTX3 EGFR FAM119B MARCH9 CHIC2 TSFM AVIL GLI1 CYP27B1 FIP1L1 SLC26A10 CTDSP2 OS9 LOC402176 SEC61G MDM2 CENTG1 BMP6 METTL1 GSX2 GEFT TSPAN31 CDK4 PIP4K2C 7 | TCGA-02-0006 OS9 CENTG1 DTX3 TSFM METTL1 AVIL PTEN GEFT GLI1 FAM119B CYP27B1 CTDSP2 CDK4 MDM2 TSPAN31 MARCH9 SLC26A10 PIP4K2C 8 | TCGA-06-0210 CHIC2 CDKN2B PDGFRA CDKN2A ZC3H11A BTG2 GSX2 IFNA21 SOX13 MTAP LRRN2 KIT LAX1 PIK3C2B LOC402176 PPP1R15B SOX1 MDM4 SNRPE 9 | TCGA-02-0080 ERBB2 TP53 CYP27B1 PRKD2 PI15 10 | TCGA-06-0214 CHIC2 CDKN2B PDGFRA CDKN2A ELAVL2 GSX2 EGFR PAX5 PTEN LOC402176 NF1 SEC61G MDM2 MTAP 11 | TCGA-06-0211 CHIC2 CDKN2B PDGFRA CDKN2A BIK ELAVL2 GSX2 IFNA21 EGFR KIT LOC402176 MAPK11 PPARA SEC61G MTAP 12 | TCGA-06-0187 CDKN2B CDKN2A EGFR PTEN SEC61G MDM2 
MTAP 13 | TCGA-06-0148 CDKN2B CDKN2A ELAVL2 DST EGFR IFNA21 MTAP 14 | TCGA-06-0185 CDKN2B CDKN2A STAT1 EGFR IFNA21 SEC61G MTAP PRKCB1 15 | TCGA-06-0147 MARK4 INSR 16 | TCGA-06-0188 CDKN2B CDKN2C TP53 ELAVL2 PTEN TNC 17 | TCGA-06-0145 CDKN2B DTX3 EGFR FAM119B MARCH9 TSFM AVIL MAPK1 CYP27B1 SLC26A10 CTDSP2 CHEK1 OS9 DST PRKCD SEC61G CENTG1 METTL1 GEFT TSPAN31 CDK4 PIP4K2C GLI1 18 | TCGA-06-0219 CDKN2B PTEN EGFR SEC61G 19 | TCGA-06-0143 CHIC2 CDKN2B PDGFRA CDKN2A ZC3H11A GSX2 MTAP SOX13 EGFR LRRN2 KIT LAX1 PIK3C2B LOC402176 PPP1R15B MDM4 SNRPE 20 | TCGA-06-0141 ITGB3 CDC123 KPNA2 21 | TCGA-02-0058 FABP7 RPS6KA2 TRIM2 TP53 22 | TCGA-02-0074 ELAVL4 PDGFRA DTX3 DOCK1 FAM119B MARCH9 CHIC2 EPS15 TSFM AVIL CYP27B1 PIK3R1 FIP1L1 SLC26A10 CTDSP2 OS9 CDKN2C LOC402176 CENTG1 TP53 METTL1 KIT GSX2 GEFT TSPAN31 CDK4 PIP4K2C 23 | TCGA-02-0075 TP53 MSH2 PTEN NF1 IQGAP1 TAOK3 24 | TCGA-06-0128 OS9 TP53 CENTG1 TSFM METTL1 AVIL TSPAN31 FAM119B CYP27B1 TCF12 CDK4 MARCH9 CTDSP2 25 | TCGA-06-0129 LIN7A OS9 CENTG1 TSFM METTL1 AVIL PIM1 MAPK14 GLI1 FAM119B CYP27B1 PIK3R1 TSPAN31 MDM4 CDK4 MAPK13 MARCH9 CTDSP2 26 | TCGA-06-0166 CDKN2B PIK3CA NF1 PTEN 27 | TCGA-02-0071 CDKN2B CDKN2A ZC3H11A ELAVL2 PTEN HUS1 SOX13 EGFR LRRN2 MTAP LAX1 PIK3C2B PPP1R15B CHI3L1 SEC61G MDM2 MDM4 SNRPE 28 | TCGA-06-0168 PTEN 29 | TCGA-06-0169 CDKN2B CDKN2A ELAVL2 EGFR BRCA2 SEC61G MTAP 30 | TCGA-02-0052 PTEN MDM2 RB1 31 | TCGA-02-0064 CDKN2B CDKN2A ELAVL2 EGFR IFNA21 MTAP 32 | TCGA-02-0054 CDKN2B CDKN2A TP53 EGFR SLC1A1 IFNA21 JAK2 MTAP DGKG 33 | TCGA-02-0055 OMG CENPF BAX PTEN RB1 NF1 TP53 PLAG1 34 | TCGA-06-0122 CDKN2B CDKN2A EPHA3 EGFR SEC61G TRIM3 MTAP 35 | TCGA-02-0057 CDKN2B CDKN2A ELAVL2 MLL4 36 | TCGA-02-0033 TP53 FGFR2 IRS1 NF1 RB1 37 | TCGA-02-0011 EPHA3 TP53 PIM1 38 | TCGA-02-0034 CDKN2B PTEN TP53 39 | TCGA-02-0038 CDKN2B CDKN2A EGFR PTEN IFNA21 SEC61G MTAP 40 | TCGA-02-0021 CDKN2B CDKN2A ELAVL2 EGFR RTN1 PTEN ST7 PIK3R1 SEC61G MTAP 41 | TCGA-02-0115 ERBB2 CDKN2B CDKN2A IFNA21 NF1 MTAP 42 | TCGA-02-0037 
CDKN2B WT1 DPYSL4 TP53 ELAVL2 PIK3CA DOCK1 PAX6 MGMT ABCC4 SLC1A2 43 | TCGA-06-0137 CDKN2B CDKN2A ELAVL2 EGFR IFNA21 SEC61G MTAP 44 | TCGA-06-0195 ZNF384 TP53 BIK RB1 EGFR HSPA8 H2AFX CBL PTEN MAPK11 PPARA ING4 SEC61G 45 | TCGA-06-0197 PTEN TP53 46 | TCGA-06-0190 PTEN WEE1 TP53 ELAVL2 NF1 47 | TCGA-06-0201 CDKN2B CDKN2A ELAVL2 NF1 FLT4 MTAP 48 | TCGA-06-0158 CDKN2B CDKN2A EGFR ITGB2 IFNA21 SEC61G MTAP 49 | TCGA-06-0206 OMG RB1 PTEN PRKCZ NF1 TP53 50 | TCGA-06-0209 CENTG1 SERPINA3 TSFM METTL1 EGFR TSPAN31 FAM119B CYP27B1 SEC61G CDK4 MDM2 MARCH9 GNAI1 51 | TCGA-06-0133 CDKN2B CDKN2A EGFR BCL11A SLC12A6 PIK3R1 SEC61G EP300 MTAP GLI1 52 | TCGA-06-0221 CDKN2B PDGFRA ELAVL2 DTX3 ATM FAM119B MARCH9 CHIC2 DDX10 TSFM MTAP AVIL TSPAN31 JAK2 CYP27B1 FIP1L1 SLC26A10 KDR OS9 CDKN2A LOC402176 PIP4K2C CENTG1 TP53 METTL1 KIT GSX2 GEFT IFNA21 CDK4 CTDSP2 CDK6 GATAD1 SLC1A1 53 | TCGA-02-0113 ERBB2 PTEN 54 | TCGA-06-0241 CHIC2 CDKN2B PDGFRA TP53 GSX2 KIT LOC402176 KDR 55 | TCGA-06-0154 CDKN2B CDKN2A APBB1IP EGFR IFNA21 SEC61G MTAP 56 | TCGA-06-0157 DTX3 EGFR FAM119B MARCH9 TSFM MDM4 AVIL SOX13 TSPAN31 CYP27B1 SLC26A10 CTDSP2 OS9 PIK3C2B SEC61G CENTG1 RPS6KA3 METTL1 PPP1R15B LRRN2 PTEN GEFT SYP CDK4 PIP4K2C 57 | TCGA-06-0156 CHIC2 CDKN2B PDGFRA CDKN2A TP53 GSX2 EGFR KIT PTEN LOC402176 MTAP KDR 58 | TCGA-06-0173 ELAVL2 DOCK1 EGFR FAM119B MARCH9 FLT1 TSFM SOX11 AVIL PDPN NTRK3 JAK2 CYP27B1 CTDSP2 CDKN2B DPYSL4 CDKN2A MGMT SEC61G MDM2 CENTG1 METTL1 TSPAN31 CDK4 SLC1A1 59 | TCGA-02-0060 OS9 ERBB3 CENTG1 DTX3 TSFM METTL1 AVIL GEFT CDK2 GLI1 FAM119B CYP27B1 CTDSP2 CDK4 MDM2 TSPAN31 MARCH9 SLC26A10 PIP4K2C FGFR1 60 | TCGA-06-0171 CDKN2B CDKN2A TRIM24 PIK3CA PTPN11 MTAP 61 | TCGA-06-0176 SNF1LK2 PHLPP TAF1 BCL11A RB1 PTEN BCAR1 62 | TCGA-06-0139 CES3 63 | TCGA-06-0138 OS9 CENTG1 DTX3 TSFM METTL1 AVIL EGFR PTEN GEFT CTDSP2 FAM119B CYP27B1 TSPAN31 CDK4 MDM2 MARCH9 SLC26A10 PIP4K2C 64 | TCGA-02-0069 ERBB2 CHIC2 PDGFRA APBB1IP GSX2 ITGB1 KIT DHTKD1 PRKCQ BAMBI LOC402176 USP6NL UPF2 PPP2R5D FIP1L1 
CDC123 RSU1 ZEB1 GATA3 KDR KRAS 65 | TCGA-02-0024 CHIC2 CAPZA2 PDGFRA CDKN2A TP53 GSX2 WNT2 ASZ1 ST7 KIT MET LOC402176 A2M CDKN2B MTAP CFTR KDR 66 | TCGA-02-0027 CDKN2B PTEN RPS6KA2 CDKN2A MTAP 67 | TCGA-06-0178 CENTG1 CDKN2A TSFM METTL1 AVIL MTAP NMBR FAM119B CYP27B1 PIK3R1 TSPAN31 CDK4 MARCH9 CTDSP2 68 | TCGA-02-0047 CDKN2B PIK3CA CDKN2A TSC2 MTAP 69 | TCGA-02-0046 OS9 SEC61G TP53 CENTG1 TSFM PIK3CA AVIL EGFR TSPAN31 FAM119B CYP27B1 METTL1 CDK4 MARCH9 CTDSP2 70 | TCGA-06-0130 TP53 71 | TCGA-02-0007 ZC3H11A IL1RL1 PTEN SOX13 EGFR LRRN2 RB1 LAX1 PIK3C2B PPP1R15B CHI3L1 SNRPE MDM4 BTG2 72 | TCGA-06-0124 SHH PTEN NF1 73 | TCGA-02-0102 CENTG1 DPYSL4 TSFM METTL1 AVIL DOCK1 EGFR SLC12A6 FAM119B CYP27B1 MGMT CDK4 MDM2 TSPAN31 MARCH9 CTDSP2 BCR 74 | TCGA-02-0003 OS9 SEC61G TP53 CENTG1 DTX3 TSFM METTL1 AVIL EGFR EPHA7 GEFT GLI1 FAM119B CYP27B1 PIK3R1 CTDSP2 CDK4 SLC26A10 TSPAN31 MARCH9 PIP4K2C 75 | TCGA-02-0001 CDKN2B CDKN2A TP53 ELAVL2 RB1 PRKDC PTEN NF1 P2RY5 76 | TCGA-06-0125 CDKN2B CDKN2A ELAVL2 DST EGFR PTEN IFNA21 CHAT SEC61G PTCH1 MTAP 77 | TCGA-06-0237 EGFR TP53 SEC61G 78 | TCGA-02-0107 ERBB2 NF1 DTX3 79 | TCGA-06-0208 CDKN2B CDKN2A ELAVL2 EGFR IFNA21 SEC61G MTAP 80 | TCGA-02-0116 ERBB2 CDKN2B CDKN2A PIK3CA EGFR IFNA21 LAMP1 SEC61G MTAP 81 | TCGA-06-0189 TNK2 CHD5 82 | TCGA-06-0126 CDKN2B CDKN2A TNFRSF1B EGFR COL1A2 SEC61G MTAP 83 | TCGA-02-0009 CDKN2B CDKN2A ELAVL2 FLI1 EGFR IFNA21 SEC61G MTAP GPR78 84 | TCGA-06-0213 TP53 TAF1 GRIA2 EPHA3 RB1 PTEN 85 | -------------------------------------------------------------------------------- /example/dendrix/Lung_mutations: -------------------------------------------------------------------------------- 1 | 17760 STK11 KRAS ATM PKN2 FLT1 2 | 16606 MAP3K3 3 | 17763 TP53 PIK3CG STK11 MAP3K6 RB1 NTRK3 AURKB PAK7 CDH4 KDR 4 | 16600 TGFBR2 TP53 ERBB4 MASTL RB1 PTEN BMPR1B ERG MAPK4 PRKD3 KRAS 5 | 17766 RBL1 MAP3K2 TRAIP MERTK PAK7 KRAS 6 | 17769 ACVR1B 7 | 16608 PDGFRA TP53 DDR2 PKN3 TP73L LRP1B PTPRD JAK2 INSRR PRKACB SKI CDH4 
KRAS 8 | 16760 EGFR MSH6 9 | 16845 PTEN 10 | 16841 EGFR 11 | 16684 ERBB4 12 | 16686 PDGFRA EPHA7 SMO BMPR1A EPHB1 CNTN4 FLT4 FLT1 FOS NTRK2 SMAD4 NF1 MAPK8 BRCA2 BUB1 BARD1 PRKCG CACNA2D2 FGFR4 PHOX2B ROR2 SRPK3 ALK PTPRG PTPRD BCL6 13 | 16965 EGFR 14 | 16630 NF1 15 | 17272 KRAS 16 | 16839 LRP1B RPSA 17 | 17727 MINK1 STK11 KRAS 18 | 16670 MYCN RBL1 MKNK2 PTCH1 CYSLTR2 HD 19 | 16770 ABL1 20 | 16678 PRKCI ACVR1C EGFR PRKDC RIPK1 LATS1 MAP2K6 FLT4 MYBL2 RAF1 FLT1 SMG1 CDC42BPG MAPK6 DMPK JAK1 JAG1 CDKN2B ERBB3 PIK3C2G TP53 LMTK2 GNAS DDR1 GSK3B LCK TFDP1 21 | 16772 PAK4 KRAS 22 | 16774 TGFBR1 23 | 17152 IRAK2 FER LMTK2 KRAS 24 | 17150 KRAS 25 | 17156 TEK ROBO1 ROBO2 FYN PRKDC HCK BMX TERT TSC2 STK38L 26 | 17154 KRAS 27 | 17158 NTRK1 KRAS 28 | 16979 CTNNB1 EGFR 29 | 16875 PDGFRA ATM CDC42BPA RB1 KRAS 30 | 16973 ZMYND10 NF1 31 | 16879 CDH11 TUSC4 JUNB ROBO1 EPHA5 ATM NTRK2 STK11 JAK3 MYBL2 KRAS 32 | 16975 TP53 EGFR RB1 33 | 17242 MAP3K9 TP53 CDKN2A NF1 GPC3 TSHR PAK3 ROR2 KDR KRAS 34 | 16660 DOCK3 INSR GRB14 LATS1 AKT3 INSRR ITK LRRK1 GRK7 MST1 JAK1 TERT ALS2CR2 KDR ERBB2 CDKN2A MRAS CAMKV LTK RIPK3 KRAS PTCH1 TP53 ALK HYAL1 ROBO2 IRS1 MAP3K12 PRKACA CYB561D2 PAK7 NTRK1 RALA 35 | 17228 EGFR 36 | 16664 PAK6 ZAP70 37 | 16668 PDGFRB CRKRS MAP4K4 MKNK1 EGFR PDGFRA PRKDC AKT1 ARAF RB1 TSC2 LRRK2 MAP3K2 NTRK1 NTRK2 JAK2 PIK3R2 RELA MYB ERBB2 ERBB3 CDKN2A FOS MLL MAST2 MERTK FGFR4 TYRO3 TP53 VAV1 MST1R KIT PTPRG STAT5A CDK5 38 | 17226 BRCA1 RPS6KA1 SMG1 GNAS MST4 NTRK1 PIK3CG INSRR TP53 39 | 17781 EPHA5 STK11 KRAS 40 | 16947 AKT1 TP53 NF1 41 | 17146 KRAS 42 | 16706 ZAK ERBB4 GNAS RET FES PAK4 FGFR4 43 | 16949 CBLB EPHA7 STK11 EPHA3 GSK3B CTNNB1 KLF6 ZMYND10 LTK AIFM3 44 | 16784 EGFR 45 | 16786 EGFR 46 | 16861 APC KRAS PIK3C2G TP53 ERBB3 47 | 16863 RBM5 EPHB6 STK11 EPHA3 MEN1 TP53 48 | 17258 EPHB6 PIK3CG 49 | 17316 APC EIF4G1 ATM KRAS 50 | 17250 PTPN11 KRAS 51 | 17238 KRAS 52 | 17234 EGFR 53 | 17733 WT1 ROBO1 ATM TP73 NTRK3 DMPK KDR KRAS 54 | 16951 RAP2B MYO3B TP53 MAP3K3 VAV2 
PRKCG INHBA LRP1B YES1 TSHR TSC2 PRKD1 AXL TSC1 55 | 17731 TP53 MAP3K6 KRAS 56 | 16953 PDGFRB CDKN2A RPS6KA4 RBL1 TNK2 PTPN11 NRAS MYO3B CNTN6 MAPK4 TP53 FLT4 PTK2 VAV3 YES1 PAK3 57 | 17735 TP53 CYSLTR2 EPHA3 MSH6 JAK2 FGFR2 58 | 17739 MYO3B CDKN2A EGFR CDC42BPA 59 | 17738 FH KSR2 STK11 TP53 60 | 16712 SEMA3F 61 | 16710 MDM2 EGFR 62 | 16792 BAX 63 | 16814 NRK VAV2 EPHB6 RPS6KA6 ERBB4 ROBO2 TP53 MAP3K15 MERTK TP73L CDK9 PTCH1 PAK3 64 | 17302 ATM EGFR 65 | 17730 PIK3CD PIK3R2 PTCH2 PAK6 HEMK1 RAPGEF1 66 | 17306 TP53 EGFR 67 | 17304 BMPR1B TP53 SHC1 KRAS 68 | 17308 IGF1R STK11 NOTCH1 RET IRS1 MAPK7 APC NTRK3 69 | 16963 EGFR 70 | 16648 SGK2 GNAS MKNK2 ATM PRKCE RB1 PLCG2 FES BCL6 71 | 16883 STK11 NRAS NTRK2 FBXW7 72 | 16881 RASSF2 TP53 STYK1 STK11 CCNG1 ROR2 KRAS 73 | 16640 MET HRAS MAST1 PLCG1 RAB43 IFRD2 TERT 74 | 17202 LOC400891 MAP4K3 STK11 VEGFC KRAS 75 | 17206 KRAS 76 | 16646 ABL1 MYO3B 77 | 17282 MAP4K1 PRKCH TP53 RBL1 NOTCH4 ERCC2 STK11 PFTK1 MAST2 NTRK3 78 | 17280 KRAS ATM NTRK1 ERBB4 FLT1 79 | 17728 TP53 NTRK2 KRAS 80 | 17286 PKN3 TP53 RCBTB2 KRAS 81 | 17042 CDH11 SEMA3B EPHB6 LMTK2 PTPN11 MSH6 TP53 PIK4CA SLA KRAS 82 | 17726 STAT6 CRKRS TP53 CSF1R STK11 PRKCE MSH6 PCTK2 TSC1 CHEK1 83 | 17288 STK11 NTRK3 PTPRD BRAF TP73L PIK3C3 84 | 16724 JAK3 PTCH1 GNAS NOTCH4 EGFR GSK3A INSRR PAK4 IRAK1 FGFR1 85 | 16925 TP53 EGFR 86 | 16927 TP53 EGFR 87 | 16921 TP53 EPHA5 EPHA4 NF2 TSHR LZTR1 KDR KRAS 88 | 16802 EPHX1 TP53 MERTK KLHL22 ZAK EPHA5 SHC3 EPHA3 RB1 LRP1B INSRR GRB7 KRAS 89 | 16800 KIAA1804 TP53 ERBB4 PRKDC 90 | 16696 KSR1 91 | 16929 ALK KIAA1303 STK11 LATS2 LRP1B APC KRAS 92 | 17784 RASSF2 CCNE1 PIM2 NF1 93 | 17330 STK11 FRK ATM KIAA1804 PTPRD NF1 NLK PIK3C3 KRAS 94 | 16734 STK32C TEC MAML2 STAT5B CSF1R STK11 EGFR NOTCH3 CDC2L5 SRMS TP53 PAK7 ROR2 95 | 16730 KDR 96 | 17759 MUSK FYN ERG TYK2 PRKX MAP3K6 TPTE JUP PCTK2 AURKA TERT BRCA2 SLC38A3 MAST1 MAST4 TP53 PIK3C3 LMTK3 GNAS CDC2L5 BRAF LZTR1 KIAA1804 97 | 16835 CDH11 DOCK3 EPHA3 EPHA2 PRKDC FOXO3 MAP2K5 CCNT2 
JUP JAK3 NF1 KSR2 FBXW7 ACVR2B PRKCG STK3 LTK HCK TP53 ITK RPS6KA6 VAV1 MET PTPRD CDC2L2 98 | 17055 MET NTRK1 INHBA ETS1 STK36 TERT KRAS 99 | 17298 KSR1 PTPRD EGFR TP53 100 | 16638 BAP1 EGFR MLL 101 | 17294 TP53 102 | 17750 TEC PTCH1 TP53 NF1 MERTK 103 | 17222 VHL LRP1B TP53 KRAS 104 | 17290 MYO3A SRC EPHB1 TP53 EPHB4 ERBB4 NOTCH4 EPHA5 EPHA3 MAP3K12 LRP1B MYO3B TFDP1 STK36 P2RXL1 LTK APC MYC PAK3 KDR KIAA1303 105 | 17754 TP53 ATR 106 | 17292 KRAS 107 | 16632 RBL2 MYCL1 LMTK2 RALB TP53 FEZF2 ERBB4 EPHA7 CDC25A ROCK1 PIK3C2A PIK3CG STK11 PRKCD LRP1B PIK3C2G STAT2 CDH4 ADRBK2 KRAS 108 | 16937 ERBB2 TP53 109 | 16636 STK11 110 | 17328 TP53 TFDP1 111 | 16823 JAG2 FGFR4 EVI1 TP53 112 | 17320 MAP3K3 PTPRD TP53 113 | 17324 NRK STK11 IFRD2 KRAS 114 | 17326 PIK3C2B MATK 115 | 16740 MYO3B EGFR 116 | 16742 PRKD3 117 | 16748 APC STK11 KRAS 118 | 16821 CDKN2C STK11 119 | 17060 CBLB TP53 STK11 RET PRKDC LRP1B CDC42BPA HCK FGFR2 BCL3 KRAS 120 | 16825 CDKN2A 121 | 16628 ACVRL1 ROR2 TP53 GNAS EPHA7 EPHA6 EIF4G1 MKNK1 MAP3K4 NOTCH3 MAP3K10 LRP1B PTPRD ZMYND10 DOCK3 PTCH2 APC PIK3C3 PRKCB1 KDR KRAS 122 | 16827 TP53 CROT PRKCE JAK2 CDC42BPA IRAK2 KRAS 123 | 16626 TEC ATM NOTCH1 PDK1 ANKK1 KRAS 124 | 17743 DOCK3 GCK STK11 MAST2 EPHA1 125 | 17741 STK11 ABL2 TP53 126 | 17746 CDH11 EPHB1 PINX1 ITK MAML1 NRAS LRP1B PFTK1 MAP3K10 KIAA1804 INSR PTK2 LTF IKBKE TP53 FGFR4 TFDP1 MAP3K15 127 | 17747 DOCK3 EGFR 128 | 17745 SMAD2 129 | 16901 TP53 LRP1B EPHB1 EPHA3 130 | 16907 TP53 KRAS 131 | 16905 TP53 EPHA7 PRKCG PFTK1 PRKAR1A NF1 JAG2 132 | 16909 MAP4K1 ROBO2 CDC25A ATM KIAA1804 LRP1B LTK APC FGFR1 KRAS 133 | 17216 XRCC1 AIFM3 KRAS 134 | 17210 MAP4K1 MYCL1 PDGFRA MYO3B RYK EPHA3 PRKDC INSRR TYK2 SMG1 ROBO2 MAML1 PIK3CA FYN NTRK2 PIK3CG INHBA LRP1B GAK NF1 IRAK2 FNDC3A MYB ACVRL1 PIK3C2A MAST4 PRKCQ TP53 FGFR4 PIK3C3 FGFR2 KRAS CCNB3 STK32A SEMA3B LMTK3 ZAK ROBO1 VAV3 DDR1 IRS1 NOTCH2 PTPRD PRKACB PIK3C2G 135 | 17194 PTEN RANBP9 136 | 17218 TP53 RBL2 EPHA7 SLC7A4 ATM EGFR PASK INHBA LRP1B CNTN6 
RIPK4 NF1 RHOB APC PRKCB1 PAK3 137 | 17190 CDKN2A TP53 LMTK2 STK11 RIN1 CASK TNNI3K 138 | 17777 KRAS IRAK1 TP53 LRP1B GATA1 139 | 17776 STK11 TP73L ERAS NTRK3 140 | 16750 STK11 JAK2 141 | 16754 INSR 142 | 16993 BAP1 CDC42BPA 143 | 17778 APC STK11 144 | 17174 PDGFRA TP53 EPHB4 LRRK2 PIK3CD EPHA3 TAOK3 PLCG2 MAPK6 NF1 FGFR1 SLC38A3 KRAS 145 | 17176 MYC STK11 TP53 146 | 16594 RASEF 147 | 17170 SYK FNDC3A CDKN2A TP53 148 | 16596 EPHA5 STK11 KRAS 149 | 17172 PTEN EGFR TP53 150 | 16616 STK11 151 | 16857 MYO3A STAT5B PTPRD CDK7 FLT4 MDM2 KRAS 152 | 16919 KRAS BARD1 153 | 16915 IKBKB GATA1 EGFR ACVR1B 154 | 16698 FH STK11 LRRK1 155 | 16859 IGF1R TEK STK10 ERAS NTRK1 MAP3K4 BRAF VEGFC KRAS 156 | 16913 TP53 MAML1 EGFR LATS1 CDC42BPG PRKACB 157 | 17260 FGFR4 KRAS 158 | 17262 MAP4K3 CDKN2A TP53 MAP2K5 EPHA5 PFTK2 FOXO3 INHBA STK24 MAPK6 GRB7 JAG2 KRAS 159 | 17264 KRAS 160 | 17268 EPHB6 SLC38A3 FRK GRK7 EPHA3 MAP3K4 NTRK3 SMAD4 PRKG1 TP53 RAB6C KRAS 161 | 17182 EGFR ZMYND10 162 | 17184 APC EGFR STAT3 163 | 17186 LTK MYO3A ATM PDGFRB KRAS 164 | -------------------------------------------------------------------------------- /example/dendrix/analyzed_genes: -------------------------------------------------------------------------------- 1 | IL1RL1 2 | ROS1 3 | PHLPP 4 | PIK3CA 5 | FGFR2 6 | SLC12A6 7 | WEE1 8 | KRAS 9 | ERBB2 10 | BRCA2 11 | PIK3R1 12 | TRIM24 13 | CENPF 14 | FLI1 15 | MDM2 16 | MDM4 17 | FGFR1 18 | PTPN11 19 | PTEN 20 | SYP 21 | KPNA2 22 | TCF12 23 | APBB1IP 24 | GRIA2 25 | MLL4 26 | FLT4 27 | CHD5 28 | SOX11 29 | GPR78 30 | NMBR 31 | CYP27B1 32 | NF1 33 | GNAI1 34 | CHEK1 35 | DST 36 | RTN1 37 | PAX5 38 | SHH 39 | BCAR1 40 | PRKCB1 41 | TNK2 42 | RPS6KA3 43 | TNFRSF1B 44 | TAF1 45 | PPP2R5D 46 | TRIM3 47 | TRIM2 48 | GLI1 49 | PDGFRA 50 | EPHA7 51 | CHAT 52 | CDC123 53 | TSC2 54 | A2M 55 | BCR 56 | FLT1 57 | STAT1 58 | MARK4 59 | BCL11A 60 | LAMP1 61 | TP53 62 | EP300 63 | TAOK3 64 | IRS1 65 | COL1A2 66 | PRKD2 67 | DTX3 68 | MSH2 69 | EGFR 70 | BAX 71 | INSR 
72 | IQGAP1 73 | SERPINA3 74 | NRAS 75 | RB1 76 | ST7 77 | PLAG1 78 | DGKG 79 | PI15 80 | ITGB3 81 | ITGB2 82 | PIM1 83 | PRKCD 84 | CES3 85 | PRKCZ 86 | TNC 87 | MTAP 88 | SNF1LK2 89 | ABCC4 90 | PTCH1 91 | BCL2 92 | -------------------------------------------------------------------------------- /example/dendrix/mutation_matrix: -------------------------------------------------------------------------------- 1 | TCGA-06-0132 TP53 TRIM3 EP300 DTX3 EGFR PTCH1 2 | TCGA-06-0184 ERBB2 FGFR2 GRIA2 GNAI1 TP53 IRS1 DTX3 SERPINA3 ST7 CES3 BCL2 3 | TCGA-02-0089 FGFR1 PTEN TCF12 CYP27B1 PAX5 TNFRSF1B DTX3 4 | TCGA-02-0086 PTEN NF1 PRKCD NRAS RTN1 5 | TCGA-02-0085 PTEN TNK2 CDC123 STAT1 TAOK3 BCL2 6 | TCGA-06-0174 PRKCD PRKCB1 TRIM2 GLI1 CDC123 7 | TCGA-02-0006 IL1RL1 TP53 DTX3 SERPINA3 8 | TCGA-06-0210 PRKCB1 TP53 STAT1 ST7 PRKCZ BCL2 9 | TCGA-02-0080 TRIM24 NMBR TCF12 TP53 RTN1 MTAP 10 | TCGA-06-0214 ROS1 DST PAX5 TNK2 PPP2R5D GLI1 MSH2 EGFR PIM1 11 | TCGA-06-0211 ERBB2 NF1 PRKCB1 TP53 TSC2 A2M DTX3 EGFR PRKCZ 12 | TCGA-06-0187 PTEN FLT4 MARK4 COL1A2 PRKCZ 13 | TCGA-06-0148 PIK3CA COL1A2 MTAP 14 | TCGA-06-0185 GRIA2 TP53 IQGAP1 RB1 PLAG1 15 | TCGA-06-0147 BRCA2 FGFR2 GLI1 PLAG1 CES3 16 | TCGA-06-0188 CHD5 NRAS RB1 SNF1LK2 17 | TCGA-06-0145 PTEN CYP27B1 BCAR1 DTX3 MSH2 EGFR 18 | TCGA-06-0219 BRCA2 PTEN GRIA2 CHD5 GNAI1 CHEK1 DGKG TNC 19 | TCGA-06-0143 MDM4 APBB1IP CYP27B1 PDGFRA EGFR 20 | TCGA-06-0141 PTEN TP53 BCR COL1A2 EGFR 21 | TCGA-02-0058 GPR78 SYP GNAI1 TP53 A2M MSH2 SNF1LK2 22 | TCGA-02-0074 ERBB2 PTEN TCF12 SOX11 TP53 BCL2 23 | TCGA-02-0075 FGFR1 FLI1 24 | TCGA-06-0128 SLC12A6 PTEN PRKCB1 TP53 DTX3 MSH2 PTCH1 25 | TCGA-06-0129 PIK3R1 GRIA2 SHH TP53 PIK3CA COL1A2 DGKG ITGB2 ABCC4 PTCH1 26 | TCGA-06-0166 FLT1 PIM1 27 | TCGA-02-0071 WEE1 MDM2 FGFR2 TP53 BAX MTAP 28 | TCGA-06-0168 SHH RPS6KA3 TP53 TNFRSF1B RB1 29 | TCGA-06-0169 PIK3R1 PTPN11 PTEN CHEK1 TP53 BCR BCL11A NRAS ITGB3 ITGB2 30 | TCGA-02-0052 TRIM24 MDM2 FGFR2 PRKCD RPS6KA3 PDGFRA CES3 31 | TCGA-02-0064 
CDC123 PIM1 32 | TCGA-02-0054 PTEN EPHA7 PRKCZ 33 | TCGA-02-0055 NF1 PAX5 PIK3CA TSC2 BCR BAX PI15 SNF1LK2 PTCH1 34 | TCGA-06-0122 TRIM24 RPS6KA3 TP53 EPHA7 EP300 BAX MTAP 35 | TCGA-02-0057 GPR78 TP53 TRIM2 PTCH1 36 | TCGA-02-0033 FGFR2 NF1 TP53 TRIM2 CHAT MSH2 NRAS ST7 37 | TCGA-02-0011 PIK3R1 TRIM24 NMBR TP53 38 | TCGA-02-0034 GNAI1 PIK3CA DTX3 INSR 39 | TCGA-02-0038 PTEN TP53 TAF1 EPHA7 DTX3 EGFR BCL2 40 | TCGA-02-0021 SLC12A6 PTEN APBB1IP PRKCB1 PIK3CA TSC2 EGFR 41 | TCGA-02-0115 ERBB2 TRIM2 PI15 42 | TCGA-02-0037 SLC12A6 MDM2 PDGFRA EPHA7 PRKCZ 43 | TCGA-06-0137 SYP COL1A2 EGFR RB1 BCL2 44 | TCGA-06-0195 PTEN NF1 DTX3 MSH2 EGFR ABCC4 45 | TCGA-06-0197 PTEN CHD5 IQGAP1 NRAS SNF1LK2 46 | TCGA-06-0190 IL1RL1 WEE1 KRAS PTEN APBB1IP NF1 PRKCD TP53 PIK3CA COL1A2 EGFR 47 | TCGA-06-0201 PIK3R1 KRAS NMBR EGFR IQGAP1 MTAP ABCC4 PTCH1 BCL2 48 | TCGA-06-0158 PHLPP KPNA2 FLT4 PIK3CA TRIM3 EGFR PRKCZ MTAP 49 | TCGA-06-0206 SLC12A6 APBB1IP PRKCD BCAR1 TP53 50 | TCGA-06-0209 PIK3R1 FLT1 LAMP1 IRS1 BAX MTAP 51 | TCGA-06-0133 PTEN CHEK1 EGFR 52 | TCGA-06-0221 IL1RL1 WEE1 PTEN APBB1IP CYP27B1 DST GLI1 CHAT 53 | TCGA-02-0113 IL1RL1 GPR78 54 | TCGA-06-0241 TCF12 PRKCD TAF1 TRIM2 PDGFRA PTCH1 55 | TCGA-06-0154 SOX11 CYP27B1 GNAI1 PPP2R5D CDC123 MTAP 56 | TCGA-06-0157 MDM4 PTEN TP53 TNFRSF1B IRS1 EGFR IQGAP1 PIM1 57 | TCGA-06-0156 PAX5 PDGFRA EPHA7 RB1 58 | TCGA-06-0173 CHD5 GPR78 GLI1 PDGFRA 59 | TCGA-02-0060 MDM2 PTEN CYP27B1 TP53 GLI1 TNFRSF1B EP300 IRS1 DTX3 ST7 CES3 60 | TCGA-06-0171 APBB1IP PAX5 EPHA7 EGFR RB1 ITGB2 PIM1 61 | TCGA-06-0176 BRCA2 GRIA2 GNAI1 A2M FLT1 ST7 DGKG ITGB3 TNC 62 | TCGA-06-0139 PTEN SHH PIK3CA TRIM3 TAOK3 EGFR 63 | TCGA-06-0138 TRIM24 MDM2 NMBR KPNA2 NF1 PDGFRA INSR 64 | TCGA-02-0069 FGFR2 PRKCD CDC123 BCR MSH2 CES3 65 | TCGA-02-0024 PTEN SYP BCR MARK4 MSH2 66 | TCGA-02-0027 KRAS PTEN TCF12 SOX11 BCAR1 TP53 EPHA7 LAMP1 67 | TCGA-06-0178 PTEN GRIA2 CYP27B1 DST MSH2 EGFR PIM1 PRKCZ 68 | TCGA-02-0047 SLC12A6 ERBB2 APBB1IP TP53 CHAT MARK4 RB1 69 | 
TCGA-02-0046 SLC12A6 FLI1 PTEN NF1 PRKCB1 TP53 DTX3 RB1 70 | TCGA-06-0130 BRCA2 APBB1IP SHH 71 | TCGA-02-0007 PTPN11 PTEN NF1 RB1 72 | TCGA-06-0124 TRIM24 MDM2 PTEN NMBR TCF12 APBB1IP TP53 STAT1 MARK4 LAMP1 ITGB3 ITGB2 MTAP BCL2 73 | TCGA-02-0102 FLI1 PIK3CA RB1 MTAP BCL2 74 | TCGA-02-0003 PTEN GLI1 PDGFRA CDC123 75 | TCGA-02-0001 PTEN CHD5 GNAI1 LAMP1 TAOK3 COL1A2 EGFR 76 | TCGA-06-0125 PIK3R1 TRIM24 FLI1 APBB1IP TP53 BCR EP300 EGFR 77 | TCGA-06-0237 ROS1 TCF12 GRIA2 PRKCD TP53 MARK4 EGFR BAX IQGAP1 ABCC4 78 | TCGA-02-0107 PTEN KPNA2 PRKD2 ABCC4 79 | TCGA-06-0208 FGFR2 NMBR FLT1 EP300 80 | TCGA-02-0116 BRCA2 PTEN MLL4 PRKD2 ST7 PRKCZ 81 | TCGA-06-0189 TRIM24 MDM2 PTEN BCAR1 TP53 82 | TCGA-06-0126 MDM2 PTEN CHD5 A2M EGFR MTAP 83 | TCGA-02-0009 ERBB2 BRCA2 PTEN NF1 CHEK1 PDGFRA TSC2 STAT1 DTX3 EGFR NRAS DGKG 84 | TCGA-06-0213 ERBB2 CENPF PTEN GPR78 NF1 TP53 EGFR 85 | -------------------------------------------------------------------------------- /example/smg/example.bam_list: -------------------------------------------------------------------------------- 1 | CNIC-02-04-0001-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0001-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0001-01-WGS-PP-4201.bam 2 | CNIC-02-04-0002-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0002-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0002-01-WGS-PP-4201.bam 3 | CNIC-02-04-0003-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0003-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0003-01-WGS-PP-4201.bam 4 | CNIC-02-04-0004-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0004-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0004-01-WGS-PP-4201.bam 5 | CNIC-02-04-0005-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0005-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0005-01-WGS-PP-4201.bam 6 | CNIC-02-04-0006-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0006-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0006-01-WGS-PP-4201.bam 7 | CNIC-02-04-0007-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0007-10-WGS-PP-4201.bam 
/bam_dir/tumor/CNIC-02-04-0007-01-WGS-PP-4201.bam 8 | CNIC-02-04-0008-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0008-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0008-01-WGS-PP-4201.bam 9 | CNIC-02-04-0009-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0009-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0009-01-WGS-PP-4201.bam 10 | CNIC-02-04-0010-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0010-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0010-01-WGS-PP-4201.bam 11 | CNIC-02-04-0011-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0011-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0011-01-WGS-PP-4201.bam 12 | CNIC-02-04-0012-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0012-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0012-01-WGS-PP-4201.bam 13 | CNIC-02-04-0013-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0013-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0013-01-WGS-PP-4201.bam 14 | CNIC-02-04-0014-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0014-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0014-01-WGS-PP-4201.bam 15 | CNIC-02-04-0015-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0015-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0015-01-WGS-PP-4201.bam 16 | CNIC-02-04-0016-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0016-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0016-01-WGS-PP-4201.bam 17 | CNIC-02-04-0017-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0017-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0017-01-WGS-PP-4201.bam 18 | CNIC-02-04-0018-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0018-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0018-01-WGS-PP-4201.bam 19 | CNIC-02-04-0019-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0019-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0019-01-WGS-PP-4201.bam 20 | CNIC-02-04-0020-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0020-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0020-01-WGS-PP-4201.bam 21 | CNIC-02-04-0021-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0021-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0021-01-WGS-PP-4201.bam 22 | 
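Each row of `example.bam_list` above holds a sample name, a normal BAM path, and a tumor BAM path, and each row corresponds to one `calc-covg-helper` command in `example.run-coverage-command`. The sketch below shows one way such commands could be generated from a BAM list. The function name `gen_covg_cmds` and its three positional arguments are hypothetical, the tab separator inside `--normal-tumor-bam-pair` is an assumption about the original (unflattened) files, and the depth and mapping-quality values are simply those seen in the example commands.

```shell
#!/bin/sh
# Sketch only: print one calc-covg-helper command per BAM-list row read from stdin.
# gen_covg_cmds is a hypothetical helper, not part of MuSiC2.
# Arguments: ROI file, reference FASTA, output directory for the .covg files.
gen_covg_cmds() {
    roi_file=$1
    ref_seq=$2
    out_dir=$3
    # Each row: sample name, normal BAM, tumor BAM (whitespace-separated, no spaces in paths).
    while read -r sample normal tumor; do
        # Skip blank rows.
        [ -n "$sample" ] || continue
        # Assumption: the three fields are rejoined with tabs inside the quoted pair.
        printf "music2 bmr calc-covg-helper --normal-tumor-bam-pair=\"%s\t%s\t%s\" --roi-file=%s --reference-sequence=%s --output-file=%s/%s.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG\n" \
            "$sample" "$normal" "$tumor" "$roi_file" "$ref_seq" "$out_dir" "$sample"
    done
}
```

A possible invocation, assuming the paths used in the example commands: `gen_covg_cmds ../input/roi_file ../input/refseq.fa ./roi_covgs < example.bam_list`. The output omits the surrounding single quotes that appear in `example.run-coverage-command`.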
-------------------------------------------------------------------------------- /example/smg/example.run-coverage-command: -------------------------------------------------------------------------------- 1 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0001-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0001-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0001-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0001-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 2 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0002-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0002-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0002-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0002-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 3 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0003-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0003-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0003-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0003-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 4 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0004-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0004-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0004-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0004-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 5 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0005-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0005-10-WGS-PP-4201.bam 
/bam_dir/tumor/CNIC-02-04-0005-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0005-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 6 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0006-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0006-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0006-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0006-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 7 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0007-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0007-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0007-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0007-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 8 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0008-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0008-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0008-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0008-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 9 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0009-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0009-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0009-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0009-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 10 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0010-01-WGS-PP-4201 
/bam_dir/normal/CNIC-02-04-0010-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0010-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0010-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 11 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0011-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0011-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0011-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0011-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 12 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0012-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0012-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0012-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0012-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 13 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0013-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0013-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0013-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0013-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 14 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0014-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0014-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0014-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0014-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 15 | 'music2 bmr calc-covg-helper 
--normal-tumor-bam-pair="CNIC-02-04-0015-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0015-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0015-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0015-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 16 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0016-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0016-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0016-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0016-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 17 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0017-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0017-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0017-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0017-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 18 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0018-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0018-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0018-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0018-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 19 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0019-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0019-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0019-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0019-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 20 
| 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0020-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0020-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0020-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0020-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 21 | 'music2 bmr calc-covg-helper --normal-tumor-bam-pair="CNIC-02-04-0021-01-WGS-PP-4201 /bam_dir/normal/CNIC-02-04-0021-10-WGS-PP-4201.bam /bam_dir/tumor/CNIC-02-04-0021-01-WGS-PP-4201.bam" --roi-file=../input/roi_file --reference-sequence=../input/refseq.fa --output-file=./roi_covgs/CNIC-02-04-0021-01-WGS-PP-4201.covg --normal-min-depth=6 --tumor-min-depth=8 --min-mapq=20 --bp-class-types=AT,CG,CpG' 22 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Bmr.pm: -------------------------------------------------------------------------------- 1 | package TGI::MuSiC2::Bmr; 2 | 3 | use warnings; 4 | use strict; 5 | 6 | use IO::File; 7 | 8 | use TGI::MuSiC2::CalcBmr; 9 | use TGI::MuSiC2::CalcCovg; 10 | use TGI::MuSiC2::CalcWigCovg; 11 | use TGI::MuSiC2::CalcCovgHelper; 12 | use TGI::MuSiC2::CalcWindowRoi; 13 | use TGI::MuSiC2::CalcWindowMaf; 14 | use TGI::MuSiC2::CalcBmrModifier; 15 | 16 | ## process subcommands 17 | sub new { 18 | my $class = shift; 19 | my $this = {}; 20 | $this->{SUBCOMMAND} = shift; 21 | bless $this, $class; 22 | $this->process(); 23 | 24 | return $this; 25 | } 26 | 27 | sub process { 28 | my $this = shift; 29 | my %cmds = map{ 30 | ($_, 1) 31 | } qw( calc-bmr 32 | calc-covg 33 | calc-covg-helper 34 | calc-wig-covg 35 | calc-window-roi 36 | calc-window-maf 37 | calc-bmr-modifier ); 38 | unless (defined $this->{SUBCOMMAND}) { die help_text(); }; 39 | unless (exists $cmds{ $this->{SUBCOMMAND}}) { 40 | warn ' Please give valid sub command ! 
', "\n"; 41 | die help_text(); 42 | } 43 | SWITCH:{ 44 | $this->{SUBCOMMAND} eq 'calc-bmr' && do { TGI::MuSiC2::CalcBmr->new(); last SWITCH; }; 45 | $this->{SUBCOMMAND} eq 'calc-bmr-modifier' && do { TGI::MuSiC2::CalcBmrModifier->new(); last SWITCH; }; 46 | $this->{SUBCOMMAND} eq 'calc-covg' && do { TGI::MuSiC2::CalcCovg->new(); last SWITCH; }; 47 | $this->{SUBCOMMAND} eq 'calc-covg-helper' && do { TGI::MuSiC2::CalcCovgHelper->new(); last SWITCH; }; 48 | $this->{SUBCOMMAND} eq 'calc-wig-covg' && do { TGI::MuSiC2::CalcWigCovg->new(); last SWITCH; }; 49 | $this->{SUBCOMMAND} eq 'calc-window-roi' && do { TGI::MuSiC2::CalcWindowRoi->new(); last SWITCH; }; 50 | $this->{SUBCOMMAND} eq 'calc-window-maf' && do { TGI::MuSiC2::CalcWindowMaf->new(); last SWITCH; }; 51 | 52 | $this->{SUBCOMMAND} eq 'help' && do { die help_text(); last SWITCH; }; 53 | } 54 | 55 | } 56 | 57 | ## Bmr subcmds 58 | sub help_text { 59 | my $this = shift; 60 | return <beifang.cn@gmail.comE 97 | 98 | =head1 SEE ALSO 99 | 100 | https://github.com/ding-lab/MuSiC2 101 | 102 | =head1 LICENSE 103 | 104 | This library is free software with MIT licence; you can redistribute it and/or modify 105 | it under the same terms as Perl itself. 
106 | 107 | =cut 108 | 109 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/CalcBmrModifier.pm: -------------------------------------------------------------------------------- 1 | package TGI::MuSiC2::CalcBmrModifier; 2 | ## 3 | # Given a gene mutation-rate (mrs) file, calculate window-based BMR modifiers 4 | ## 5 | use strict; 6 | use warnings; 7 | # 8 | # 9 | use IO::File; 10 | use Getopt::Long; 11 | use File::Temp qw/ tempfile /; 12 | 13 | sub new { 14 | my $class = shift; 15 | my $this = {}; 16 | 17 | $this->{_ROI_FILE} = undef; 18 | $this->{_GENE_MRS_FILE} = undef; 19 | $this->{_OUTPUT_MODIFIER_FILE} = "window_based_gene_bmr_modifier"; 20 | $this->{_WINDOW_SIZE} = 1000000; 21 | 22 | bless $this, $class; 23 | $this->process(); 24 | 25 | return $this; 26 | } 27 | 28 | sub process { 29 | my $this = shift; 30 | my ( $help, $options ); 31 | unless( @ARGV ) { die $this->help_text(); } 32 | $options = GetOptions ( 33 | 34 | 'roi-file=s' => \$this->{_ROI_FILE}, 35 | 'gene-mrs-file=s' => \$this->{_GENE_MRS_FILE}, 36 | 'output-modifier-file=s' => \$this->{_OUTPUT_MODIFIER_FILE}, 37 | 38 | 'window-size=i' => \$this->{_WINDOW_SIZE}, 39 | 40 | 'help' => \$help, 41 | ); 42 | if ( $help ) { print STDERR help_text(); exit 0; } 43 | unless( $options ) { die $this->help_text(); } 44 | #### processing #### 45 | # 46 | # Check on all the input data 47 | print STDERR "ROI file not found or is empty: $this->{_ROI_FILE}\n" unless( -s $this->{_ROI_FILE} ); 48 | return undef unless( -s $this->{_ROI_FILE} ); 49 | print STDERR "window based mutation rate file not found or is empty: $this->{_GENE_MRS_FILE}\n" unless( -s $this->{_GENE_MRS_FILE} ); 50 | return undef unless( -s $this->{_GENE_MRS_FILE} ); 51 | unless ( $this->{_WINDOW_SIZE} =~ /^\d+$/ ) { 52 | print STDERR "Window size format is not valid!\n"; 53 | return undef; 54 | }; 55 | # 56 | ## generate 0-based window based bmrs file 57 | # 58 | my @cont = (); my %gene_mr_hash; 59 | my $gene_mrs_fh = 
IO::File->new( $this->{_GENE_MRS_FILE} ) or die "Cannot open gene_mrs file. $!"; 60 | while ( my $line = $gene_mrs_fh->getline ) { 61 | next if ($line =~ /^#Gene/); 62 | chomp($line); my @t = split /\t/, $line; 63 | $gene_mr_hash{'cov'}{$t[0]} += $t[2]; 64 | $gene_mr_hash{'mut'}{$t[0]} += $t[3]; 65 | } 66 | $gene_mrs_fh->close(); 67 | 68 | foreach my $w (sort keys %{$gene_mr_hash{'mut'}}) { 69 | my $cont_line = ""; 70 | my ($chr, $start) = split /:/, $w; 71 | $chr=~s/^CHR//; 72 | $start--; 73 | if ($gene_mr_hash{'mut'}{$w} >= 1) { 74 | $cont_line .= "$chr\t$start\t".($start+$this->{_WINDOW_SIZE})."\t$gene_mr_hash{'mut'}{$w}\t".int($gene_mr_hash{'cov'}{$w}/3)."\t".$gene_mr_hash{'mut'}{$w}/($gene_mr_hash{'cov'}{$w}/3)."\n"; 75 | push( @cont, $cont_line); 76 | } 77 | } 78 | my ( undef, $temp_0_based_win_bmrs ) = tempfile(); 79 | my $win_bmrs_fh = IO::File->new( $temp_0_based_win_bmrs, ">" ) or die "Temporary file could not be created. $!"; 80 | $win_bmrs_fh->print( @cont ); 81 | $win_bmrs_fh->close(); 82 | @cont = (); 83 | my $roi_fh = IO::File->new( $this->{_ROI_FILE} ) or die "Cannot open ROI file. $!"; 84 | while ( my $line = $roi_fh->getline ) { 85 | chomp( $line ); 86 | my @t = split /\t/, $line; 87 | $t[1]--; 88 | push ( @cont, join("\t", @t)."\n" ); 89 | } 90 | $roi_fh->close(); 91 | my ( undef, $temp_0_based_roi_file ) = tempfile(); 92 | my $new_rois_fh = IO::File->new( $temp_0_based_roi_file, ">" ) or die "Temporary file could not be created. 
$!"; 93 | $new_rois_fh->print( @cont ); 94 | $new_rois_fh->close(); 95 | my ( undef, $temp_0_based_win_bmrs_sorted ) = tempfile(); 96 | my ( undef, $temp_0_based_roi_file_sorted ) = tempfile(); 97 | system( "joinx sort -i $temp_0_based_roi_file -o $temp_0_based_roi_file_sorted" ); 98 | system( "joinx sort -i $temp_0_based_win_bmrs -o $temp_0_based_win_bmrs_sorted" ); 99 | my ( undef, $temp_intersected_file ) = tempfile(); 100 | ## do intersecting 101 | system( "joinx intersect --output-both -a $temp_0_based_roi_file_sorted -b $temp_0_based_win_bmrs_sorted -o $temp_intersected_file" ); 102 | @cont = (); 103 | my $intersect_fh = IO::File->new( $temp_intersected_file ) or die "Temporary file could not be opened. $!"; 104 | while ( my $line = $intersect_fh->getline ) { chomp( $line ); push( @cont, $line ); }; 105 | $intersect_fh->close(); 106 | my %contu = map{ @_ = split /\t/; ( join( "\t", @_[4..8] ), 1) } @cont; 107 | my ( $total_mr, $m, $c, $f2 ) = (0.0, 0, 0, 0); 108 | map{ @_ = split /\t/; $m += $_[3]; $c +=$_[4]; } keys %contu; 109 | $total_mr = $m/$c; 110 | my %mh = (); my %ch = (); 111 | map{ @_ = split /\t/; $_[9] = $_[9]/$total_mr; $f2 = abs($_[2] - $_[1]); $ch{$_[3]}+=$f2; $mh{$_[3]}+=($f2*$_[9]); } @cont; 112 | ## pour out modifier file 113 | # 114 | my $output_fh = IO::File->new( $this->{_OUTPUT_MODIFIER_FILE}, ">" ) or die "Output file could not be created. 
$!"; 115 | map{ $output_fh->print( "$_\t", $mh{$_}/$ch{$_}, "\n"); } sort keys %mh; 116 | $output_fh->close(); 117 | # 118 | return 1; 119 | 120 | } 121 | 122 | ## usage 123 | sub help_text { 124 | my $this = shift; 125 | return <{_ROI_FILE} = undef; 16 | $this->{_REF_SEQ} = undef; 17 | $this->{_BAM_LIST} = undef; 18 | $this->{_OUTPUT_DIR} = undef; 19 | $this->{_CMD_LIST_FILE} = undef; 20 | $this->{_CMD_PREFIX} = undef; 21 | $this->{_BP_CLASS_TYPES} = 'AT,CG,CpG'; 22 | $this->{_NOR_MIN_DEPTH} = 6; 23 | $this->{_TUM_MIN_DEPTH} = 8; 24 | $this->{_MIN_MAPQ} = 20; 25 | bless $this, $class; 26 | $this->process(); 27 | 28 | return $this; 29 | } 30 | 31 | sub process { 32 | my $this = shift; 33 | my ($help, $options); 34 | unless(@ARGV) { die $this->help_text(); } 35 | $options = GetOptions ( 36 | 'roi-file=s' => \$this->{_ROI_FILE}, 37 | 'reference-sequence=s' => \$this->{_REF_SEQ}, 38 | 'bam-list=s' => \$this->{_BAM_LIST}, 39 | 'output-dir=s' => \$this->{_OUTPUT_DIR}, 40 | 'cmd-list-file=s' => \$this->{_CMD_LIST_FILE}, 41 | 'cmd-prefix=s' => \$this->{_CMD_PREFIX}, 42 | 'bp-class-types=s' => \$this->{_BP_CLASS_TYPES}, 43 | 'normal-min-depth=i' => \$this->{_NOR_MIN_DEPTH}, 44 | 'tumor-min-depth=i' => \$this->{_TUM_MIN_DEPTH}, 45 | 'min-mapq=i' => \$this->{_MIN_MAPQ}, 46 | 'help' => \$help, 47 | ); 48 | if ($help) { print STDERR help_text(); exit 0; } 49 | unless($options) { die $this->help_text(); } 50 | #### processing #### 51 | # 52 | # Check on all the input data 53 | print STDERR "ROI file not found or is empty: $this->{_ROI_FILE}\n" unless( -s $this->{_ROI_FILE} ); 54 | print STDERR "Reference sequence file not found: $this->{_REF_SEQ}\n" unless( -e $this->{_REF_SEQ} ); 55 | print STDERR "List of BAMs not found or is empty: $this->{_BAM_LIST}\n" unless( -s $this->{_BAM_LIST} ); 56 | print STDERR "Output directory not found: $this->{_OUTPUT_DIR}\n" unless( -e $this->{_OUTPUT_DIR} ); 57 | return undef unless( -s $this->{_ROI_FILE} && -e $this->{_REF_SEQ} && -s 
$this->{_BAM_LIST} && -e $this->{_OUTPUT_DIR} ); 58 | # Outputs of this script will be written to these 59 | # locations in the output directory 60 | # 61 | # Remove trailing forward slashes if any 62 | $this->{_OUTPUT_DIR} =~ s/(\/)+$//; 63 | # Stores output from calcRoiCovg per sample 64 | my $roi_covg_dir = "$this->{_OUTPUT_DIR}/roi_covgs"; 65 | # Stores per-gene coverages per sample 66 | my $gene_covg_dir = "$this->{_OUTPUT_DIR}/gene_covgs"; 67 | # Stores total coverages per sample 68 | my $tot_covg_file = "$this->{_OUTPUT_DIR}/total_covgs"; 69 | ## optional paras 70 | my $optional_params = "--normal-min-depth=$this->{_NOR_MIN_DEPTH} --tumor-min-depth=$this->{_TUM_MIN_DEPTH} --min-mapq=$this->{_MIN_MAPQ} --bp-class-types=$this->{_BP_CLASS_TYPES}"; 71 | # Check whether the annotated regions of interest are clumped together by chromosome 72 | my $roiFh = IO::File->new( $this->{_ROI_FILE} ) or die "ROI file could not be opened. $!\n"; 73 | my @chroms = ( "" ); 74 | # Emulate Unix's uniq command on the chromosome column 75 | while ( my $line = $roiFh->getline ) { 76 | my ( $chrom ) = ( $line =~ m/^(\S+)/ ); 77 | push( @chroms, $chrom ) if( $chrom ne $chroms[-1] ); 78 | } 79 | $roiFh->close; 80 | # Get the actual number of unique chromosomes 81 | my %chroms = map { $_ => 1 } @chroms; 82 | if ( scalar( @chroms ) != scalar( keys %chroms ) ) { 83 | print STDERR "ROIs from the same chromosome must be listed adjacent to each other in file. 
"; 84 | print STDERR "If in UNIX, try:\nsort -k 1,1 $this->{_ROI_FILE}\n"; 85 | return undef; 86 | } 87 | 88 | # If the reference sequence FASTA file hasn't been indexed, do it 89 | my $ref_seq_idx = "$this->{_REF_SEQ}.fai"; 90 | system( "samtools faidx $this->{_REF_SEQ}" ) unless( -e $ref_seq_idx ); 91 | 92 | # Create the output directories unless they already exist 93 | mkdir $roi_covg_dir unless( -e $roi_covg_dir ); 94 | mkdir $gene_covg_dir unless( -e $gene_covg_dir ); 95 | 96 | my ( $cmdFh, $totCovgFh ); 97 | if ( defined $this->{_CMD_LIST_FILE} ) { 98 | $cmdFh = IO::File->new( $this->{_CMD_LIST_FILE}, ">" ); 99 | print "Creating a list of parallelizable jobs at $this->{_CMD_LIST_FILE}.\n"; 100 | print "After successfully running all the jobs in $this->{_CMD_LIST_FILE},\n", 101 | "be sure to run this script a second time (without defining the cmd-list-file argument) to merge results in roi_covgs.\n"; 102 | } else { 103 | $totCovgFh = IO::File->new( $tot_covg_file, ">" ); 104 | ## process bp class types here 105 | # instead of using fixed AT, CG, CpG types 106 | # 107 | # 108 | $totCovgFh->print( "#Sample\tCovered_Bases\t" ); 109 | $totCovgFh->print( join("_Bases_Covered\t", split /,/, $this->{_BP_CLASS_TYPES}) ); 110 | $totCovgFh->print( "_Bases_Covered\n" ); 111 | } 112 | # Parse through each pair of BAM files provided 113 | # and run calcRoiCovg as necessary 114 | my $bamFh = IO::File->new( $this->{_BAM_LIST} ); 115 | while ( my $line = $bamFh->getline ) { 116 | next if( $line =~ m/^#/ ); 117 | chomp( $line ); 118 | my ( $sample, $normal_bam, $tumor_bam ) = split( /\t/, $line ); 119 | $normal_bam = '' unless( defined $normal_bam ); 120 | $tumor_bam = '' unless( defined $tumor_bam ); 121 | print STDERR "Normal BAM for $sample not found: \"$normal_bam\"\n" unless( -e $normal_bam ); 122 | print STDERR "Tumor BAM for $sample not found: \"$tumor_bam\"\n" unless( -e $tumor_bam ); 123 | next unless( -e $normal_bam && -e $tumor_bam ); 124 | # Construct the command 
that calculates coverage per ROI 125 | # 126 | my $calcRoiCovg_cmd = "\'music2 bmr calc-covg-helper --normal-tumor-bam-pair=\"$line\" --roi-file=$this->{_ROI_FILE} --reference-sequence=$this->{_REF_SEQ} --output-file=$roi_covg_dir\/$sample.covg $optional_params\'"; 127 | 128 | # If user only wants the calcRoiCovg commands, write 129 | # them to file and skip running calcRoiCovg 130 | if ( defined $this->{_CMD_LIST_FILE} ) { 131 | $calcRoiCovg_cmd = $this->{_CMD_PREFIX} . " $calcRoiCovg_cmd" if ( defined $this->{_CMD_PREFIX} ); 132 | $cmdFh->print( "$calcRoiCovg_cmd\n" ); 133 | next; 134 | } 135 | 136 | # If the calcRoiCovg output was already 137 | # generated, then don't rerun it 138 | if ( -s "$roi_covg_dir/$sample.covg" ) { 139 | print "$sample.covg found in $roi_covg_dir. Skipping re-calculation.\n"; 140 | } else { 141 | print STDERR "$sample.covg not found in $roi_covg_dir. please make a command list file to run calcRoiCovg !\n"; 142 | return undef; 143 | } 144 | # Read the calcRoiCovg output and count 145 | # covered bases per gene 146 | my %geneCovg = (); 147 | my @bp_types_array = split /,/, $this->{_BP_CLASS_TYPES}; 148 | 149 | #my ( $tot_covd, $tot_at_covd, $tot_cg_covg, $tot_cpg_covd ); 150 | my ( $tot_covd, ); 151 | my %tot_covd_bp_types = map{ ($_, 0) } @bp_types_array; 152 | my %gene_covd_bp_types = map{ ($_, 0) } @bp_types_array; 153 | 154 | my $roiCovgFh = IO::File->new( "$roi_covg_dir/$sample.covg" ); 155 | while ( my $line = $roiCovgFh->getline ) { 156 | chomp( $line ); 157 | if ( $line =~ m/^#NonOverlappingTotals/ ) { 158 | my @ta = split( /\t/, $line ); 159 | $tot_covd = $ta[3]; my $i = 1; 160 | map { $tot_covd_bp_types{$_} = $ta[3+$i++]; } @bp_types_array; 161 | 162 | #( undef, undef, undef, $tot_covd, $tot_at_covd, $tot_cg_covg, $tot_cpg_covd ) = split( /\t/, $line ); 163 | # 164 | } elsif ( $line !~ m/^#/ ) { 165 | # my ( $gene, undef, $length, $covd, $at_covd, $cg_covd, $cpg_covd ) = split( /\t/, $line ); 166 | my @tag = split( /\t/, $line ); 
167 | my ( $gene, undef, $length, $covd, ) = @tag; 168 | $geneCovg{$gene}{len} += $length; 169 | $geneCovg{$gene}{covd_len} += $covd; 170 | my $i = 1; 171 | map { $geneCovg{$gene}{$_} += $tag[3+$i++]; } @bp_types_array; 172 | 173 | #$geneCovg{$gene}{at} += $at_covd; 174 | #$geneCovg{$gene}{cg} += $cg_covd; 175 | #$geneCovg{$gene}{cpg} += $cpg_covd; 176 | } 177 | } 178 | 179 | $roiCovgFh->close; 180 | # Write the per-gene coverages to a file 181 | # named after this sample_name 182 | # 183 | my $geneCovgFh = IO::File->new( "$gene_covg_dir/$sample.covg", ">" ); 184 | #$geneCovgFh->print( "#Gene\tLength\tCovered\tAT_covd\tCG_covd\tCpG_covd\n" ); 185 | $geneCovgFh->print( "#Gene\tLength\tCovered\t", join("_covd\t", @bp_types_array ), "_covd\n" ); 186 | foreach my $gene ( sort keys %geneCovg ) { 187 | #$geneCovgFh->print( join( "\t", $gene, $geneCovg{$gene}{len}, $geneCovg{$gene}{covd_len},$geneCovg{$gene}{at}, $geneCovg{$gene}{cg}, $geneCovg{$gene}{cpg} ), "\n" ); 188 | $geneCovgFh->print( join( "\t", $gene, $geneCovg{$gene}{len}, $geneCovg{$gene}{covd_len}, map { $geneCovg{$gene}{$_} } @bp_types_array ), "\n" ); 189 | } 190 | $geneCovgFh->close; 191 | # Write total coverages for this sample to a file 192 | #$totCovgFh->print( "$sample\t$tot_covd\t$tot_at_covd\t$tot_cg_covg\t$tot_cpg_covd\n" ); 193 | 194 | $totCovgFh->print( join( "\t", $sample, $tot_covd, map{ $tot_covd_bp_types{$_} } @bp_types_array ), "\n" ); 195 | 196 | } 197 | $bamFh->close; 198 | $cmdFh->close if ( defined $this->{_CMD_LIST_FILE} ); 199 | $totCovgFh->close unless ( defined $this->{_CMD_LIST_FILE} ); 200 | 201 | return 1; 202 | } 203 | 204 | ## usage 205 | sub help_text { 206 | my $this = shift; 207 | return <{_ROI_FILE} = undef; 17 | $this->{_REF_SEQ} = undef; 18 | $this->{_NOR_TUM_PAIR} = undef; 19 | $this->{_OUTPUT_FILE} = undef; 20 | $this->{_BP_CLASS_TYPES} = 'AT,CG,CpG'; 21 | $this->{_NOR_MIN_DEPTH} = 6; 22 | $this->{_TUM_MIN_DEPTH} = 8; 23 | $this->{_MIN_MAPQ} = 20; 24 | 25 | bless $this, 
$class; 26 | $this->process(); 27 | 28 | return $this; 29 | } 30 | 31 | sub process { 32 | my $this = shift; 33 | my ( $help, $options ); 34 | unless( @ARGV ) { die $this->help_text(); } 35 | $options = GetOptions ( 36 | 'roi-file=s' => \$this->{_ROI_FILE}, 37 | 'reference-sequence=s' => \$this->{_REF_SEQ}, 38 | 'normal-tumor-bam-pair=s' => \$this->{_NOR_TUM_PAIR}, 39 | 'output-file=s' => \$this->{_OUTPUT_FILE}, 40 | 'bp-class-types=s' => \$this->{_BP_CLASS_TYPES}, 41 | 'normal-min-depth=i' => \$this->{_NOR_MIN_DEPTH}, 42 | 'tumor-min-depth=i' => \$this->{_TUM_MIN_DEPTH}, 43 | 'min-mapq=i' => \$this->{_MIN_MAPQ}, 44 | 'help' => \$help, 45 | ); 46 | if ( $help ) { print STDERR help_text(); exit 0; } 47 | unless( $options ) { die $this->help_text(); } 48 | 49 | my ( $sample_name, $normal_bam, $tumor_bam, ) = split /\t/, $this->{_NOR_TUM_PAIR}; 50 | 51 | # Check on all the required input data 52 | print STDERR "ROI file not found or is empty: $this->{_ROI_FILE}\n" unless( -s $this->{_ROI_FILE} ); 53 | print STDERR "Reference sequence file not found: $this->{_REF_SEQ}\n" unless( -e $this->{_REF_SEQ} ); 54 | print STDERR "Normal BAM file not found or is empty: $normal_bam\n" unless( -s $normal_bam ); 55 | print STDERR "Tumor BAM file not found or is empty: $tumor_bam\n" unless( -s $tumor_bam ); 56 | return undef unless( -s $this->{_ROI_FILE} && -e $this->{_REF_SEQ} && -s $normal_bam && -s $tumor_bam ); 57 | 58 | #### processing #### 59 | # 60 | # Check whether the annotated regions of interest 61 | # are clumped together by chromosome 62 | my $roiFh = IO::File->new( $this->{_ROI_FILE} ) or die "ROI file could not be opened. 
$!\n"; 63 | my @chroms = ( "" ); 64 | # Emulate Unix's uniq command on the chromosome column 65 | while ( my $line = $roiFh->getline ) { 66 | my ( $chrom ) = ( $line =~ m/^(\S+)/ ); 67 | push( @chroms, $chrom ) if ( $chrom ne $chroms[-1] ); 68 | } 69 | $roiFh->close; 70 | # Get the actual number of unique chromosomes 71 | my %chroms = map { $_ => 1 } @chroms; 72 | if ( scalar( @chroms ) != scalar( keys %chroms ) ) { 73 | print STDERR "ROIs from the same chromosome must be listed adjacent to each other in file. "; 74 | print STDERR "If in UNIX, try:\nsort -k 1,1 $this->{_ROI_FILE}\n"; 75 | return undef; 76 | } 77 | # If the reference sequence FASTA file hasn't been indexed, do it 78 | my $ref_seq_idx = "$this->{_REF_SEQ}.fai"; 79 | system( "samtools faidx $this->{_REF_SEQ}" ) unless( -e $ref_seq_idx ); 80 | $normal_bam = '' unless( defined $normal_bam ); 81 | $tumor_bam = '' unless( defined $tumor_bam ); 82 | print STDERR "Normal BAM not found: \"$normal_bam\"\n" unless( -e $normal_bam ); 83 | print STDERR "Tumor BAM not found: \"$tumor_bam\"\n" unless( -e $tumor_bam ); 84 | return undef unless( -e $normal_bam && -e $tumor_bam ); 85 | # Construct the command that calculates coverage per ROI 86 | #my $calcRoiCovg_cmd = "calcRoiCovg $normal_bam $tumor_bam $roi_file $ref_seq $output_file $normal_min_depth $tumor_min_depth $min_mapq"; 87 | my $cprogram_paras = "calcRoiCovg -q $this->{_MIN_MAPQ} -n $this->{_NOR_MIN_DEPTH} -t $this->{_TUM_MIN_DEPTH} -c $this->{_BP_CLASS_TYPES} "; 88 | my $required_paras = "$normal_bam $tumor_bam $this->{_ROI_FILE} $this->{_REF_SEQ} $this->{_OUTPUT_FILE}"; 89 | my $calcRoiCovg_cmd = $cprogram_paras . $required_paras; 90 | # If the calcRoiCovg output was already 91 | # generated, then don't rerun it 92 | if ( -s $this->{_OUTPUT_FILE} ) { 93 | print "Output file $this->{_OUTPUT_FILE} found. Skipping re-calculation.\n"; 94 | } 95 | # Run the calcRoiCovg command on this tumor-normal pair. 
This could take a while 96 | elsif( system( "$calcRoiCovg_cmd" ) != 0 ) { 97 | print STDERR "Failed to execute: $calcRoiCovg_cmd\n"; 98 | return; 99 | } else { 100 | print "$this->{_OUTPUT_FILE} generated and stored.\n"; 101 | return 1; 102 | } 103 | 104 | } 105 | 106 | ## usage 107 | sub help_text { 108 | my $this = shift; 109 | return <{_ROI_FILE} = undef; 18 | $this->{_REF_SEQ} = undef; 19 | $this->{_WIG_LIST} = undef; 20 | $this->{_OUTPUT_DIR} = undef; 21 | $this->{_BP_CLASS_TYPES} = 'AT,CG,CpG'; 22 | $this->{_NOR_MIN_DEPTH} = 6; 23 | $this->{_TUM_MIN_DEPTH} = 8; 24 | $this->{_MIN_MAPQ} = 20; 25 | 26 | bless $this, $class; 27 | $this->process(); 28 | 29 | return $this; 30 | } 31 | 32 | sub process { 33 | my $this = shift; 34 | my ( $help, $options ); 35 | unless( @ARGV ) { die $this->help_text(); } 36 | $options = GetOptions ( 37 | 'roi-file=s' => \$this->{_ROI_FILE}, 38 | 'reference-sequence=s' => \$this->{_REF_SEQ}, 39 | 'wig-list=s' => \$this->{_WIG_LIST}, 40 | 'output-dir=s' => \$this->{_OUTPUT_DIR}, 41 | 'bp-class-types=s' => \$this->{_BP_CLASS_TYPES}, 42 | 'normal-min-depth=i' => \$this->{_NOR_MIN_DEPTH}, 43 | 'tumor-min-depth=i' => \$this->{_TUM_MIN_DEPTH}, 44 | 'min-mapq=i' => \$this->{_MIN_MAPQ}, 45 | 'help' => \$help, 46 | ); 47 | if ( $help ) { print STDERR help_text(); exit 0; } 48 | unless( $options ) { die $this->help_text(); } 49 | 50 | #### processing #### 51 | # 52 | # Check on all the input data 53 | print STDERR "ROI file not found or is empty: $this->{_ROI_FILE}\n" unless( -s $this->{_ROI_FILE} ); 54 | print STDERR "Reference sequence file not found: $this->{_REF_SEQ}\n" unless( -e $this->{_REF_SEQ} ); 55 | print STDERR "List of WIGs not found or is empty: $this->{_WIG_LIST}\n" unless( -s $this->{_WIG_LIST} ); 56 | print STDERR "Output directory not found: $this->{_OUTPUT_DIR}\n" unless( -e $this->{_OUTPUT_DIR} ); 57 | return undef unless( -s $this->{_ROI_FILE} && -e $this->{_REF_SEQ} && -s $this->{_WIG_LIST} && -e $this->{_OUTPUT_DIR} ); 58 
| 59 | # Remove trailing forward slashes if any 60 | $this->{_OUTPUT_DIR} =~ s/(\/)+$//; 61 | # Stores output from calcRoiCovg per sample 62 | my $roi_covg_dir = "$this->{_OUTPUT_DIR}/roi_covgs"; 63 | # Stores per-gene coverages per sample 64 | my $gene_covg_dir = "$this->{_OUTPUT_DIR}/gene_covgs"; 65 | # Stores total coverages per sample 66 | my $tot_covg_file = "$this->{_OUTPUT_DIR}/total_covgs"; 67 | 68 | # If the reference sequence FASTA file hasn't been indexed, do it 69 | my $ref_seq_idx = "$this->{_REF_SEQ}.fai"; 70 | system( "samtools faidx $this->{_REF_SEQ}" ) unless( -e $ref_seq_idx ); 71 | # 72 | # Create a temporary 0-based ROI BED-file that we can use with 73 | # joinx, and also measure gene lengths 74 | # 75 | my %geneLen = (); 76 | # 77 | ## using file::temp module for temp file 78 | 79 | # didn't touch Cyriac's code in this part 80 | # 81 | my ( undef, $roi_bed ) = tempfile(); 82 | my $roiBedFh = IO::File->new( $roi_bed, ">" ) or die "Temporary ROI BED file could not be created. $!\n"; 83 | my $roiFh = IO::File->new( $this->{_ROI_FILE} ) or die "ROI file could not be opened. 
$!\n"; 84 | while ( my $line = $roiFh->getline ) { 85 | chomp( $line ); 86 | my ( $chr, $start, $stop, $gene ) = split( /\t/, $line ); 87 | --$start; 88 | unless( $start >= 0 && $start < $stop ) { 89 | print STDERR "Invalid ROI: $line\nPlease use 1-based loci and ensure that start <= stop\n"; 90 | return undef; 91 | } 92 | $geneLen{$gene} += ( $stop - $start ); 93 | $roiBedFh->print( "$chr\t$start\t$stop\t$gene\n" ); 94 | } 95 | $roiFh->close; 96 | $roiBedFh->close; 97 | # 98 | # 99 | # Also create a merged BED file where overlapping ROIs are joined 100 | # together into contiguous regions 101 | # ::TODO:: Use joinx instead of mergeBed, because 102 | # we'd rather add an in-house dependency 103 | # 104 | #my $merged_roi_bed = "$this->{_OUTPUT_DIR}/.merged_bed_file"; 105 | my ( undef, $merged_roi_bed ) = tempfile(); 106 | # 107 | system( "mergeBed -i $roi_bed | joinx sort -s - -o $merged_roi_bed" ); 108 | # or die "Failed to run mergeBed or joinx!\n$roi_bed\n$merged_roi_bed\n $!\n"; 109 | # 110 | # Create the output directories unless they already exist 111 | mkdir $roi_covg_dir unless( -e $roi_covg_dir ); 112 | mkdir $gene_covg_dir unless( -e $gene_covg_dir ); 113 | # bp class types 114 | my @bp_types_array = split /,/, $this->{_BP_CLASS_TYPES}; 115 | # 116 | # This is a file that will report the overall 117 | # non-overlapping coverages per WIG 118 | # 119 | my $totCovgFh = IO::File->new( $tot_covg_file, ">" ); 120 | $totCovgFh->print( "#Sample\tCovered_Bases\tAT_Bases_Covered\tCG_Bases_Covered\tCpG_Bases_Covered\n" ); 121 | # Parse through each pair of WIG files provided and run calcRoiCovg as necessary 122 | my $wigFh = IO::File->new( $this->{_WIG_LIST} ); 123 | while( my $line = $wigFh->getline ) { 124 | next if ( $line =~ m/^#/ ); 125 | chomp( $line ); 126 | my ( $sample, $wig_file ) = split( /\t/, $line ); 127 | $wig_file = '' unless( defined $wig_file ); 128 | print STDERR "Wiggle track format file for $sample not found: \"$wig_file\"\n" unless( -e $wig_file 
); 129 | next unless( -e $wig_file ); 130 | # 131 | # Use joinx to parse the WIG file and return per-ROI 132 | # coverages of AT, CG (non-CpG), and CpG 133 | # 134 | system( "joinx wig2bed -Zc $wig_file | joinx sort -s | joinx intersect -F \"I A3\" $roi_bed - | joinx ref-stats - $this->{_REF_SEQ} | cut -f 1-7 > $roi_covg_dir/$sample.covg" ); 135 | # or die "Failed to run joinx to calculate per-gene coverages in $sample! $!\n"; 136 | # 137 | # Read the joinx formatted coverage file and count covered bases per gene 138 | # 139 | my %geneCovg = (); 140 | my $roiCovgFh = IO::File->new( "$roi_covg_dir/$sample.covg" ); 141 | # 142 | # TODO:: need to develop MuSiCmate to do any XpX coverage information 143 | # didn't change "at,cg,cpg" coverage here 144 | # 145 | while ( my $line = $roiCovgFh->getline ) { 146 | chomp( $line ); 147 | if ( $line !~ m/^#/ ) { 148 | my ( undef, undef, undef, $gene, $at_covd, $cg_covd, $cpg_covd ) = split( /\t/, $line ); 149 | $geneCovg{$gene}{covd} += ( $at_covd + $cg_covd + $cpg_covd ); 150 | $geneCovg{$gene}{at} += $at_covd; 151 | $geneCovg{$gene}{cg} += $cg_covd; 152 | $geneCovg{$gene}{cpg} += $cpg_covd; 153 | } 154 | 155 | } 156 | $roiCovgFh->close; 157 | # 158 | # Write the per-gene coverages to a file named after this sample_name 159 | my $geneCovgFh = IO::File->new( "$gene_covg_dir/$sample.covg", ">" ); 160 | $geneCovgFh->print( "#Gene\tLength\tCovered\tAT_covd\tCG_covd\tCpG_covd\n" ); 161 | foreach my $gene ( sort keys %geneLen ) { 162 | if ( defined $geneCovg{$gene} ) { 163 | $geneCovgFh->print( join( "\t", $gene, $geneLen{$gene}, $geneCovg{$gene}{covd}, $geneCovg{$gene}{at}, $geneCovg{$gene}{cg}, $geneCovg{$gene}{cpg} ), "\n" ); 164 | } else { $geneCovgFh->print( "$gene\t" . $geneLen{$gene} . 
"\t0\t0\t0\t0\n" ); } 165 | } 166 | $geneCovgFh->close; 167 | # Measure coverage stats on the merged ROI file, so that 168 | # bps across the genome are not counted twice 169 | my ( undef, $merged_roi_bed_covg ) = tempfile(); 170 | system( "joinx wig2bed -Zc $wig_file | joinx sort -s | joinx intersect $merged_roi_bed - | joinx ref-stats - $this->{_REF_SEQ} | cut -f 1-6 > $merged_roi_bed_covg" ); 171 | # or die "Failed to run joinx to calculate overall coverages in $sample! $!\n"; 172 | # 173 | # Read the joinx formatted coverage file and sum up the coverage stats per region 174 | my ( $tot_covd, $tot_at_covd, $tot_cg_covg, $tot_cpg_covd ); 175 | my $totRoiCovgFh = IO::File->new( $merged_roi_bed_covg ); 176 | while( my $line = $totRoiCovgFh->getline ) { 177 | chomp( $line ); 178 | if ( $line !~ m/^#/ ) { 179 | my ( $chr, $start, $stop, $at_covd, $cg_covd, $cpg_covd ) = split( /\t/, $line ); 180 | $tot_covd += ( $at_covd + $cg_covd + $cpg_covd ); 181 | $tot_at_covd += $at_covd; 182 | $tot_cg_covg += $cg_covd; 183 | $tot_cpg_covd += $cpg_covd; 184 | } 185 | } 186 | $totRoiCovgFh->close; 187 | $totCovgFh->print( "$sample\t$tot_covd\t$tot_at_covd\t$tot_cg_covg\t$tot_cpg_covd\n" ); 188 | } 189 | $wigFh->close; 190 | $totCovgFh->close; 191 | 192 | return 1; 193 | } 194 | 195 | ## usage 196 | sub help_text { 197 | my $this = shift; 198 | return <{_MAF_FILE} = undef; 18 | $this->{_OUTPUT_MAF_FILE} = "window_based_maf"; 19 | 20 | $this->{_WINDOW_SIZE} = 1000000; 21 | 22 | bless $this, $class; 23 | $this->process(); 24 | 25 | return $this; 26 | } 27 | 28 | sub process { 29 | my $this = shift; 30 | my ( $help, $options ); 31 | unless( @ARGV ) { die $this->help_text(); } 32 | $options = GetOptions ( 33 | 34 | 'maf-file=s' => \$this->{_MAF_FILE}, 35 | 'output-maf-file=s' => \$this->{_OUTPUT_MAF_FILE}, 36 | 'window-size=i' => \$this->{_WINDOW_SIZE}, 37 | 38 | 'help' => \$help, 39 | ); 40 | if ( $help ) { print STDERR help_text(); exit 0; } 41 | unless( $options ) { die 
$this->help_text(); } 42 | #### processing #### 43 | # 44 | # Check on all the input data 45 | print STDERR "MAF file not found or is empty: $this->{_MAF_FILE}\n" unless( -s $this->{_MAF_FILE} ); 46 | return undef unless( -s $this->{_MAF_FILE} ); 47 | unless ( $this->{_WINDOW_SIZE} =~ /\d+/ ) { 48 | print STDERR "Window size format is not valid !\n"; 49 | return undef; 50 | }; 51 | # 52 | # 53 | my $outputfh = IO::File->new( $this->{_OUTPUT_MAF_FILE}, ">" ) or die "Couldn't open $this->{_OUTPUT_MAF_FILE}. $!"; 54 | my $maffh = IO::File->new( $this->{_MAF_FILE} ) or die "Couldn't open $this->{_MAF_FILE}. $!"; 55 | my @outputs; 56 | # perl -an -F'\t' -e '$F[0]="chr$F[4]:".(int(($F[5]-1)/1000000)*1000000+1) unless($F[0]=~m/Hugo_Symbol/); 57 | # print join("\t",@F)' merged.cds.filtered.maf > merged_modded_gene_names_based_1M_regions.maf 58 | # 59 | while ( my $line = $maffh->getline ) { 60 | my @t = split( /\t/, $line ); 61 | $t[0] = "CHR$t[4]:".(int(($t[5]-1)/$this->{_WINDOW_SIZE})*$this->{_WINDOW_SIZE}+1) unless($t[0]=~m/Hugo_Symbol/); 62 | push( @outputs, join("\t", @t) ); 63 | } 64 | $maffh->close(); 65 | $outputfh->print( @outputs ); 66 | $outputfh->close(); 67 | # 68 | return 1; 69 | } 70 | 71 | ## usage 72 | sub help_text { 73 | my $this = shift; 74 | return <{_ROI_FILE} = undef; 18 | $this->{_OUTPUT_ROI_FILE} = "window_based_roi"; 19 | 20 | $this->{_WINDOW_SIZE} = 1000000; 21 | 22 | bless $this, $class; 23 | $this->process(); 24 | 25 | return $this; 26 | } 27 | 28 | sub process { 29 | my $this = shift; 30 | my ( $help, $options ); 31 | unless( @ARGV ) { die $this->help_text(); } 32 | $options = GetOptions ( 33 | 34 | 'roi-file=s' => \$this->{_ROI_FILE}, 35 | 'output-roi-file=s' => \$this->{_OUTPUT_ROI_FILE}, 36 | 'window-size=i' => \$this->{_WINDOW_SIZE}, 37 | 38 | 'help' => \$help, 39 | ); 40 | if ( $help ) { print STDERR help_text(); exit 0; } 41 | unless( $options ) { die $this->help_text(); } 42 | #### processing #### 43 | # 44 | # Check on all the input 
data 45 | print STDERR "ROI file not found or is empty: $this->{_ROI_FILE}\n" unless( -s $this->{_ROI_FILE} ); 46 | return undef unless( -s $this->{_ROI_FILE} ); 47 | unless ( $this->{_WINDOW_SIZE} =~ /\d+/ ) { 48 | print STDERR "Window size format is not valid !\n"; 49 | return undef; 50 | }; 51 | # 52 | # 53 | my ( undef, $temp_win_bed ) = tempfile(); 54 | my $outputfh = IO::File->new( $temp_win_bed, ">" ) or die "Temporary file could not be created. $!"; 55 | my $roifh = IO::File->new( $this->{_ROI_FILE} ) or die "Couldn't open $this->{_ROI_FILE}. $!"; 56 | while ( my $line = $roifh->getline ) { 57 | chomp( $line ); 58 | my ( $chr, $start, $stop ) = split( /\t/, $line ); 59 | ( $stop >= $start ) or die "Stop locus is less than start in:\n$line\n\n"; 60 | my ( $start_window, $stop_window ) = (( int(($start-1)/$this->{_WINDOW_SIZE}) * $this->{_WINDOW_SIZE} + 1 ), ( int(($stop-1)/$this->{_WINDOW_SIZE}) * $this->{_WINDOW_SIZE} + 1 )); 61 | if ( $start_window == $stop_window ) { 62 | $outputfh->print( "$chr\t$start\t$stop\tCHR$chr:$start_window\n" ); 63 | } elsif ( $start_window < $stop_window ) { 64 | $outputfh->print( "$chr\t$start\t", $start_window + $this->{_WINDOW_SIZE} - 1, "\tCHR$chr:$start_window\n" ); 65 | $start_window += $this->{_WINDOW_SIZE}; 66 | while ( $start_window != $stop_window ) { 67 | $outputfh->print( "$chr\t$start_window\t", $start_window + $this->{_WINDOW_SIZE} - 1, "\tCHR$chr:$start_window\n" ); 68 | $start_window += $this->{_WINDOW_SIZE}; 69 | } 70 | $outputfh->print( "$chr\t$stop_window\t$stop\tCHR$chr:$stop_window\n" ); 71 | } else { die "Unhandled exception! \n"; } 72 | } 73 | $roifh->close(); 74 | $outputfh->close(); 75 | # 76 | my ( undef, $temp_win_sorted_bed ) = tempfile(); 77 | unless ( -e $temp_win_sorted_bed ) { die "Temporary file could not be created. $!" 
}; 78 | system( "joinx sort -i $temp_win_bed -o $temp_win_sorted_bed" ); 79 | # 80 | # write results 81 | # 82 | system( "joinx bed-merge -n -u -i $temp_win_sorted_bed -o $this->{_OUTPUT_ROI_FILE}" ); 83 | # 84 | # 85 | return 1; 86 | 87 | } 88 | 89 | ## usage 90 | sub help_text { 91 | my $this = shift; 92 | return <md 73 | read.table(y.file,na.strings = c("","NA"),sep="\t",header=T)->y 74 | #if (x.names!="*") x.names=strsplit(x.names,split="[|]")[[1]] 75 | 76 | # TODO: Here, it would be smart to check whether all model names (md[,2]) correspond to columns in y. 77 | # obscure errors result if people mistype or use remapped characters (:-<=>) 78 | 79 | 80 | 81 | if (x.file!="*") 82 | { 83 | read.table(x.file,na.strings = c("","NA"),sep="\t",header=T)->x 84 | xid=colnames(x)[1] 85 | xs=colnames(x)[-1] 86 | #if (x.names!="*") {x=x[,c(xid,x.names)];xs=colnames(x)[-1]} 87 | x.names=xs 88 | yid=colnames(y)[1] 89 | ysid = ! (colnames(y) %in% xs) 90 | y=y[,ysid] 91 | if (sum(ysid)==1) {y=data.frame(id=y);colnames(y)[1]=yid} 92 | #y=merge(y,x,by.x = xid, by.y = yid) 93 | y=merge(x,y,by.x = xid, by.y = yid) 94 | } 95 | if (is.null(x.names)) x.names=colnames(y)[-1]; 96 | 97 | ######### analysis ########## 98 | tt=NULL 99 | for (i in c(1:nrow(md))) # loop through all rows in model 100 | { 101 | # analysis_type clinical_data_trait_name variant/gene_name covariates memo 102 | ytype=md[i,1];yi=md[i,2];xs=md[i,3];covi=md[i,4];memo=md[i,5] 103 | 104 | if (!is.na(xs) & nchar(xs)>0) xs=strsplit(xs,split="[|]")[[1]] 105 | if (is.na(xs)[1]|nchar(xs)[1]==0|xs[1]=="*") xs=x.names 106 | if (length(covi)==0) covi=NA 107 | for (xi in xs) 108 | { 109 | # DEBUG 110 | #cat(paste(" Processing: yi =", yi, " xi =", xi, " covi =", covi, "\n") ) 111 | 112 | if (ytype=="Q") { 113 | test = "F" 114 | } else if (ytype=="B") { 115 | test = "Chisq" 116 | } else { 117 | stop("Unknown model ytype ", ytype) 118 | } 119 | glm = try(myglm(y,yi,xi,covi,ytype)) # MAW new 120 | if(class(glm)[1] == "try-error") { 
121 | # Type of error to catch here: 122 | # Error in family$linkfun(mustart) : 123 | # Argument mu must be a nonempty numeric vector 124 | # This occurs if too many NA's 125 | # This also occurs: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels 126 | # above happens when using covariate and e.g. disease BLCA has all NA even if other diseases do not 127 | cat(paste(" Error caught, continuing. yi =", yi, " xi =", xi, " covi =", covi, "\n") ) 128 | next 129 | } 130 | try(anova(glm,test=test))->fit 131 | coeff = coefficients(glm)[[xi]] # MAW new 132 | if (class(fit)[1]!="try-error") 133 | { 134 | fit=as.matrix(fit) 135 | # line below dies if mix Q and B ytypes in a single model file 136 | if (xi %in% rownames(fit)) tt=rbind(tt, cbind(yi,ytype,xi,as.data.frame(t(fit[xi,])),coeff,covi,memo)) 137 | } 138 | } 139 | } 140 | if (!is.null(tt)) { 141 | #"yi","ytype","xi","Df","Deviance","Resid. Df","Resid. Dev","F","Pr(>F)","covi","memo" 142 | if (ytype=="Q") colnames(tt) = c("y","y_type","x","degrees_freedom","deviance","residual_degrees_freedom","residual_deviance", 143 | "F_statistic","p-value","coefficient","covariants","memo"); 144 | if (ytype=="B") colnames(tt) = c("y","y_type","x","degrees_freedom","deviance","residual_degrees_freedom","residual_deviance", 145 | "p-value","coefficient","covariants","memo"); 146 | tt$FDR = p.adjust(tt[,"p-value"], method="fdr") # MAW new, calculates FDR based on the method from, 147 | # Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300. 
148 | write.table(tt,file=out.file,quote=F,sep="\t",row.names=F); 149 | } 150 | 151 | } else { 152 | 153 | # else, we process numerical or categorical clinical data correlation 154 | 155 | clinical_data = as.character(commandArgs()[5]); 156 | mutation_matrix = as.character(commandArgs()[6]); 157 | output_file = as.character(commandArgs()[7]); 158 | 159 | # FUNCTION finds the correlation between two variables 160 | cor2=function(ty,tx,method) 161 | { 162 | 163 | id=!is.na(ty) & !is.na(tx); # keep only pairs observed in both vectors 164 | ty=ty[id]; 165 | tx=tx[id]; 166 | 167 | if(method=="cor") 168 | { 169 | tst=cor.test(tx,ty); 170 | s=tst$est; 171 | p=tst$p.value; 172 | } 173 | 174 | if(method=="wilcox") #x must be (0,1) mutation data 175 | { 176 | tst=wilcox.test(x=ty[tx==0],y=ty[tx>=1]) 177 | s=tst$stat 178 | p=tst$p.value 179 | } 180 | 181 | if(method=="chisq") 182 | { 183 | tst=chisq.test(tx,ty); 184 | s=tst$stat; 185 | p=tst$p.value; 186 | } 187 | 188 | if(method=="fisher") 189 | { 190 | tst=fisher.test(tx,ty) 191 | s=tst$p.value 192 | p=tst$p.value 193 | } 194 | 195 | if(method=="anova") 196 | { 197 | tst=summary(aov(ty~tx,as.data.frame(cbind(tx,ty)))) 198 | s=tst[[1]]$F[1] 199 | p=tst[[1]]$Pr[1] 200 | } 201 | 202 | tt=c(p,s); 203 | tt; 204 | } 205 | # END cor2 206 | 207 | # FUNCTION runs correlation test on matrices of data 208 | cor2test =function(y,x=NULL,method="cor",cutoff=1,sep="\t",outf=NULL) 209 | { 210 | 211 | if (!is.null(x)) 212 | { 213 | 214 | if (length(x)==1) {read.table(x,header=T,sep=sep)->x;} 215 | if (length(y)==1) {read.table(y,header=T,sep=sep)->y;} 216 | colnames(y)[1]="id"; 217 | colnames(x)[1]="id"; 218 | tt=character(0); 219 | for (vi in colnames(x)[-1]) 220 | { 221 | for (vj in colnames(y)[-1]) 222 | { 223 | tx=x[,c("id",vi)]; 224 | tx=tx[!is.na(tx[,vi]),]; 225 | tx=tx[!duplicated(tx[,"id"]),]; 226 | ty=y[,c("id",vj)]; 227 | ty=ty[!is.na(ty[,vj]),]; 228 | ty=ty[!duplicated(ty[,"id"]),]; 229 | xy=merge(tx,ty,by.x="id",by.y="id"); 230 | tx=xy[,2]; 231 | ty=xy[,3]; 232 
| n=length(xy[,"id"]); 233 | rst=try(cor2(ty,tx,method)); 234 | if (class(rst)=="try-error") {p=NA;s=NA;} else {p=rst[1];s=rst[2];} 235 | t=c(vi,vj,method,n,s,p) 236 | 237 | tt=rbind(tt,t); 238 | } #end vj 239 | } #end vi 240 | 241 | rownames(tt)=NULL; 242 | colnames(tt)=c("x","y","method","n","s","p"); 243 | tt=as.data.frame(tt); 244 | tt[,"s"]=as.character(tt[,"s"]); 245 | tt[,"s"]=as.numeric(tt[,"s"]); 246 | tt[,"p"]=as.character(tt[,"p"]); 247 | tt[,"p"]=as.numeric(tt[,"p"]); 248 | fdr=p.adjust(tt[,"p"],method="fdr"); 249 | #bon=p.adjust(tt[,"p"],method="bon"); 250 | tt=cbind(tt,fdr); 251 | tt=tt[order(tt[,"p"]),]; 252 | } 253 | 254 | if (is.null(x)) 255 | { 256 | 257 | if (length(y)==1) {read.table(y,header=T,sep=sep)->y;} 258 | x=y; 259 | nxy=ncol(y)-1; 260 | colnames(y)[1]="id"; 261 | colnames(x)[1]="id" 262 | tt=character(0); 263 | for (i in c(1:(nxy-1))) 264 | { 265 | for (j in c((i+1):nxy)) 266 | { 267 | 268 | vi=colnames(x)[-1][i]; 269 | vj=colnames(y)[-1][j]; 270 | 271 | tx=x[,c("id",vi)]; 272 | tx=tx[!is.na(tx[,vi]),]; 273 | tx=tx[!duplicated(tx[,"id"]),] 274 | ty=y[,c("id",vj)]; 275 | ty=ty[!is.na(ty[,vj]),]; 276 | ty=ty[!duplicated(ty[,"id"]),]; 277 | xy=merge(tx,ty,by.x="id",by.y="id"); 278 | tx=xy[,2]; 279 | ty=xy[,3]; 280 | n=length(xy[,"id"]); 281 | rst=try(cor2(ty,tx,method)); 282 | if (class(rst)=="try-error") {p=NA;s=NA;} else {p=rst[1];s=rst[2];} 283 | t=c(vi,vj,method,n,s,p); 284 | tt=rbind(tt,t); 285 | } #end vj 286 | } #end vi 287 | 288 | rownames(tt)=NULL; 289 | colnames(tt)=c("x","y","method","n","s","p"); 290 | tt=as.data.frame(tt); 291 | tt[,"s"]=as.character(tt[,"s"]); 292 | tt[,"s"]=as.numeric(tt[,"s"]); 293 | tt[,"p"]=as.character(tt[,"p"]); 294 | tt[,"p"]=as.numeric(tt[,"p"]); 295 | fdr=p.adjust(tt[,"p"],method="fdr"); 296 | #bon=p.adjust(tt[,"p"],method="bon"); 297 | tt=cbind(tt,fdr); 298 | tt=tt[order(tt[,"p"]),]; 299 | } 300 | 301 | 302 | if (!is.null(outf)) 303 | { 304 | #colnames(tt)=c("x","y","method","n","s","p","fdr"); 305 
| #The amount of precision that R prints with is somehow machine dependent (or the R version?) 306 | tt[,"s"] = sapply(tt[,"s"], sprintf, fmt="%.4E"); 307 | tt[,"p"] = sapply(tt[,"p"], sprintf, fmt="%.4E"); 308 | tt[,"fdr"] = sapply(tt[,"fdr"], sprintf, fmt="%.2E"); 309 | #tt[,"bon"] = sapply(tt[,"bon"], sprintf, fmt="%.2E"); 310 | 311 | #The ordering is done after reformatting the precision; p is compared numerically because the formatted column holds strings 312 | tt=tt[order(tt[,"x"]),]; 313 | tt=tt[order(as.numeric(tt[,"p"])),]; 314 | #rename the column headers to something more pleasant 315 | colnames(tt)=c("Gene","ClinParam","Method","NumCases","Statistic","Pval","FDR"); 316 | write.table(tt,file=outf,quote=FALSE,row.names=FALSE,sep="\t"); 317 | } 318 | invisible(tt); 319 | } 320 | #END cor2test 321 | 322 | #run correlation test using function 323 | cor2test(y = clinical_data, x = mutation_matrix, method = method, outf = output_file); 324 | } 325 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Complicated.pm: -------------------------------------------------------------------------------- 1 | package TGI::MuSiC2::Complicated; 2 | 3 | use strict; 4 | use warnings FATAL => 'all'; 5 | 6 | sub print_stuff { 7 | print "Hi\n"; 8 | } 9 | 10 | 1; 11 | 12 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Dendrix.pm: -------------------------------------------------------------------------------- 1 | package TGI::MuSiC2::Dendrix; 2 | ## 3 | 4 | use warnings; 5 | use strict; 6 | 7 | use IO::File; 8 | use IPC::Open3; 9 | use Getopt::Long; 10 | # Using Dendrix.py from Dendrix v 0.3 11 | 12 | sub new { 13 | my $class = shift; 14 | my $this = {}; 15 | 16 | $this->{_MUTATIONS_FILE} = undef; 17 | $this->{_K} = 2; 18 | $this->{_MIN_FREQ_GENE} = 1; 19 | $this->{_NUMBER_ITERATIONS} = 1000000; 20 | $this->{_ANALYZED_GENES_FILE} = undef; 21 | $this->{_NUM_EXPER} = 1; 22 | $this->{_STEP_LENGTH} = 1000; 23 | 24 | bless $this, $class; 25 | $this->process(); 26 | 27 | 
return $this; 28 | } 29 | 30 | 31 | sub process { 32 | my $this = shift; 33 | my ( $help, $options ); 34 | unless( @ARGV ) { die $this->help_text(); } 35 | $options = GetOptions ( 36 | 'mutations-file=s' => \$this->{_MUTATIONS_FILE}, 37 | 'set-size=i' => \$this->{_K}, 38 | 'minimum-freq=i' => \$this->{_MIN_FREQ_GENE}, 39 | 'number-iterations|number-interations=i' => \$this->{_NUMBER_ITERATIONS}, # misspelled name kept as an alias 40 | 'analyzed-genes-file=s' => \$this->{_ANALYZED_GENES_FILE}, 41 | 'number-experiments=i' => \$this->{_NUM_EXPER}, 42 | 'step-length=i' => \$this->{_STEP_LENGTH}, 43 | 'help' => \$help, 44 | ); 45 | if ( $help ) { print STDERR $this->help_text(); exit 0; } 46 | unless( $options ) { die $this->help_text(); } 47 | #### processing #### 48 | # Check on all the input data before starting work 49 | print STDERR "Mutation matrix file not found or is empty: $this->{_MUTATIONS_FILE}\n" unless( -s $this->{_MUTATIONS_FILE} ); 50 | print STDERR "Analyzed gene list file not found or is empty: $this->{_ANALYZED_GENES_FILE}\n" unless( -s $this->{_ANALYZED_GENES_FILE} ); 51 | # Collect args, space-separated, in the order dendrix expects 52 | my $args = $this->{_MUTATIONS_FILE} . ' ' . $this->{_K} . ' ' . $this->{_MIN_FREQ_GENE} . ' '; 53 | $args = $args . $this->{_NUMBER_ITERATIONS} . ' ' . $this->{_ANALYZED_GENES_FILE} . ' '; 54 | $args = $args . $this->{_NUM_EXPER} . ' ' . $this->{_STEP_LENGTH}; 55 | my ( $input, $output, $err ); 56 | use Symbol 'gensym'; 57 | $err = gensym; 58 | my $pid = open3( $input, $output, $err, "dendrix $args" ); 59 | # puts output in the specified text files 60 | #mutations_file K minFreqGene number_iterations analyzed_genes_file num_exper step_length' ) 61 | waitpid( $pid, 0 ); 62 | my $child_exit_status = $?
>> 8; 63 | if ( $output ) { print "Output: \n"; while (<$output>) { print; }; print "\n"; }; 64 | if ( $err ) { print "Errors: \n"; while (<$err>) { print; }; print "\n"; }; 65 | 66 | return 1; 67 | 68 | } 69 | 70 | 71 | ## usage 72 | sub help_text { 73 | my $this = shift; 74 | return <{_MUTATIONS_FILE} = undef; 17 | $this->{_K} = 2; 18 | $this->{_MIN_FREQ_GENE} = 1; 19 | $this->{_NUMBER_ITERATIONS} = 1000000; 20 | $this->{_ANALYZED_GENES_FILE} = undef; 21 | $this->{_NUMBER_PERMUTATIONS} = 100; 22 | $this->{_VALUE_TESTED} = 48; 23 | $this->{_RANK} = 1; 24 | 25 | bless $this, $class; 26 | $this->process(); 27 | 28 | return $this; 29 | } 30 | 31 | 32 | sub process { 33 | my $this = shift; 34 | my ( $help, $options ); 35 | unless( @ARGV ) { die $this->help_text(); } 36 | $options = GetOptions ( 37 | 'mutations-file=s' => \$this->{_MUTATIONS_FILE}, 38 | 'set-size=i' => \$this->{_K}, 39 | 'minimum-freq=i' => \$this->{_MIN_FREQ_GENE}, 40 | 'number-iterations|number-interations=i' => \$this->{_NUMBER_ITERATIONS}, # misspelled name kept as an alias 41 | 'analyzed-genes-file=s' => \$this->{_ANALYZED_GENES_FILE}, 42 | 'number-permutations|number-permutationss=i' => \$this->{_NUMBER_PERMUTATIONS}, # misspelled name kept as an alias 43 | 'value-tested=i' => \$this->{_VALUE_TESTED}, 44 | 'rank=i' => \$this->{_RANK}, 45 | 'help' => \$help, 46 | ); 47 | if ( $help ) { print STDERR $this->help_text(); exit 0; } 48 | unless( $options ) { die $this->help_text(); } 49 | #### processing #### 50 | # Check on all the input data before starting work 51 | print STDERR "Mutation matrix file not found or is empty: $this->{_MUTATIONS_FILE}\n" unless( -s $this->{_MUTATIONS_FILE} ); 52 | print STDERR "Analyzed gene list file not found or is empty: $this->{_ANALYZED_GENES_FILE}\n" unless( -s $this->{_ANALYZED_GENES_FILE} ); 53 | # Collect args, space-separated, in the order permutationTestDendrix expects 54 | my $args = $this->{_MUTATIONS_FILE} . ' ' . $this->{_K} . ' ' . $this->{_MIN_FREQ_GENE} . ' '; 55 | $args = $args . $this->{_NUMBER_ITERATIONS} . ' ' . $this->{_ANALYZED_GENES_FILE} . ' '; 56 | $args = $args . $this->{_NUMBER_PERMUTATIONS} . ' ' .
$this->{_VALUE_TESTED} . ' ' . $this->{_RANK}; 57 | my ( $input, $output, $err ); 58 | use Symbol 'gensym'; 59 | $err = gensym; 60 | my $pid = open3( $input, $output, $err, "permutationTestDendrix $args" ); 61 | # puts output in the specified text files 62 | # python PermutationTestDendrix.py example/mutation_matrix 3 1 1000000 example/analyzed_genes 100 48 1 63 | waitpid( $pid, 0 ); 64 | my $child_exit_status = $? >> 8; 65 | if ( $output ) { print "Output: \n"; while (<$output>) { print; }; print "\n"; }; 66 | if ( $err ) { print "Errors: \n"; while (<$err>) { print; }; print "\n"; }; 67 | 68 | return 1; 69 | 70 | } 71 | 72 | 73 | ## usage 74 | sub help_text { 75 | my $this = shift; 76 | return < 'Genome::Model::Tools::Music::Base', 13 | has_input => [ 14 | maf_file => { 15 | is => 'Text', is_input => 1, 16 | doc => "List of mutations using TCGA MAF specification v2.3", 17 | }, 18 | output_file => { 19 | is => 'Text', is_output => 1, 20 | doc => "Output MAF file with an extra column that reports Pfam annotation domains", 21 | }, 22 | reference_build => { 23 | is => 'Text', 24 | example_values => ['Build37'], 25 | doc => "Options are 'Build36' or 'Build37'.
This parameter ensures appropriate annotation of domains", 26 | valid_values => ['Build36', 'Build37'], 27 | }, 28 | ], 29 | doc => "Add Pfam annotation to a MAF file", 30 | }; 31 | 32 | sub help_synopsis { 33 | return <maf_file; 70 | my $reference_build = $self->reference_build; 71 | my $output_file = $self->output_file; 72 | 73 | #parse the MAF file and output a new MAF with Pfam info appended 74 | my $maf_fh = IO::File->new( $maf_file ) or die "Couldn't open MAF file!\n"; 75 | my $out_fh = IO::File->new( $output_file, ">" )or die "Couldn't open $output_file for writing!\n"; 76 | while( my $line = $maf_fh->getline ) { 77 | 78 | chomp $line; 79 | my @cols = split( /\t/, $line ); 80 | my ( $chr, $start, $stop ) = @cols[4..6]; 81 | 82 | #print out any headers directly to the output file 83 | if( $line =~ m/^(#|Hugo_Symbol)/ ) { 84 | $out_fh->print( $line ); 85 | $out_fh->print(( $line =~ m/^Hugo_Symbol/ ) ? "\tPfam_Annotation_Domains\n" : "\n" ); 86 | next; 87 | } 88 | 89 | #construct a tabix command 90 | my $db_path = Genome::Sys->dbpath( 'pfam', 'latest' ) or die "Cannot find the pfam db path."; 91 | my $tabix = can_run( 'tabix' ) or die "Cannot find the tabix command. It can be obtained from http://sourceforge.net/projects/samtools/files/tabix"; 92 | my $tabix_cmd = "$tabix"; 93 | if( $reference_build eq 'Build36' ) { 94 | $tabix_cmd .= " $db_path/pfam.annotation.build36.gz $chr:$start-$stop - |"; 95 | } 96 | elsif( $reference_build eq 'Build37' ) { 97 | $tabix_cmd .= " $db_path/pfam.annotation.build37.gz $chr:$start-$stop - |"; 98 | } 99 | else { 100 | die "Please specify either 'Build36' or 'Build37' for the --reference-build parameter."; 101 | } 102 | 103 | #run tabix command 104 | my %domains; 105 | open( TABIX, $tabix_cmd ) or die "Cannot run 'tabix'. Please check it is in your PATH. It can be installed from the samtools project. 
$!"; 106 | while( my $tabline = <TABIX> ) { 107 | chomp $tabline; 108 | my ( undef, undef, undef, $csv_domains ) = split( /\t/, $tabline ); 109 | my @domains = split( /,/, $csv_domains ); 110 | for my $domain ( @domains ) { 111 | $domains{$domain}++; 112 | } 113 | } 114 | close(TABIX); 115 | 116 | #print output to new file 117 | my $all_domains = join( ",", sort keys %domains ); 118 | my $output_line = "$line\t"; 119 | unless( $all_domains eq "" ) { 120 | $output_line .= "$all_domains\n"; 121 | } 122 | else { 123 | $output_line .= "NA\n"; 124 | } 125 | $out_fh->print( $output_line ); 126 | } 127 | 128 | return( 1 ); 129 | } 130 | 131 | 1; 132 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Proximity.pm: -------------------------------------------------------------------------------- 1 | package Genome::Model::Tools::Music::Proximity; 2 | 3 | use warnings; 4 | use strict; 5 | use IO::File; 6 | 7 | our $VERSION = $Genome::Model::Tools::Music::VERSION; 8 | 9 | class Genome::Model::Tools::Music::Proximity { 10 | is => 'Command::V2', 11 | has_input => [ 12 | maf_file => { is => 'Text', doc => "List of mutations using TCGA MAF specifications v2.3" }, 13 | output_dir => { is => 'Text', doc => "Directory where output files will be written" }, 14 | max_proximity => { is => 'Text', doc => "Maximum allowed AA distance between 2 mutations", is_optional => 1, default => 7 }, 15 | skip_non_coding => { is => 'Boolean', doc => "Skip non-coding mutations from the provided MAF file", is_optional => 1, default => 1 }, 16 | skip_silent => { is => 'Boolean', doc => "Skip silent mutations from the provided MAF file", is_optional => 1, default => 1 }, 17 | ], 18 | has_optional_output => [ 19 | output_file => {is => 'Text', doc => "Path of the proximity report written to the output directory" }, 20 | ], 21 | doc => "Perform a proximity analysis on a list of mutations."
22 | }; 23 | 24 | sub help_detail { 25 | return <maf_file; 64 | my $output_dir = $self->output_dir; 65 | my $max_proximity = $self->max_proximity; 66 | my $skip_non_coding = $self->skip_non_coding; 67 | my $skip_silent = $self->skip_silent; 68 | $output_dir =~ s/(\/)+$//; # Remove trailing forward slashes if any 69 | 70 | # Check on all the input data before starting work 71 | print STDERR "MAF file not found or is empty: $maf_file\n" unless( -s $maf_file ); 72 | print STDERR "Output directory not found: $output_dir\n" unless( -e $output_dir ); 73 | return undef unless( -s $maf_file && -e $output_dir ); 74 | 75 | # Output of this script will be written to this location in the output directory 76 | my $out_file = "$output_dir/proximity_report"; 77 | $self->output_file($out_file); 78 | 79 | # Parse the header row in the MAF file 80 | my $maf_fh = IO::File->new( $maf_file ) or die "Couldn't open $maf_file. $!"; 81 | my $maf_header = $maf_fh->getline; 82 | $maf_header = $maf_fh->getline while( $maf_header =~ /^#/ ); # Skip commented lines 83 | chomp( $maf_header ); 84 | unless( $maf_header =~ /^Hugo_Symbol/ ) 85 | { 86 | print STDERR "Could not find column headers in $maf_file\n"; 87 | return undef; 88 | } 89 | 90 | # Check whether the required additional MAF columns were included in the MAF 91 | unless( $maf_header =~ m/c_position/ and $maf_header =~ m/amino_acid_change/ and $maf_header =~ m/transcript_name/ ) 92 | { 93 | print STDERR "Could not find required additional columns in $maf_file\n"; 94 | return undef; 95 | } 96 | 97 | # Find the indexes of all the MAF columns 98 | my $idx = 0; 99 | my %col_idx = map {($_, $idx++)} split( /\t/, $maf_header ); 100 | 101 | # A hash to store statuses, and a hash to store variants and AA positions 102 | my %status; 103 | my $status = \%status; 104 | my %aa_mutations; 105 | 106 | # Load relevant data from MAF into hash 107 | while( my $line = $maf_fh->getline ) 108 | { 109 | chomp $line; 110 | my @cols = split( /\t/, $line ); 
111 | 112 | # Fetch data from the generic MAF columns 113 | my ( $gene, $chr, $start, $stop, $mutation_class, $mutation_type, $ref_allele, $var_allele, $var2, $sample ) = 114 | ( $cols[0], $cols[4], $cols[5], $cols[6], $cols[8], $cols[9], $cols[10], $cols[11], $cols[12], $cols[15] ); 115 | $var_allele = $var2 if ( $var_allele eq $ref_allele ); # Different centers interpret the 2 variant columns differently 116 | 117 | # Fetch data from the required additional MAF columns 118 | my ( $c_position, $aa_change, $transcript ) = ( $cols[$col_idx{c_position}], $cols[$col_idx{amino_acid_change}], $cols[$col_idx{transcript_name}] ); 119 | 120 | # Create a key to uniquely identify each variant 121 | my $variant_key = join( "\t", $gene, $chr, $start, $stop, $ref_allele, $var_allele, $sample ); 122 | 123 | 124 | #check that the mutation class is acceptable 125 | if( $mutation_class !~ m/^(Missense_Mutation|Nonsense_Mutation|Nonstop_Mutation|Splice_Site|Translation_Start_Site|Frame_Shift_Del|Frame_Shift_Ins|In_Frame_Del|In_Frame_Ins|Silent|Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region|De_novo_Start_InFrame|De_novo_Start_OutOfFrame)$/ ) 126 | { 127 | print STDERR "Unrecognized Variant_Classification \"$mutation_class\" in MAF file for gene $gene\n"; 128 | print STDERR "Please use TCGA MAF Specification v2.3.\n"; 129 | return undef; 130 | } 131 | 132 | # If user wants, skip Silent mutations, or those in Introns, RNA, UTRs, Flanks, IGRs, or the ubiquitous Targeted_Region 133 | if(( $skip_non_coding && $mutation_class =~ m/^(Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region)$/ ) || 134 | ( $skip_silent && $mutation_class =~ m/^Silent$/ )) 135 | { 136 | print "Skipping $mutation_class mutation in gene $gene.\n"; 137 | $status{synonymous_mutations_skipped}++; 138 | next; 139 | } 140 | 141 | # Determine amino acid position and load into hash 142 | my @mutated_aa_positions = (); 143 | @mutated_aa_positions = $self->get_amino_acid_pos( $variant_key, $mutation_class, 
$c_position, $aa_change, $status ); 144 | 145 | # Record data in hash if mutated aa position found 146 | if( scalar( @mutated_aa_positions ) > 0 ) 147 | { 148 | push( @{$aa_mutations{$transcript}{$variant_key}{mut_AAs}}, @mutated_aa_positions ); 149 | } 150 | } 151 | $maf_fh->close; 152 | 153 | # Evaluate proximity of mutated amino acids for each transcript 154 | for my $transcript ( keys %aa_mutations ) 155 | { 156 | # For each variant hitting that transcript 157 | for my $variant ( keys %{$aa_mutations{$transcript}} ) 158 | { 159 | # Initialize the search 160 | my @affected_amino_acids = @{$aa_mutations{$transcript}{$variant}{mut_AAs}}; 161 | my $mutations_within_proximity = 0; # Variable for summing # of mutations within proximity 162 | my $min_proximity = $max_proximity + 1; # Current minimum proximity 163 | 164 | # For each OTHER variant hitting the transcript 165 | for my $other_variant ( keys %{$aa_mutations{$transcript}} ) 166 | { 167 | # Ignore the current mutation 168 | next if $variant eq $other_variant; 169 | # Get affected amino acids from OTHER variant 170 | my @other_affected_amino_acids = @{$aa_mutations{$transcript}{$other_variant}{mut_AAs}}; 171 | 172 | # Compare distances between amino acids 173 | my $found_close_one = 0; 174 | for my $other_variant_aa ( @other_affected_amino_acids ) 175 | { 176 | for my $variant_aa ( @affected_amino_acids ) 177 | { 178 | my $distance = abs($other_variant_aa - $variant_aa); 179 | # If distance is within range 180 | if( $distance <= $max_proximity ) 181 | { 182 | $found_close_one++; 183 | $min_proximity = $distance if $distance < $min_proximity; 184 | } 185 | } 186 | } 187 | # Note that this variant is within proximity if applicable 188 | $mutations_within_proximity++ if $found_close_one; 189 | } 190 | # Now, save results in hash if there are any 191 | if ($mutations_within_proximity) 192 | { 193 | $aa_mutations{$transcript}{$variant}{muts_within_range} = $mutations_within_proximity; 194 | 
$aa_mutations{$transcript}{$variant}{min_proximity} = $min_proximity; 195 | } 196 | } 197 | } 198 | 199 | # Print results 200 | my $out_fh = IO::File->new( $out_file, ">" ) or die "Couldn't open $out_file. $!"; 201 | $out_fh->print( "Mutations_Within_Proximity\tNearest_Mutation\tGene\tTranscript\tAffected_Amino_Acid(s)\tChr\tStart\tStop\tRef_Allele\tVar_Allele\tSample\n" ); 202 | for my $transcript ( keys %aa_mutations ) 203 | { 204 | for my $variant ( keys %{$aa_mutations{$transcript}} ) 205 | { 206 | if( exists $aa_mutations{$transcript}{$variant}{muts_within_range} ) 207 | { 208 | my ( $gene, $chr, $start, $stop, $ref_allele, $var_allele, $sample ) = split( /\t/, $variant ); 209 | my $affected_amino_acids = join( ",", sort { $a <=> $b } @{$aa_mutations{$transcript}{$variant}{mut_AAs}} ); 210 | my $line = join( "\t", $aa_mutations{$transcript}{$variant}{muts_within_range}, 211 | $aa_mutations{$transcript}{$variant}{min_proximity}, $gene, $transcript, 212 | $affected_amino_acids, $chr, $start, $stop, $ref_allele, $var_allele, $sample ); 213 | $out_fh->print( "$line\n" ); 214 | } 215 | } 216 | } 217 | $out_fh->close(); 218 | return 1; 219 | } 220 | 221 | ################################################################################ 222 | 223 | =head2 get_amino_acid_pos 224 | 225 | This subroutine deduces the amino acid position within the transcript using the c_position and amino_acid_change columns in the MAF.
226 | 227 | =cut 228 | 229 | ################################################################################ 230 | 231 | sub get_amino_acid_pos { 232 | # Parse arguments 233 | my $self = shift; 234 | my ( $variant_key, $mut_class, $c_position, $aa_change, $status ) = @_; 235 | 236 | # Initialize variables 237 | my $tx_start = my $tx_stop = 0; 238 | my $aa_position_start = my $aa_position_stop = 0; 239 | my $inferred_aa_start = my $inferred_aa_stop = 0; 240 | my $aa_pos = my $inferred_aa_pos = 0; 241 | 242 | # Amino acid position determination 243 | if( $aa_change && $aa_change ne "NULL" && substr( $aa_change, 0, 1 ) ne "e" ) 244 | { 245 | $aa_pos = $aa_change; 246 | $aa_pos =~ s/[^0-9]//g; 247 | } 248 | 249 | # Parse out c_position if applicable ## 250 | if( $c_position && $c_position ne "NULL" ) 251 | { 252 | # If multiple results, parse both ## 253 | if( $c_position =~ '_' && !( $mut_class =~ 'splice' )) 254 | { 255 | ($tx_start, $tx_stop) = split( /\_/, $c_position ); 256 | $tx_start =~ s/[^0-9]//g; 257 | $tx_stop =~ s/[^0-9]//g; 258 | 259 | if( $tx_stop < $tx_start ) 260 | { 261 | $inferred_aa_start = $tx_stop / 3; 262 | $inferred_aa_start = sprintf( "%d", $inferred_aa_start ) + 1 if( $tx_stop % 3 ) ; 263 | $inferred_aa_stop = $tx_start / 3; 264 | $inferred_aa_stop = sprintf( "%d", $inferred_aa_stop ) + 1 if( $tx_start % 3 ); 265 | } 266 | else 267 | { 268 | $inferred_aa_start = $tx_start / 3; 269 | $inferred_aa_start = sprintf( "%d", $inferred_aa_start ) + 1 if( $tx_start % 3 ); 270 | $inferred_aa_stop = $tx_stop / 3; 271 | $inferred_aa_stop = sprintf( "%d", $inferred_aa_stop ) + 1 if( $tx_stop % 3 ); 272 | } 273 | } 274 | else 275 | { 276 | my ( $tx_pos ) = split( /[\+\-\_]/, $c_position ); 277 | $tx_pos =~ s/[^0-9]//g; 278 | $tx_start = $tx_stop = $tx_pos; 279 | if($tx_pos) 280 | { 281 | $inferred_aa_pos = $tx_pos / 3; 282 | $inferred_aa_pos = sprintf( "%d", $inferred_aa_pos ) + 1 if( $tx_pos % 3 ); 283 | $inferred_aa_start = $inferred_aa_stop = 
$inferred_aa_pos; 284 | } 285 | } 286 | } 287 | 288 | # If we inferred aa start stop, proceed with it ## 289 | if( $inferred_aa_start && $inferred_aa_stop ) 290 | { 291 | $aa_position_start = $inferred_aa_start; 292 | $aa_position_stop = $inferred_aa_stop; 293 | $status->{aa_position_inferred}++; 294 | } 295 | # Otherwise if we inferred aa position ## 296 | elsif( $aa_pos ) 297 | { 298 | $aa_position_start = $aa_pos; 299 | $aa_position_stop = $aa_pos; 300 | $status->{c_position_not_available}++; 301 | } 302 | # Otherwise we were unable to infer the info ## 303 | else 304 | { 305 | $status->{aa_position_not_found}++; 306 | $self->status_message( "Amino acid position not found for variant: $variant_key" ); 307 | return; 308 | } 309 | 310 | # Proceed if we have aa_position_start and stop ## 311 | my %mutated_aa_positions; 312 | if( $aa_position_start && $aa_position_stop ) 313 | { 314 | for( my $this_aa_pos = $aa_position_start; $this_aa_pos <= $aa_position_stop; $this_aa_pos++ ) 315 | { 316 | $mutated_aa_positions{$this_aa_pos}++; 317 | } 318 | } 319 | 320 | my @mutated_aa_positions = keys %mutated_aa_positions; 321 | return @mutated_aa_positions; 322 | } 323 | 324 | 1; 325 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/ProximityWindow.pm: -------------------------------------------------------------------------------- 1 | package Genome::Model::Tools::Music::ProximityWindow; 2 | 3 | use warnings; 4 | use strict; 5 | use IO::File; 6 | # Binomial test 7 | use PDL::Stats::Basic; 8 | our $VERSION = $Genome::Model::Tools::Music::VERSION; 9 | 10 | class Genome::Model::Tools::Music::ProximityWindow { 11 | is => 'Command::V2', 12 | has_input => [ 13 | maf_file => { is => 'Text', doc => "List of mutations using TCGA MAF specifications v2.3" }, 14 | bed_file => { is => 'Text', doc => "Coding regions list with BED format "}, 15 | bam_list => { is => 'Text', doc => "Tab delimited list of BAM files [sample_name, normal_bam, 
tumor_bam]" }, 16 | output_file => { is => 'Text', doc => "File to which the tested windows and their p-values are written" }, 17 | bmr => { is => 'Number', doc => "Background mutation rate", is_optional => 1, default => '1e-6' }, 18 | window_size => { is => 'Integer', doc => "Fixed window size for sliding", is_optional => 1, default => '10' }, 19 | skip_silent => { is => 'Boolean', doc => "Skip silent mutations from the provided MAF file", is_optional => 1, default => 1 }, 20 | skip_non_coding => { is => 'Boolean', doc => "Skip non-coding mutations from the provided MAF file", is_optional => 1, default => 1 }, 21 | ], 22 | doc => "Perform a sliding window proximity analysis on a list of mutations." 23 | }; 24 | 25 | sub help_detail { 26 | return <maf_file; 58 | my $bed_file = $self->bed_file; 59 | my $bam_list = $self->bam_list; 60 | my $output_file = $self->output_file; 61 | my $bmr = $self->bmr; 62 | my $window_size = $self->window_size; 63 | my $skip_silent = $self->skip_silent; 64 | my $skip_non_coding = $self->skip_non_coding; 65 | # Check on all the input data before starting work 66 | print STDERR "MAF file not found or is empty: $maf_file\n" unless(-s $maf_file); 67 | print STDERR "BED file not found or is empty: $bed_file\n" unless(-s $bed_file); 68 | print STDERR "BAM list file not found or is empty: $bam_list\n" unless(-s $bam_list); 69 | return undef unless( -s $maf_file && -s $bed_file && -s $bam_list); 70 | # Output of this script will be written to $output_file 71 | my $outfh = IO::File->new; 72 | die "Could not create output file\n" unless($outfh->open(">$output_file")); 73 | my %genes = (); 74 | my @parse_bed; 75 | # Parse location information from .bed file 76 | my $bedFh = IO::File->new($bed_file) or die "Couldn't open $bed_file.
$!"; 77 | while (my $line = <$bedFh>) { 78 | my @t = $line =~ /^(\w+)\t(\d+)\t(\d+)\t(\S+)\n/; 79 | my ($chr, $start, $stop, $gene) = @t; 80 | if (defined($genes{$gene})) { push(@{$genes{$gene}}, "$chr\t$start\t$stop"); } 81 | else { 82 | my @t0; 83 | $genes{$gene} = \@t0; 84 | push(@{$genes{$gene}}, "$chr\t$start\t$stop"); 85 | } 86 | } 87 | # parse information from .maf file list 88 | my @t = keys %genes; 89 | foreach my $item (@t) { 90 | # each gene 91 | my $fl = shift(@{$genes{$item}}); 92 | my @t1 = $fl =~ /^(\w+)\t(\d+)\t(\d+)/; 93 | my ($chr, $start, $stop) = @t1; 94 | foreach my $item0 (@{$genes{$item}}) { 95 | my @t2 = $item0 =~ /^(\w+)\t(\d+)\t(\d+)/; 96 | my ($i_chr, $i_start, $i_stop) = @t2; 97 | next if (($i_chr eq $chr) and ($stop >= $i_stop)); 98 | if ($i_chr eq $chr) { 99 | if ($stop >= $i_start) { $stop = $i_stop; } 100 | else { 101 | push(@parse_bed, "$chr\t$start\t$stop\t$item"); 102 | $chr = $i_chr; 103 | $stop = $i_stop; 104 | $start = $i_start; 105 | } 106 | } 107 | else { 108 | push(@parse_bed, "$chr\t$start\t$stop\t$item"); 109 | $chr = $i_chr; 110 | $stop = $i_stop; 111 | $start = $i_start; 112 | } 113 | } 114 | push(@parse_bed, "$chr\t$start\t$stop\t$item"); 115 | } 116 | #reload hash table of genes 117 | %genes = (); 118 | foreach my $item (@parse_bed) { 119 | my @t = $item =~ /^(\w+)\t(\d+)\t(\d+)\t(\S+)/; 120 | my ($chr, $start, $stop, $gene) = @t; 121 | my $n_start = $start + 2; 122 | my $n_stop = $stop - 2; 123 | $genes{uc($gene)}{$chr}{$n_start}{$n_stop} = 1; 124 | } 125 | 126 | # Parse bam_list 127 | my %bam_samples = (); 128 | my $bamFh = IO::File->new($bam_list) or die "Couldn't open $bam_list. 
$!"; 129 | while (my $line = <$bamFh>) { 130 | $line =~ /^(\S+)\t/; 131 | $bam_samples{$1} = 1; 132 | } 133 | # sample number 134 | my $samples = keys %bam_samples; 135 | print $samples."\n"; 136 | 137 | # Parse mutation positions from the MAF file 138 | my %gene_locs = (); 139 | my %dele_locs = (); 140 | my $mafFh = IO::File->new($maf_file) or die "Couldn't open $maf_file. $!"; 141 | while (my $line = <$mafFh>) { 142 | next if ($line =~ /^#/ or $line =~ /^Hugo_Sym/); 143 | my @t = split(/\t/, $line); 144 | my ($gene, $chr, $start, $stop, $mutclass, $type) = @t[0,4,5,6,8,9]; 145 | # If user wants, skip Silent mutations, or those in Introns, RNA, UTRs, Flanks, IGRs, or the ubiquitous Targeted_Region 146 | if(($skip_non_coding && $mutclass =~ m/^(Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region)$/) || ($skip_silent && $mutclass =~ m/^Silent$/)){ 147 | print "Skipping $mutclass mutation in gene $gene.\n"; 148 | next; 149 | } 150 | foreach my $a ($start..$stop) { 151 | # handle deletions: each deleted base gets a weight of 1/length 152 | if ($type eq "DEL") { 153 | my $lan = $stop - $start + 1; 154 | my $w = 1/$lan; 155 | my $weight = sprintf "%0.1f", $w; 156 | $gene_locs{$gene}{$chr}{$a} += $weight; 157 | $dele_locs{$gene}{$chr}{$a} = 1; 158 | } 159 | else { $gene_locs{$gene}{$chr}{$a}++; } 160 | } 161 | } 162 | my $span = $window_size; 163 | # Windows container 164 | my @windows; 165 | foreach my $e (keys %gene_locs) { 166 | foreach my $f (keys %{$gene_locs{$e}}) { 167 | @t = keys %{$gene_locs{$e}{$f}}; 168 | @t = sort {$a <=> $b} @t; 169 | my %loc_count = (); 170 | # processing sliding part 171 | foreach my $c (@t) { $loc_count{$c} = $gene_locs{$e}{$f}{$c}; } 172 | my @paras = ($span, $e, $f); 173 | WindowSliding(\@t, \%loc_count, \@windows, \@paras); 174 | } 175 | } 176 | # Test windows 177 | my $window_num = scalar(@windows); 178 | print "Total windows: $window_num\n"; 179 | my %wins_h; 180 | # P-value calculation 181 | foreach my $a (@windows) { 182 | chop($a); 183 | @t = split(/\t/, $a); 184 | my
($gene, $chr, $start, $vars, $str, $sh_win, $sh_mr) = @t; 185 | my $sample_base = $span*$samples; 186 | my $pvalue = binomial_test($vars, $sample_base, $bmr); 187 | $wins_h{"$gene\t$chr\t$start\t$vars\t$str\t$sh_win\t$sh_mr"} = "$pvalue"; 188 | } 189 | my @s_keys = sort {$wins_h{$a} <=> $wins_h{$b}} keys %wins_h; 190 | foreach (@s_keys) { print $outfh "$_\t$wins_h{$_}\n"; } 191 | $outfh->close; 192 | print STDERR "Processing Done. \n"; 193 | } 194 | 195 | # Shrink window part 196 | ############################################################## 197 | ## subroutines 198 | 199 | # window sliding algorithm 200 | sub WindowSliding { 201 | my ($a_ref, $loc_count, $wins, $paras) = @_; 202 | my ($span, $gene, $chr) = @$paras; 203 | my $rspan = $span - 1; 204 | my @paras0 = ($rspan, $gene, $chr); 205 | my @container; 206 | my $item = shift(@$a_ref); 207 | push(@container, $item); 208 | foreach $a (@$a_ref) { 209 | if ($a <= ($item + $rspan)) { 210 | $item = $a; 211 | push(@container, $item); 212 | } 213 | else { 214 | #sliding 215 | &UnitSliding(\@container, $loc_count, $wins, \@paras0); 216 | @container = (); 217 | $item = $a; 218 | push(@container, $item); 219 | } 220 | } 221 | # last unit 222 | if (@container) { UnitSliding(\@container, $loc_count, $wins, \@paras0); } 223 | undef @container; 224 | } 225 | 226 | sub UnitSliding { 227 | my ($a_ref, $loc_count, $wins, $paras) = @_; 228 | my ($span,$gene, $chr) = @$paras; 229 | my $first = $a_ref->[0]; 230 | my $start = $first - $span; 231 | if ($start <= 0) { $start = 1; } 232 | my $stop = $a_ref->[-1] + $span; 233 | my $wstart = $start; 234 | my $wstop = $a_ref->[-1]; 235 | my $number = $stop - $start + 1; 236 | my @array = (0) x $number; 237 | foreach my $a (@$a_ref) { $array[$a - $start] = $loc_count->{$a}; } 238 | foreach my $b ($wstart..$wstop) { 239 | my $varcount = 0; 240 | my $windowstr = ""; 241 | my $astart = $b; 242 | my $astop = $astart + $span; 243 | foreach my $c ($astart..$astop) { 244 | $varcount += 
$array[$c - $start]; 245 | if ($array[$c - $start] == 0) { $windowstr .= 'o'; } 246 | else{ $windowstr .= 'x'; } 247 | } 248 | my $est_vars = int($varcount + 0.5); 249 | next unless($est_vars > 0); 250 | $varcount = $est_vars; 251 | $windowstr =~ /o*(\w*x)o*$/; 252 | my $shrink_win_len = length($1); 253 | my $tmp_mr = $varcount/$shrink_win_len; 254 | my $shrink_win_mr = sprintf "%0.4f", $tmp_mr; 255 | push(@$wins, "$gene\t$chr\t$b\t$varcount\t$windowstr\t$shrink_win_len\t$shrink_win_mr\n"); 256 | } 257 | undef @array; 258 | } 259 | 260 | 1; 261 | 262 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Smg.pm.R: -------------------------------------------------------------------------------- 1 | ############################################# 2 | ### Functions for testing significance of ### 3 | ### per-gene categorized mutation rates ### 4 | ############################################# 5 | 6 | # Fetch command line arguments 7 | args = commandArgs(); 8 | input_file = as.character(args[4]); 9 | output_file = as.character(args[5]); 10 | run_type = as.character(args[6]); 11 | processors = as.numeric(args[7]); 12 | skip_low_mr_genes = as.numeric(args[8]); 13 | 14 | # See if we have the necessary packages installed to run in parallel 15 | is.installed <- function( mypkg ) is.element( mypkg, installed.packages()[,1] ); 16 | parallel = FALSE; 17 | if( processors > 1 & is.installed( 'doMC' ) & is.installed( 'foreach' )) { 18 | parallel = TRUE; 19 | } 20 | 21 | # Generates a binomial distribution 22 | gethist <- function( xmax, n, p, ptype = "positive_log" ) { 23 | dbinom( 0:xmax, n, p ) -> ps; 24 | ps = ps[ps > 0]; 25 | lastp = 1 - sum( ps ); 26 | if( lastp > 0 ) ps = c( ps, lastp ); 27 | if( ptype == "positive_log" ) ps = -log( ps ); 28 | return( ps ); 29 | } 30 | 31 | # Splits data into bins 32 | binit <- function( x, hmax, bin, dropbin = T ) { 33 | bs = as.integer( x / bin ); 34 | bs[bs > hmax/bin] = hmax / bin; 35 | 
bs[is.na( bs )] = hmax / bin; 36 | tapply( exp(-x), as.factor( bs ), sum ) -> bs; 37 | bs = bs[bs>0]; 38 | bs = -log( bs ); 39 | if( dropbin ) bs = as.numeric( bs ); 40 | return( bs ); 41 | } 42 | 43 | # Convolutes two vectors into one 44 | convolute_b <- function( a, b ) { 45 | tt = NULL; 46 | for( j in b ) { tt = c( tt, ( a + j )); } 47 | return( tt ); 48 | } 49 | 50 | # Runs SMG test on a given gene with categorized mutation counts, coverage, and BMR 51 | mut_class_test <- function( x, xmax = 100, hmax = 25, bin = 0.001 ) { 52 | x = as.data.frame( x ); 53 | colnames( x ) = c( "Class", "Bps", "Muts", "BMR" ); 54 | x$p = NA; x$lh0 = NA; x$lh1 = NA; 55 | tot_bps = x[( x$Class == "Overall" ),]$Bps; 56 | tot_muts = x[( x$Class == "Overall" ),]$Muts; 57 | overall_bmr = x[( x$Class == "Overall" ),]$BMR; 58 | 59 | # Remove the row containing overall MR and BMR because we don't want it to be a tested category 60 | x = x[( x$Class != "Overall" ),]; 61 | 62 | # If user wants to skip testing genes with low MRs, measure the relevant MRs of this gene 63 | gene_mr = 0; indel_mr = 0; indel_bmr = 0; trunc_mr = 0; trunc_bmr = 0; 64 | if( skip_low_mr_genes == 1 ) { 65 | if( tot_bps > 0 ) { gene_mr = tot_muts / tot_bps; } 66 | if( x[( grep( "Indels", x$Class )),]$Bps > 0 ) { indel_mr = x[( grep( "Indels", x$Class )),]$Muts / x[( grep( "Indels", x$Class )),]$Bps; } 67 | indel_bmr = x[( grep( "Indels", x$Class )),]$BMR; 68 | if( nrow( x[( grep( "Truncations", x$Class )),] ) > 0 ) { 69 | if( x[( grep( "Truncations", x$Class )),]$Bps > 0 ) { trunc_mr = x[( grep( "Truncations", x$Class )),]$Muts / x[( grep( "Truncations", x$Class )),]$Bps; } 70 | trunc_bmr = x[( grep( "Truncations", x$Class )),]$BMR; 71 | } 72 | } 73 | 74 | # Set pvals of 1 for genes with zero mutations, zero covered bps, or zero overall BMR 75 | if( tot_muts <= 0 | tot_bps <= 0 | overall_bmr <= 0 ) { 76 | p.fisher = 1; p.lr = 1; p.convol = 1; qc = 1; 77 | } 78 | # If user wants to skip testing genes with low MRs, give 
them pvals of 1 79 | else if( skip_low_mr_genes == 1 & gene_mr < overall_bmr & indel_mr <= indel_bmr & trunc_mr <= trunc_bmr ) { 80 | p.fisher = 1; p.lr = 1; p.convol = 1; qc = 1; 81 | } 82 | else { 83 | # Skip testing mutation categories with 0 BMR, 0 #bps, or with #muts >= #bps 84 | x = x[( x$BMR > 0 & x$Bps > 0 & x$Bps > x$Muts ),]; 85 | rounded_mut_cnts = round(x$Muts); 86 | for( i in 1:nrow(x) ) { 87 | x$p[i] = binom.test( rounded_mut_cnts[i], x$Bps[i], x$BMR[i], alternative = "greater" )$p.value; 88 | x$lh0[i] = dbinom( rounded_mut_cnts[i], x$Bps[i], x$BMR[i], log = T ); 89 | x$lh1[i] = dbinom( rounded_mut_cnts[i], x$Bps[i], x$Muts[i] / x$Bps[i], log = T ); 90 | iBps = x$Bps[i]; iBMR = x$BMR[i]; 91 | gethist( xmax, iBps, iBMR, ptype = "positive_log" ) -> bi; 92 | binit( bi, hmax, bin ) -> bi; 93 | if( i == 1 ) { hist0 = bi; } 94 | if( i > 1 & i < nrow(x) ) { hist0 = convolute_b( hist0, bi ); binit( hist0, hmax, bin ) -> hist0; } 95 | if( i == nrow(x)) { hist0 = convolute_b( hist0, bi ); } 96 | } 97 | 98 | # Fisher combined p-value 99 | q = ( -2 ) * sum( log( x$p )); 100 | df = 2 * length( x$p ); 101 | p.fisher = 1 - pchisq( q, df ); 102 | 103 | # Likelihood ratio test 104 | q = 2 * ( sum( x$lh1 ) - sum( x$lh0 )); 105 | df = sum( x$lh1 != 0 ); 106 | if( df > 0 ) p.lr = 1 - pchisq( q, df ); 107 | if( df == 0 ) p.lr = 1; 108 | 109 | # Convolution test 110 | bx = -sum( x[,"lh0"] ); 111 | p.convol = sum( exp( -hist0[hist0>=bx] )); 112 | qc = sum( exp( -hist0 )); 113 | } 114 | 115 | # Return results 116 | rst = list( x = cbind( x, tot_muts, p.fisher, p.lr, p.convol, qc )); 117 | return( rst ); 118 | } 119 | 120 | dotest <- function( idx, mut, zgenes ) { 121 | 122 | # step = round( length( zgenes ) / processors ); 123 | # start = step * ( idx - 1 ) + 1; 124 | # stop = step * idx; 125 | # if( idx == processors ) { stop = length( zgenes ); } 126 | 127 | ## to balance the load, 128 | # scatter genes alternately across processors instead of 129 | # scattering them by segment 130 | 
# 131 | start = 1 132 | stop = length( zgenes ) 133 | index = start; idxx = idx - 1 134 | tt = NULL; 135 | for( Gene in zgenes[start:stop] ) { 136 | if ( index %% processors != idxx ) { index = index + 1; next } 137 | mutgi = mut[mut$Gene==Gene,]; 138 | mut_class_test( mutgi[,2:5], hmax = 25, bin = 0.001 ) -> z; 139 | tt = rbind( tt, cbind( Gene, unique( z$x[,(9:11)] ))); 140 | index = index + 1; 141 | } 142 | return( tt ); 143 | } 144 | 145 | combineresults <- function( a, b ) { 146 | return( rbind( a, b )); 147 | } 148 | 149 | smg_test <- function( gene_mr_file, pval_file ) { 150 | read.delim( gene_mr_file ) -> mut; 151 | colnames( mut ) = c( "Gene", "Class", "Bases", "Mutations", "BMR" ); 152 | mut$BMR = as.numeric( as.character( mut$BMR )); 153 | tt = NULL; 154 | 155 | # Run in parallel if we have the needed packages, or fall back to the old way 156 | if( parallel ) { 157 | library( 'doMC' ); 158 | library( 'foreach' ); 159 | registerDoMC(); 160 | cat( "Parallel backend installed - splitting across", processors, "cores\n" ); 161 | 162 | options( cores = processors ); 163 | mcoptions <- list( preschedule = TRUE ); 164 | zgenes = unique( as.character( mut$Gene )); 165 | tt = foreach( idx = 1:processors, .combine="combineresults", .options.multicore = mcoptions ) %dopar% { 166 | dotest( idx, mut, zgenes ); 167 | } 168 | write.table( tt, file = pval_file, quote = FALSE, row.names = F, sep = "\t" ); 169 | } 170 | else { 171 | for( Gene in unique( as.character( mut$Gene ))) { 172 | mutgi = mut[mut$Gene==Gene,]; 173 | mut_class_test( mutgi[,2:5], hmax = 25, bin = 0.001 ) -> z; 174 | tt = rbind( tt, cbind( Gene, unique( z$x[,(9:11)] ))); 175 | } 176 | write.table( tt, file = pval_file, quote = FALSE, row.names = F, sep = "\t" ); 177 | } 178 | } 179 | 180 | smg_fdr <- function( pval_file, fdr_file ) { 181 | read.table( pval_file, header = T, sep = "\t" ) -> x; 182 | 183 | #Calculate FDR measure and write FDR output 184 | p.adjust( x[,2], method="BH" ) -> fdr.fisher; 185 
| p.adjust( x[,3], method="BH" ) -> fdr.lr; 186 | p.adjust( x[,4], method="BH" ) -> fdr.convol; 187 | x = cbind( x, fdr.fisher, fdr.lr, fdr.convol ); 188 | #Rank SMGs starting with lowest convolution test FDR, and then by Likelihood Ratio FDR 189 | x = x[order( fdr.convol, fdr.lr ),]; 190 | write.table( x, file = fdr_file, quote = FALSE, row.names = F, sep = "\t" ); 191 | } 192 | 193 | # Figure out which function needs to be invoked and call it 194 | if( run_type == "smg_test" ) { smg_test( input_file, output_file ); } 195 | if( run_type == "calc_fdr" ) { smg_fdr( input_file, output_file ); } 196 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Smg.pm.qqplot.R: -------------------------------------------------------------------------------- 1 | #### 2 | # Generate qq plot for SMG results 3 | # 4 | ### 5 | # for example 6 | #R --slave --args < qqplot.R music2_smg_test_detailed out.pdf 7 | # 8 | # Fetch command line arguments 9 | args = commandArgs(); 10 | input = as.character(args[4]) 11 | output = as.character(args[5]) 12 | 13 | pdf( output ) 14 | # plot 1: 15 | #read.table( input, header = TRUE )[,11]->p 16 | read.table( input, header = F )[,9]->p 17 | p = p[p>0] 18 | p = p[p<1] 19 | p = p[!is.na(p)] 20 | OBS = sort(-log10(p)) 21 | EXP = sort(-log10(1:length(p)/length(p))) 22 | plot(EXP, OBS, col="red", pch=20); abline(a=0,b=1, col="lightgray", lty=1, lwd=2) 23 | title("SMG test qq-plot") 24 | 25 | dev.off() 26 | 27 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Smg.pm.qqplot.correct.R: -------------------------------------------------------------------------------- 1 | #### 2 | # Generate qq plot for SMG results 3 | # 4 | ### 5 | # for example 6 | #R --slave --args < qqplot.R music2_smg_test_detailed out.pdf 7 | # 8 | # Fetch command line arguments 9 | args = commandArgs(); 10 | input = as.character(args[4]) 11 | output = as.character(args[5]) 12 | 13 | 
#input="smgs_detailed" 14 | #output="qqplot.pdf" 15 | 16 | read.table(input,header=T )->z 17 | 18 | gc=function(p) 19 | { 20 | p=1-p 21 | #pm=median(p) 22 | pm=median(p[p>0 & p<1]) 23 | lambda=qchisq(pm,1)/qchisq(0.5,1) 24 | x2=qchisq(p,1)/lambda 25 | p=pchisq(x2,1) 26 | p=1-p 27 | p 28 | } 29 | 30 | z$P_CT_corrected=gc(z[,9]) 31 | z$FDR_CT_corrected=p.adjust(z$P_CT_corrected,method="fdr") 32 | 33 | pdf(output,10,7 ) 34 | 35 | par(mfrow=c(1,2)) 36 | 37 | # plot 1: uncorrected 38 | p=z[,9] 39 | p = p[p>0] 40 | p = p[p<1] 41 | p = p[!is.na(p)] 42 | OBS = sort(-log10(p)) 43 | EXP = sort(-log10(1:length(p)/length(p))) 44 | plot(EXP, OBS, col="red", pch=20); abline(a=0,b=1, col="lightgray", lty=1, lwd=2) 45 | title("SMG test qq-plot") 46 | 47 | 48 | # plot 2: GC-corrected 49 | p=z$P_CT_corrected 50 | p = p[p>0] 51 | p = p[p<1] 52 | p = p[!is.na(p)] 53 | OBS = sort(-log10(p)) 54 | EXP = sort(-log10(1:length(p)/length(p))) 55 | plot(EXP, OBS, col="red", pch=20); abline(a=0,b=1, col="lightgray", lty=1, lwd=2) 56 | title("SMG test qq-plot(GC corrected)") 57 | 58 | dev.off() 59 | 60 | #write.csv(z,paste(input,"corrected",sep="."),row.names=F,quote=F) 61 | write.csv(z, file="smgs_detailed.corrected", row.names=F, quote=F) 62 | 63 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/Survival.pm: -------------------------------------------------------------------------------- 1 | package TGI::MuSiC2::Survival; 2 | 3 | use warnings; 4 | use strict; 5 | 6 | use Carp; 7 | use IO::File; 8 | use POSIX qw( WIFEXITED ); 9 | use File::Temp qw/ tempfile /; 10 | use Getopt::Long; 11 | 12 | our $VERSION = 'v0.1'; 13 | 14 | sub new { 15 | my $class = shift; 16 | my $this = {}; 17 | 18 | $this->{'bam_list'} = undef; 19 | $this->{'output_dir'} = undef; 20 | $this->{'maf_file'} = undef; 21 | $this->{'numeric_clinical_data_file'} = undef; 22 | $this->{'categorical_clinical_data_file'} = undef; 23 | $this->{'glm_clinical_data_file'} = undef; 24 | 
$this->{'genetic_data_type'} = 'gene'; 25 | $this->{'phenotypes_to_include'} = undef; 26 | $this->{'legend_placement'} = 'bottomleft'; 27 | $this->{'skip_non_coding'} = 1; 28 | $this->{'noskip_non_coding'} = 0; 29 | $this->{'skip_silent'} = 1; 30 | $this->{'noskip_silent'} = 0; 31 | 32 | bless $this, $class; 33 | $this->process(); 34 | 35 | return $this; 36 | } 37 | 38 | sub process { 39 | 40 | my $this = shift; 41 | my ( $help, $options ); 42 | unless( @ARGV ) { die $this->help_text(); } 43 | $options = GetOptions ( 44 | 'bam-list=s' => \$this->{'bam_list'}, 45 | 'output-dir=s' => \$this->{'output_dir'}, 46 | 'maf-file=s' => \$this->{'maf_file'}, 47 | 'numeric-clinical-data-file=s' => \$this->{'numeric_clinical_data_file'}, 48 | 'categorical-clinical-data-file=s' => \$this->{'categorical_clinical_data_file'}, 49 | 'glm-clinical-data-file=s' => \$this->{'glm_clinical_data_file'}, 50 | 'genetic-data-type=s' => \$this->{'genetic_data_type'}, 51 | 'phenotypes-to-include=s' => \$this->{'phenotypes_to_include'}, 52 | 'legend-placement=s' => \$this->{'legend_placement'}, 53 | 'skip-non-coding' => \$this->{'skip_non_coding'}, 54 | 'noskip-non-coding' => \$this->{'noskip_non_coding'}, 55 | 'skip-silent' => \$this->{'skip_silent'}, 56 | 'noskip-silent' => \$this->{'noskip_silent'}, 57 | 58 | 'help' => \$help, 59 | ); 60 | if ( $help ) { print STDERR $this->help_text(); exit 0; } 61 | unless( $options ) { die $this->help_text(); } 62 | #### processing #### 63 | # handle phenotype inclusions 64 | my ( @phenotypes_to_include, @clinical_phenotypes_to_include, @mutated_genes_to_include, ); 65 | if ($this->{'phenotypes_to_include'}) { @phenotypes_to_include = split /,/, $this->{'phenotypes_to_include'} } 66 | # check genetic data type 67 | unless ($this->{'genetic_data_type'} =~ /^(gene|variant)$/i) { 68 | warn ("Please enter either \"gene\" or \"variant\" for the --genetic-data-type parameter."); 69 | return; 70 | } 71 | # load clinical data and analysis types 72 | my %clinical_data; 73 | if 
($this->{'numeric_clinical_data_file'}) { $clinical_data{'numeric'} = $this->{'numeric_clinical_data_file'} } 74 | if ($this->{'categorical_clinical_data_file'}) { $clinical_data{'categ'} = $this->{'categorical_clinical_data_file'} } 75 | if ($this->{'glm_clinical_data_file'}) { $clinical_data{'glm'} = $this->{'glm_clinical_data_file'} } 76 | # create array of all sample names possibly included from clinical data and MAF 77 | my $sampleFh = IO::File->new( $this->{'bam_list'} ) or die "Couldn't open $this->{'bam_list'}. $!\n"; 78 | my @all_sample_names = map{ unless(/^#/){ chomp; (split /\t/) } } $sampleFh->getlines; 79 | $sampleFh->close; 80 | # loop through clinical data files and assemble survival data hash (vital_status and days_to_last_followup required); 81 | my (%survival_data, $vital_status_flag, $days_to_last_follow_flag, ); 82 | $vital_status_flag = $days_to_last_follow_flag = 0; 83 | map { 84 | my $clin_fh = new IO::File $clinical_data{$_}, "r"; 85 | unless ($clin_fh) { warn( "Failed to open $clinical_data{$_} for reading: $!" ); return; } 86 | #initiate variables to hold column info 87 | my ( %phenotypes_to_print, $vital_status_col, $days_to_last_follow_col, $i, ); 88 | $vital_status_col = $days_to_last_follow_col = $i = 0; 89 | #parse header and record column locations for needed data 90 | chomp( my $header_line = $clin_fh->getline ); my @t_array = split /\t/, $header_line; 91 | shift( @t_array ); 92 | map { 93 | $i++; my $col = $_; 94 | if (/vital_status|vitalstatus/i) { $vital_status_col = $i; $vital_status_flag++; } 95 | if (/days_to_last_(follow_up|followup)|daystolastfollowup/i) { $days_to_last_follow_col = $i; $days_to_last_follow_flag++; } 96 | if (scalar grep { $col =~ /^\Q$_\E$/i } @phenotypes_to_include ) { $phenotypes_to_print{$col} = $i; } 97 | } @t_array; 98 | while ( my $line = $clin_fh->getline ) { 99 | chomp $line; 100 | my @fields = split /\t/,$line; 101 | my $sample = $fields[0]; 102 | unless (scalar grep { m/^$sample$/ } @all_sample_names) { warn( "Skipping sample $sample. 
(Sample is not in --bam-list)." ); next; } 103 | if ( $vital_status_col ) { 104 | my $vital_status; 105 | if ($fields[$vital_status_col] =~ /^(0|living)$/i) { $vital_status = 0; } 106 | elsif ($fields[$vital_status_col] =~ /^(1|deceased)$/i) { $vital_status = 1; } 107 | else { $vital_status = "NA"; } 108 | $survival_data{$sample}{'vital_status'} = $vital_status; 109 | } 110 | if ($days_to_last_follow_col) { $survival_data{$sample}{'days'} = $fields[$days_to_last_follow_col]; } 111 | for my $pheno (keys %phenotypes_to_print) { $survival_data{$sample}{$pheno} = $fields[$phenotypes_to_print{$pheno}]; } 112 | } 113 | $clin_fh->close; 114 | # record phenotypes included from clinical data 115 | push @clinical_phenotypes_to_include, keys %phenotypes_to_print; 116 | } keys %clinical_data; 117 | # check for necessary header fields 118 | unless ($vital_status_flag) { warn( 'Clinical data does not seem to contain a column labeled "vital_status".' ); return; } 119 | unless ($days_to_last_follow_flag) { warn( 'Clinical data does not seem to contain a column labeled "days_to_last_followup".' ); return; } 120 | # create temporary files for R command 121 | # 122 | my ( undef, $survival_data_file ) = tempfile(); 123 | my ( undef, $mutation_matrix ) = tempfile(); 124 | my $surv_fh = new IO::File $survival_data_file, "w" or die "Couldn't open survival data filehandle."; 125 | print $surv_fh join( "\t","Sample","Days_To_Last_Followup","Vital_Status" ); 126 | if (@clinical_phenotypes_to_include) { print $surv_fh "\t" . 
join("\t", @clinical_phenotypes_to_include); } 127 | print $surv_fh "\n"; 128 | map { 129 | my $sample = $_; 130 | unless (exists $survival_data{$sample}{'days'}) { $survival_data{$sample}{'days'} = "NA"; } 131 | unless (exists $survival_data{$sample}{'vital_status'}) { $survival_data{$sample}{'vital_status'} = "NA"; } 132 | print $surv_fh join("\t",$sample,$survival_data{$sample}{'days'},$survival_data{$sample}{'vital_status'}); 133 | map { 134 | my $pheno = $_; 135 | unless (exists $survival_data{$sample}{$pheno}) { $survival_data{$sample}{$pheno} = "NA"; } 136 | print $surv_fh "\t" . $survival_data{$sample}{$pheno}; 137 | } @clinical_phenotypes_to_include; 138 | print $surv_fh "\n"; 139 | } keys %survival_data; 140 | $surv_fh->close; 141 | 142 | # find if any of the "phenotypes_to_include" are genes, and if so, limit the MAF mutation matrix to those genes 143 | my %clinical_pheno_to_include; 144 | @clinical_pheno_to_include{ @clinical_phenotypes_to_include } = (); 145 | map { push @mutated_genes_to_include, $_ unless exists $clinical_pheno_to_include{$_}; } @phenotypes_to_include; 146 | my $mutated_genes_to_include = \@mutated_genes_to_include; 147 | # create mutation matrix file 148 | if ( $this->{'genetic_data_type'} =~ /^gene$/i ) { 149 | $this->create_sample_gene_matrix_gene( $mutation_matrix, $mutated_genes_to_include, @all_sample_names ); 150 | } elsif ( $this->{'genetic_data_type'} =~ /^variant$/i ) { 151 | $this->create_sample_gene_matrix_variant( $mutation_matrix, $mutated_genes_to_include, @all_sample_names ); 152 | } else { 153 | warn( "Please enter either \"gene\" or \"variant\" for the --genetic-data-type parameter." ); 154 | return; 155 | } 156 | # check and prepare output directory 157 | my $output_dir = $this->{'output_dir'} . "/"; 158 | unless (-e $output_dir) { 159 | warn( "Creating output directory: $output_dir..." ); 160 | unless(mkdir $output_dir) { warn( "Failed to create output directory: $!" 
); return; } 161 | } 162 | # set up R command 163 | my $R_cmd = "R --slave --args < " . __FILE__ . ".R " . join( " ", $survival_data_file, $mutation_matrix, $this->{'legend_placement'}, $output_dir ); 164 | print "R_cmd:\n$R_cmd\n"; 165 | #run R command 166 | WIFEXITED( system $R_cmd ) or croak "Couldn't run: $R_cmd ($?)"; 167 | 168 | return(1); 169 | } 170 | 171 | 172 | sub create_sample_gene_matrix_gene { 173 | my ( $this, $mutation_matrix, $mutated_genes_to_include, @all_sample_names ) = @_; 174 | #create hash of mutations from the MAF file 175 | my ( %mutations, %all_genes ); 176 | #parse the MAF file and fill up the mutation status hashes 177 | my $maf_fh = IO::File->new( $this->{'maf_file'} ) or die "Couldn't open MAF file!\n"; 178 | while ( my $line = $maf_fh->getline ) { 179 | next if ( $line =~ m/^(#|Hugo_Symbol)/ ); 180 | chomp $line; my @cols = split( /\t/, $line ); 181 | my ( $gene, $mutation_class, $sample ) = @cols[0,8,15]; 182 | #check that the mutation class is valid 183 | if ($mutation_class !~ m/^(Missense_Mutation|Nonsense_Mutation|Nonstop_Mutation|Splice_Site|Translation_Start_Site|Frame_Shift_Del|Frame_Shift_Ins|In_Frame_Del|In_Frame_Ins|Silent|Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region|De_novo_Start_InFrame|De_novo_Start_OutOfFrame)$/ ) { 184 | warn( "Unrecognized Variant_Classification \"$mutation_class\" in MAF file for gene $gene\nPlease use TCGA MAF v2.3.\n" ); 185 | return; 186 | } 187 | #check to see if this gene is on the list (if there is a list at all) 188 | if ( defined($mutated_genes_to_include) and @{$mutated_genes_to_include} ) { next unless( scalar grep { m/^$gene$/ } @{$mutated_genes_to_include} ) } 189 | # If user wants, skip Silent mutations, or those in Introns, RNA, UTRs, Flanks, IGRs, or the ubiquitous Targeted_Region 190 | if ( ($this->{'skip_non_coding'} && $mutation_class =~ m/^(Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region)$/) || ($this->{'skip_silent'} && $mutation_class =~ m/^Silent$/) ) { 
191 | print "Skipping $mutation_class mutation in gene $gene.\n"; 192 | next; 193 | } 194 | $all_genes{$gene}++; 195 | $mutations{$sample}{$gene}++; 196 | } 197 | $maf_fh->close; 198 | #sort @all_genes for consistency in header and loops 199 | my @all_genes = sort keys %all_genes; 200 | #write the input matrix for R code to a temp file 201 | my $matrix_fh = new IO::File $mutation_matrix,"w" or die "Failed to create matrix file $mutation_matrix!: $!"; 202 | #print input matrix file header 203 | my $header = join( "\t", "Sample", @all_genes ); 204 | $matrix_fh->print( "$header\n" ); 205 | #print mutation relation input matrix 206 | 207 | #print mutation relation input matrix 208 | for my $sample ( sort @all_sample_names ) { 209 | $matrix_fh->print( $sample ); 210 | for my $gene ( @all_genes ) { 211 | if( exists $mutations{$sample}{$gene} ) { 212 | $matrix_fh->print( "\tMutated" ); 213 | } else { 214 | $matrix_fh->print( "\tWildtype" ); 215 | } 216 | } 217 | $matrix_fh->print( "\n" ); 218 | } 219 | 220 | } 221 | 222 | sub create_sample_gene_matrix_variant { 223 | 224 | my ( $this, $mutation_matrix, $mutated_genes_to_include, @all_sample_names ) = @_; 225 | #create hash of mutations from the MAF file 226 | my ( %variants_hash, %all_variants ); 227 | #parse the MAF file and fill up the mutation status hashes 228 | my $maf_fh = IO::File->new( $this->{'maf_file'} ) or die "Couldn't open MAF file!\n"; 229 | while( my $line = $maf_fh->getline ) { 230 | next if ( $line =~ m/^(#|Hugo_Symbol)/ ); 231 | chomp $line; 232 | my @cols = split( /\t/, $line ); 233 | my ( $gene, $chr, $start, $stop, $mutation_class, $mutation_type, $ref, $var1, $var2, $sample ) = @cols[0,4..6,8..12,15]; 234 | #check that the mutation class is valid 235 | if ( $mutation_class !~ 
m/^(Missense_Mutation|Nonsense_Mutation|Nonstop_Mutation|Splice_Site|Translation_Start_Site|Frame_Shift_Del|Frame_Shift_Ins|In_Frame_Del|In_Frame_Ins|Silent|Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region|De_novo_Start_InFrame|De_novo_Start_OutOfFrame)$/ ) { 236 | warn( "Unrecognized Variant_Classification \"$mutation_class\" in MAF file for gene $gene\nPlease use TCGA MAF v2.3.\n" ); 237 | return; 238 | } 239 | #check to see if this gene is on the list (if there is a list at all) 240 | if ( @{$mutated_genes_to_include} ) { next unless (scalar grep { m/^$gene$/ } @{$mutated_genes_to_include}); } 241 | # If user wants, skip Silent mutations, or those in Introns, RNA, UTRs, Flanks, IGRs, or the ubiquitous Targeted_Region 242 | if (( $this->{'skip_non_coding'} && $mutation_class =~ m/^(Intron|RNA|3'Flank|3'UTR|5'Flank|5'UTR|IGR|Targeted_Region)$/ ) || 243 | ( $this->{'skip_silent'} && $mutation_class =~ m/^Silent$/ )) { 244 | print "Skipping $mutation_class mutation in gene $gene.\n"; 245 | next; 246 | } 247 | my ( $var, $variant_name, ); 248 | if ( $ref eq $var1 ) { 249 | $var = $var2; 250 | $variant_name = $gene."_".$chr."_".$start."_".$stop."_".$ref."_".$var; 251 | $variants_hash{$sample}{$variant_name}++; 252 | $all_variants{$variant_name}++; 253 | } elsif ( $ref eq $var2 ) { 254 | $var = $var1; 255 | $variant_name = $gene."_".$chr."_".$start."_".$stop."_".$ref."_".$var; 256 | $variants_hash{$sample}{$variant_name}++; 257 | $all_variants{$variant_name}++; 258 | } elsif ( $ref ne $var1 && $ref ne $var2 ) { 259 | $var = $var1; 260 | $variant_name = $gene."_".$chr."_".$start."_".$stop."_".$ref."_".$var; 261 | $variants_hash{$sample}{$variant_name}++; 262 | $all_variants{$variant_name}++; 263 | $var = $var2; 264 | $variant_name = $gene."_".$chr."_".$start."_".$stop."_".$ref."_".$var; 265 | $variants_hash{$sample}{$variant_name}++; 266 | $all_variants{$variant_name}++; 267 | } 268 | } 269 | $maf_fh->close; 270 | #sort variants for consistency 271 | my @variant_names 
= sort keys %all_variants; 272 | #write the input matrix for R code to a file 273 | my $matrix_fh = new IO::File $mutation_matrix,"w" or die "Failed to create matrix file $mutation_matrix!: $!"; 274 | #print input matrix file header 275 | my $header = join( "\t", "Sample", @variant_names ); 276 | $matrix_fh->print("$header\n"); 277 | #print mutation relation input matrix 278 | for my $sample (sort @all_sample_names) { 279 | $matrix_fh->print( $sample ); 280 | for my $variant (@variant_names) { 281 | if (exists $variants_hash{$sample}{$variant}) { 282 | $matrix_fh->print("\t$variants_hash{$sample}{$variant}"); 283 | } else { 284 | $matrix_fh->print("\t0"); 285 | } 286 | } 287 | $matrix_fh->print("\n"); 288 | } 289 | } 290 | 291 | 292 | ## usage 293 | sub help_text { 294 | my $this = shift; 295 | return <phenos 25 | if (class(x[,phenos])=="integer" & length(unique(x[,phenos]))<6) x[,phenos] [x[,phenos]>1]=1 26 | # Make a list of distinctive colors using afriggeri.github.com/RYB 27 | distinctColors = c(rgb(255,63,0,maxColorValue=255),rgb(51,23,0,maxColorValue=255),rgb(0,168,51,maxColorValue=255),rgb(41,95,153,maxColorValue=255),rgb(255,127,0,maxColorValue=255),rgb(255,127,127,maxColorValue=255),rgb(127,0,127,maxColorValue=255),rgb(127,211,25,maxColorValue=255),rgb(148,175,204,maxColorValue=255),rgb(153,75,0,maxColorValue=255),rgb(255,255,127,maxColorValue=255),rgb(89,11,63,maxColorValue=255),rgb(20,131,102,maxColorValue=255),rgb(191,0,63,maxColorValue=255),rgb(127,211,25,maxColorValue=255),rgb(255,255,0,maxColorValue=255),rgb(25,96,25,maxColorValue=255)); 28 | 29 | ######################### survival analysis 30 | 31 | library(survival) 32 | logr=NULL 33 | 34 | for (phenotype in phenos) 35 | { 36 | #clean data 37 | loopdata <- x; 38 | loopdata <- loopdata[!is.na(loopdata[,phenotype]),]; 39 | loopdata <- loopdata[!is.na(loopdata[,3]),]; 40 | loopdata <- loopdata[!is.na(loopdata[,2]),]; 41 | status=loopdata[,3]; 42 | time=loopdata[,2]; 43 | x1=loopdata[,phenotype]; 44 | 
base.class = as.vector(sort(unique(x1)))[1]; 45 | 46 | coxph(Surv(time, status) ~ x1, loopdata) -> co; 47 | summary(co)->co; co$conf->cox; co$logtest[3]->p; co$coef[5]->indv.p; 48 | rownames(cox) = sub("x1","",rownames(cox)); 49 | if (length(rownames(cox))==1 && rownames(cox)[1]=="") { rownames(cox)[1] = "1"; } 50 | logr=rbind(logr,cbind(base.class,rownames(cox),phenotype,cox,indv.p,p)) 51 | 52 | mfit.by <- survfit(Surv(time, status == 1) ~ x1, data = loopdata) 53 | ## file name for plot 54 | bitmap(file=paste(out.dir,"/",phenotype,"_survival_plot.png",sep="")) 55 | ## create survival plot 56 | plot(mfit.by,lty=1:10,ylab="Survival Probability",xlab="Time",col=distinctColors) 57 | if (dim(table(x1))>1) { 58 | title(paste(phenotype,", P=",signif(p,3),sep="")); 59 | } 60 | else { 61 | title(paste(phenotype)); 62 | } 63 | legend(x=legend.placement, legend=names(table(x1)), lty = 1:10, col=distinctColors) 64 | dev.off() 65 | } 66 | 67 | ########################## calculate fdr 68 | 69 | logr=logr[,-5]; 70 | if (length(phenos) < 2) { logr=(t(logr)); } 71 | fdr=p.adjust(as.numeric(logr[,"p"]),"fdr") 72 | logr=cbind(logr,fdr) 73 | 74 | ######################### print output 75 | 76 | colnames(logr)[1:9]=c("base.class","comparison.class","phenotype","hazard.ratio","lower.95","upper.95","2-class-p-value","p-value","fdr") 77 | logr=logr[order(logr[,"p-value"]),] 78 | write.table(logr,file=paste(out.dir,"survival_analysis_test_results.tsv",sep="/"),quote=F,append=F,row.names=F,sep="\t") 79 | -------------------------------------------------------------------------------- /lib/TGI/MuSiC2/correlation.pl: -------------------------------------------------------------------------------- 1 | #!/opt/local/bin/perl 2 | 3 | # FOR EACH GENE, DETERMINE THE CORRELATION BETWEEN ITS CANCER-SPECIFIC 4 | # -LOG10(P) AND THAT CANCER'S RESCALED BACKGROUND MUTATION RATE AND ALSO THE 5 | # MAXIMUM DISTANCE OF POINTS TO THE REGRESSION LINE 6 | # 7 | # REFERENCE 
completed/2007_correction_factor_fpc/scripts/plot_data.pl FOR SOME 8 | # PROGRAMMING GOODIES RELATIVE TO THIS PROBLEM 9 | # 10 | # IN READING THE DATA, THE LINE "if (defined $pval_neg_log && $pval_neg_log) {" 11 | # **IS** READING OCCURRENCES OF 0.0 IN THE INPUT FILES BECAUSE, ALTHOUGH PERL 12 | # PROCESSES NUMBERS AND TEXT-THAT-RESEMBLES-A-NUMBER DIFFERENTLY (SEE BELOW 13 | # EXAMPLE), IT *READS* A FILE OF NUMBERS INITIALLY AS TEXT 14 | # 15 | # bash-3.2$ perl bash-3.2$ perl 16 | # $s = 0.0; $s = "0.0"; <---quotation marks on this one 17 | # if ($s) { if ($s) { 18 | # print "yes\n"; print "yes\n"; 19 | # } else { } else { 20 | # print "no\n"; print "no\n"; 21 | # } } 22 | # no yes 23 | # 24 | # USAGE> ./correlation.pl > plot_file.dat 25 | 26 | # INFER CANCER TYPE FROM P-VALUE FILENAME AND VERIFY CONSISTENCY WITH RATE 27 | # NOMENCLATURE --- THIS DOES NOT INFER THE CANCER TYPE OF THE GENES-OF-INTEREST 28 | # FILE --- THAT IS SEPARATE 29 | # http://stackoverflow.com/questions/20647010/pass-regex-into-perl-subroutine 30 | # my $regexp = qr/^Perl$/; 31 | 32 | ############ 33 | # SET UP # 34 | ############ 35 | 36 | #__STANDARD PERL PACKAGES 37 | use strict; 38 | use warnings; 39 | 40 | #__STATISTICAL LIBRARIES 41 | use Statistics::OLS; 42 | 43 | ################### 44 | # CONFIGURATION # 45 | ################### 46 | 47 | #__FILE OF GENE NAMES FOR WHICH TO RUN ANALYSIS 48 | # 49 | # PROGRAMMING NOTE: SOME FILES FROM BAILEY (FROM MICROSOFT APPS) 50 | # HAVE BOTH \N AND \CR SO **ALWAYS** CHECK THE READING LOOP BELOW TO DETERMINE 51 | # WHETHER "chop" IS NEEDED 52 | # 53 | # CANCER TYPE IS INFERRED FROM THE FILENAME: SHOULD HAVE FORM OF "CANCER_*" 54 | 55 | # my $file_gene_list = "laml.tophits"; 56 | # my $file_gene_list = "321_Round/STAD_SMGs_from_321_genes.txt"; 57 | # my $file_gene_list = "LUAD_SMGs_from_366_genes_with_max_can.txt"; 58 | my $file_gene_list = "plots/cc_vs_cancer_dis_all_genes_annotated_from_366/STAD_SMGs_from_366_genes.txt"; 59 | 60 | #__ANALYZE 
"KANDOTH" SUBSET OF THESE GENES (=1) OR THE "NOT KANDOTH" SUBSET (=0) 61 | my $in_kandoth = 0; 62 | 63 | #__PANCAN DATA FILES TO PROCESS 64 | my @files = qw( 65 | ../revised_manuscript_adjust_cutoff/acc_pval.genesize.txt 66 | ../revised_manuscript_adjust_cutoff/blca_pval.genesize.txt 67 | ../revised_manuscript_adjust_cutoff/brca_pval.genesize.txt 68 | ../revised_manuscript_adjust_cutoff/cesc_pval.genesize.txt 69 | ../revised_manuscript_adjust_cutoff/coadread_pval.genesize.txt 70 | ../revised_manuscript_adjust_cutoff/gbm_pval.genesize.txt 71 | ../revised_manuscript_adjust_cutoff/hnsc_pval.genesize.txt 72 | ../revised_manuscript_adjust_cutoff/kich_pval.genesize.txt 73 | ../revised_manuscript_adjust_cutoff/kirc_pval.genesize.txt 74 | ../revised_manuscript_adjust_cutoff/kirp_pval.genesize.txt 75 | ../revised_manuscript_adjust_cutoff/laml_pval.genesize.txt 76 | ../revised_manuscript_adjust_cutoff/lgg_pval.genesize.txt 77 | ../revised_manuscript_adjust_cutoff/lihc_pval.genesize.txt 78 | ../revised_manuscript_adjust_cutoff/luad_pval.genesize.txt 79 | ../revised_manuscript_adjust_cutoff/lusc_pval.genesize.txt 80 | ../revised_manuscript_adjust_cutoff/OV_pval.genesize.txt 81 | ../revised_manuscript_adjust_cutoff/prad_pval.genesize.txt 82 | ../revised_manuscript_adjust_cutoff/skcm_pval.genesize.txt 83 | ../revised_manuscript_adjust_cutoff/stad_pval.genesize.txt 84 | ../revised_manuscript_adjust_cutoff/thca_pval.genesize.txt 85 | ../revised_manuscript_adjust_cutoff/ucec_pval.genesize.txt 86 | ../revised_manuscript_adjust_cutoff/ucs_pval.genesize.txt 87 | ); 88 | 89 | #__CANCER BACKGROUND MUTATION RATES (TRANSFORM TO 25-SCALE BY MULT 8.5e6) 90 | my $cancer_background_rate = { 91 | 'acc' => 5.04443724040595e-07 * 8500000, 92 | 'blca' => 2.29927147645268e-06 * 8500000, 93 | 'brca' => 4.43871323686807e-07 * 8500000, 94 | 'cesc' => 9.58381886685746e-07 * 8500000, 95 | 'coadread' => 8.1086390892704e-07 * 8500000, 96 | 'gbm' => 5.59470325737927e-07 * 8500000, 97 | 'hnsc' => 
1.29729704730899e-06 * 8500000, 98 | 'kich' => 6.43549698064298e-07 * 8500000, 99 | 'kirc' => 5.47376310309256e-07 * 8500000, 100 | 'kirp' => 9.19530214627311e-07 * 8500000, 101 | 'laml' => 1.16867075954987e-07 * 8500000, 102 | 'lgg' => 3.86862498457639e-07 * 8500000, 103 | 'lihc' => 1.95329184153507e-06 * 8500000, 104 | 'luad' => 2.3821583999372e-06 * 8500000, 105 | 'lusc' => 2.68752209654077e-06 * 8500000, 106 | 'ov' => 5.94633670211403e-07 * 8500000, 107 | 'prad' => 3.62495550231542e-07 * 8500000, 108 | 'skcm' => 2.90983510858331e-06 * 8500000, 109 | 'stad' => 1.52211643106822e-06 * 8500000, 110 | 'thca' => 1.46669747383227e-07 * 8500000, 111 | 'ucec' => 1.54617002760412e-06 * 8500000, 112 | 'ucs' => 6.72611552630032e-07 * 8500000, 113 | }; 114 | 115 | #################### 116 | # PRE-PROCESSING # 117 | #################### 118 | 119 | #__DETERMINE OUTPUT FILE NAME 120 | my $file_out; 121 | if ($file_gene_list =~ /^(\S+)\.txt$/) { 122 | $file_out = $1; 123 | } else { 124 | die "unexpected naming for input file '$file_gene_list'"; 125 | } 126 | if ($in_kandoth) { 127 | $file_out .= "_kandoth_YES.dat"; 128 | } else { 129 | $file_out .= "_kandoth_NO.dat"; 130 | } 131 | print "OUTPUT FILE IS $file_out\n"; 132 | 133 | #__OPEN OUTPUT FILE 134 | unlink $file_out if -e $file_out; 135 | open (OUT, ">$file_out") || die "cant open $file_out"; 136 | 137 | #__READ P-VALUE DATA FOR EACH GENE ACROSS ALL AVAILABLE CANCER TYPES 138 | my ($gene_pval_data, $gene_size) = ({}, {}); 139 | foreach my $file (@files) { 140 | 141 | #__INFER CANCER TYPE FROM FILE NAME USING REGEXP (SEE PROGRAMMING NOTES) 142 | my $regexp = qr/(\w+)\_pval\.genesize\.txt$/; 143 | my $cancer = _cancer_type_ ($file, $regexp); 144 | 145 | #__READ DATA 146 | open (F, $file) || die "cant open $file"; 147 | while (<F>) { 148 | 149 | #__PARSE PROCESS AND STORE A LINE OF DATA 150 | next if /^#/; 151 | chomp; 152 | my ($gene, $size, $pval_neg_log) = split; 153 | if (defined $pval_neg_log && $pval_neg_log) { 154 | 
            $gene_pval_data->{$gene}->{$cancer} = $pval_neg_log;
            $gene_size->{$gene} = $size; # OVER-WRITTEN WITH SAME DATA
        }
    }
    close (F);
}

#__READ KANDOTH LIST OF BONA-FIDE CANCER GENES
my $kandoth = {};
while (<DATA>) {
    chomp;
    $kandoth->{$_} = 1;
}

#__READ THE LIST OF GENES OF INTEREST
my $gene_list = {};
open (F, $file_gene_list) || die "cant open $file_gene_list";
while (<F>) {
    chomp;
    # chop; # 943genesPancan2_from_Bailey.txt SEEMS TO HAVE BOTH \N AND \CR
    $gene_list->{$_} = 1;
    # warn "reading --->$_<---\n";
}
close (F);

#__INFER CANCER TYPE FROM GENES-OF-INTEREST FILE USING REGEXP (SEE PROG NOTES)
my $regexp = qr/^plots\/cc_vs_cancer_dis_all_genes_annotated_from_366\/([A-Za-z]+)\_/;
my $cancer_type_main = _cancer_type_ ($file_gene_list, $regexp);

#####
# close (OUT);
# die "cancer is '$cancer_type_main' --- END";
#####

#__HEADER
print OUT "# CORRELATION COEFFICIENT VS CANCER-SPECIFIC DISTANCE TO REGRESSION LINE\n#\n";
print OUT "# script: $0\n#\n";
print OUT "# gene list input file: $file_gene_list\n#\n";
print OUT "# cancer type: $cancer_type_main\n#\n";
if ($in_kandoth) {
    print OUT "# status of genes being screened: in the Kandoth subset\n#\n";
} else {
    print OUT "# status of genes being screened: NOT in the Kandoth subset\n#\n";
}

#####################
#  MAIN PROCESSING  #
#####################
#
# X-AXIS: CANCER BACKGROUND MUTATION RATE * 8.5e+6
# Y-AXIS: -LOG10(P-VALUE)

#__GENE-BY-GENE REGRESSION ANALYSIS
my ($x_axis_min, $x_axis_max) = (9999, 0);
my ($y_axis_min, $y_axis_max) = (9999, 0);
my $num_filtered = 0;
foreach my $gene (sort keys %{$gene_pval_data}) {

    #__WE ARE ONLY LOOKING AT GENES IN THE LIST OF INTEREST
    next unless
        defined $gene_list->{$gene};

    #__GENERATE THE INPUT XVALS AND YVALS VECTORS
    my ($num_non_0_yvals, $num_max_significant_p) = (0, 0);
    my ($xvals, $yvals) = ([], []);
    foreach my $cancer (sort keys %{$gene_pval_data->{$gene}}) {
        $num_non_0_yvals++ if $gene_pval_data->{$gene}->{$cancer} > 0;
        $num_max_significant_p++ if $gene_pval_data->{$gene}->{$cancer} > 15;
        push @{$xvals}, $cancer_background_rate->{$cancer};
        push @{$yvals}, $gene_pval_data->{$gene}->{$cancer};
    }
    my $num_pts = scalar @{$xvals};

    #__MAKE SURE GENE HAS AT LEAST 3 DATA POINTS
    if ($num_pts < 3) {
        warn "$gene HAS LESS THAN 3 POINTS --- SKIPPING\n";
        next;
    }

    #__EXECUTE REGRESSION
    my $stats_obj = Statistics::OLS->new();
    $stats_obj->setData ($xvals, $yvals) || die ($stats_obj->error());
    $stats_obj->regress () || die ($stats_obj->error());

    #__PEARSON'S CORRELATION COEFFICIENT INDICATES GOODNESS OF CORRELATION
    my $r_squared = $stats_obj->rsq();
    $r_squared = 0 if $r_squared < 0;
    my $pearson = sqrt $r_squared;

    #__CALCULATE REGRESSION EQUATION PARAMETERS
    my ($intercept, $slope) = $stats_obj->coefficients();
    if ($slope == 0) {
        warn "$gene HAS 0 SLOPE --- CHECK YOUR DATA --- SKIPPING\n";
        next;
    }

    #__FIND MAX DISTANCE OF A POINT TO THE REGRESSION LINE (SEE OFFLINE NOTES)
    my ($y_value, $y_cancer) = (0, "");
    foreach my $cancer (sort keys %{$gene_pval_data->{$gene}}) {
        my ($x_1, $y_1) = ($cancer_background_rate->{$cancer}, $gene_pval_data->{$gene}->{$cancer});
        my $x_2 = ($y_1 - $intercept + $x_1/$slope) / ($slope + 1/$slope);
        my $y_2 = $slope*$x_2 + $intercept;
        my $distance = sqrt(($x_2-$x_1)*($x_2-$x_1) + ($y_2-$y_1)*($y_2-$y_1));

        ##_Y-AXIS IS MAXIMUM DISTANCE TO REGRESSION LINE (DEPRECATED)
        ## if ($distance > $y_value) {
        ##     $y_value = $distance;
        ##     $y_cancer = $cancer;
        ## }

        #__Y-AXIS IS DISTANCE TO REGRESSION LINE FOR CANCER-SPECIFIC DISTANCE
        if ($cancer eq $cancer_type_main) {
            $y_value = $distance;
            $y_cancer = $cancer; # TECHNICALLY SUPERFLUOUS
        }
    }

    #__CONDITIONAL OUTPUT
    if ($in_kandoth) {

        #__FOR KANDOTH GENES WE MAY OR MAY NOT NEED GENE NAMES ANNOTATED
        if (defined $kandoth->{$gene}) {
            ## print OUT "$pearson $y_value\n";
            print OUT "$pearson $y_value \"$gene\"\n";
            #### print OUT "$pearson $y_value \"$gene $y_cancer\"\n";
        }
    } else {

        #__FOR NON-KANDOTH GENES WE NEED GENE NAMES ANNOTATED
        unless (defined $kandoth->{$gene}) {
            print OUT "$pearson $y_value \"$gene\"\n";
            #### print OUT "$pearson $y_value \"$gene $y_cancer\"\n";
        }
    }

    #__TRACK AXIS BOUNDARIES TO REPORT HOW BIG X-AXIS AND Y-AXIS SHOULD BE SET TO
    $x_axis_max = $pearson if $pearson > $x_axis_max;
    $x_axis_min = $pearson if $pearson < $x_axis_min;
    $y_axis_max = $y_value if $y_value > $y_axis_max;
    $y_axis_min = $y_value if $y_value < $y_axis_min;
}

#__OUTPUT AXIS BOUNDARIES FOR PLOTTING: COMMENTS IGNORED BY XMGRACE
print OUT "# $x_axis_min <= X <= $x_axis_max\n";
print OUT "# $y_axis_min <= Y <= $y_axis_max\n";

##################
#  POST PROCESS  #
##################

#__CLOSE OUTPUT FILE
close (OUT);

################################################################################
#                                                                              #
#                            S U B R O U T I N E S                             #
#                                                                              #
################################################################################

# INFER CANCER TYPE EMBEDDED IN A FILE NAME: USES COMPILED REGEXPS HAVING
# A SINGLE CALLBACK --- SEE PROGRAMMING NOTES ABOVE

sub _cancer_type_ {
    my ($file, $regexp) = @_;
    my $cancer;
    if ($file =~ /$regexp/) {
        $cancer = $1;

        #__CONVERT MIXED TEXT TO ALL LOWER CASE FOR CONSISTENCY
        $cancer = lc $cancer;

        #__CHECK CONSISTENCY WITH BACKGROUND MUTATION RATE NOMENCLATURE
        unless (defined $cancer_background_rate->{$cancer}) {
            die "cancer '$cancer' from filename '$file' not a valid cancer type";
        }
    } else {
        die "cannot infer cancer type from file name '$file'";
    }
    return $cancer;
}

################################################################################
#                                                                              #
#                                   D A T A                                    #
#                                                                              #
################################################################################

# THE GENES FROM KANDOTH ET AL THAT ARE BONA FIDE CANCER GENES --- THESE
# ARE USEFUL TO GET A FEEL FOR WHAT PEARSON COEFFICIENT AND SLOPE ARE
# CHARACTERISTIC OF REAL CANCER GENES --- THIS LIST IS **IDENTICAL** TO
# THE CONTENTS OF FILE
#
# Kandoth_127_genes_from_Mike_M.txt
#
# BECAUSE WE READ-IN THE CONTENTS FROM THAT FILE

__DATA__
ACVR1B
ACVR2A
AJUBA
AKT1
APC
AR
ARHGAP35
ARID1A
ARID5B
ASXL1
ATM
ATR
ATRX
AXIN2
B4GALT3
BAP1
BRAF
BRCA1
BRCA2
CBFB
CCND1
CDH1
CDK12
CDKN1A
CDKN1B
CDKN2A
CDKN2C
CEBPA
CHEK2
CRIPAK
CTCF
CTNNB1
DNMT3A
EGFR
EGR3
EIF4A2
ELF3
EP300
EPHA3
EPHB6
EPPK1
ERBB4
ERCC2
EZH2
FBXW7
FGFR2
FGFR3
FLT3
FOXA1
FOXA2
GATA3
H3F3C
HGF
HIST1H1C
HIST1H2BD
IDH1
IDH2
KDM5C
KDM6A
KEAP1
KIT
KMT2B
KMT2C
KMT2D
KRAS
LIFR
LRRK2
MALAT1
MAP2K4
MAP3K1
MAPK8IP1
MECOM
MIR142
MTOR
NAV3
NCOR1
NF1
NFE2L2
NFE2L3
NOTCH1
NPM1
NRAS
NSD1
PBRM1
PCBP1
PDGFRA
PHF6
PIK3CA
PIK3CG
PIK3R1
POLQ
PPP2R1A
PRX
PTEN
PTPN11
RAD21
RB1
RPL22
RPL5
RUNX1
SETBP1
SETD2
SF3B1
SIN3A
SMAD2
SMAD4
SMC1A
SMC3
SOX17
SOX9
SPOP
STAG2
STK11
TAF1
TBL1XR1
TBX3
TET2
TGFBR2
TLR4
TP53
TSHZ2
TSHZ3
U2AF1
USP9X
VEZF1
VHL
WT1

--------------------------------------------------------------------------------
/t/TESTING.md:
--------------------------------------------------------------------------------
# Notes on testing

## Background reading

* [Useful set of slides](http://perl.plover.com/yak/testing/) - motivation and best practices
* [Testing tutorial](http://search.cpan.org/~rgarcia/perl-5.10.0/lib/Test/Tutorial.pod) - introductory stuff
* [Test reference card](https://github.com/statico/perl-test-refcard) - useful cheat sheet
* [Dr. Dobbs article about testing](http://www.drdobbs.com/web-development/automated-testing-with-the-perl-test-mod/184416061)
* [Test driven development](theory)

## Testing

`perl foo.t`

or

`prove -lr t`

--------------------------------------------------------------------------------
/t/foo.t:
--------------------------------------------------------------------------------
use strict;
use warnings;

use Test::Most;
use TGI::MuSiC2::Complicated;

TGI::MuSiC2::Complicated::print_stuff();
ok(1, '1 is definitely okay');

done_testing;

--------------------------------------------------------------------------------