├── PipelineStandard.md
├── README.md
└── scripts
    ├── LICENSE
    ├── classify_mie.py
    ├── compare_based_on_strand_output_bedpe.py
    ├── compare_round3_by_region.sh
    └── compare_single_sample_based_on_strand.py

/PipelineStandard.md:
--------------------------------------------------------------------------------
This pipeline standard was developed to aid in coordination of the Centers for Common Disease Genomics project. It was tested with HiSeq X data on pipeline implementations from five centers.

* [Alignment pipeline standards](#alignment-pipeline-standards)
    1. [Reference genome version](#reference-genome-version)
    2. [Alignment](#alignment)
    3. [Duplicate marking](#duplicate-marking)
    4. [Indel realignment](#indel-realignment)
    5. [Base quality score recalibration](#base-quality-score-recalibration)
    6. [Base quality score binning scheme](#base-quality-score-binning-scheme)
    7. [File format](#file-format)
* [Functional equivalence evaluation](#functional-equivalence-evaluation)
* [Pathway for updates to this standard](#pathway-for-updates-to-this-standard)

# Alignment pipeline standards

## Reference genome version
Each center should use exactly the same reference genome.

Standard:
* GRCh38DH, [1000 Genomes Project version](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/)
    * Includes the standard set of chromosomes and alternate sequences from GRCh38
    * Includes the decoy sequences
    * Includes additional alternate versions of the HLA locus

## Alignment
Each center should use exactly the same alignment strategy.

Standard:
* Aligner: BWA-MEM
* Version: We will use 0.7.15 (https://github.com/lh3/bwa/releases/tag/v0.7.15)
* Standardized parameters:
    * Do not use `-M` since it causes split-read alignments to be marked as "secondary" rather than "supplementary" alignments, violating the BAM specification
    * Use `-K 100000000` to achieve deterministic alignment results (Note: this is a hidden option)
    * Use `-Y` to force soft-clipping rather than default hard-clipping of supplementary alignments
    * Include a `.alt` file for consumption by BWA-MEM; do not perform post-processing of alternate alignments
* Optional parameters (may be useful for convenience and not expected to alter results):
    * `-p` (for interleaved fastq)
    * `-C` (append FASTA/FASTQ comment to SAM output)
    * `-v` (logging verbosity)
    * `-t` (threading)
    * `-R` (read group header line)
* Post-alignment modification:
    * In order to reduce false positive calls due to bacterial contamination randomly aligning to the human genome, reads and their mates may be marked by setting the 0x4 bit in the SAM flag if the following conditions apply:
        1. The primary alignment has fewer than 32 aligned bases
        2. The primary alignment is soft clipped on both sides
    * This filtering is optional
    * The original mapping information will be encoded in a Previous Alignment (PA) tag on the marked reads using the same format as the SA tag in the BAM specification.
    * Modification of other flags after alignment will not be performed.
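
The optional post-alignment filter described above can be sketched in a few lines of Python. This is only an illustration, not code from any center's pipeline: the function names are hypothetical, "aligned bases" is assumed to mean the `M`, `=` and `X` CIGAR operations, and the two conditions are assumed to be required together.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

UNMAPPED_FLAG = 0x4  # SAM flag bit for "segment unmapped"


def parse_cigar(cigar):
    """Split a CIGAR string such as '60S30M61S' into (length, op) tuples."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def aligned_bases(cigar):
    """Count aligned read bases (assumption: M, = and X operations only)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "M=X")


def soft_clipped_both_sides(cigar):
    """True if the alignment starts and ends with a soft clip (hard clips ignored)."""
    ops = [op for _, op in parse_cigar(cigar) if op != "H"]
    return len(ops) >= 3 and ops[0] == "S" and ops[-1] == "S"


def is_contamination_candidate(cigar, min_aligned=32):
    """Assumption: both conditions from the standard must hold together."""
    return aligned_bases(cigar) < min_aligned and soft_clipped_both_sides(cigar)


def mark_unmapped(flag):
    """Return the SAM flag with the 0x4 bit set."""
    return flag | UNMAPPED_FLAG
```

For example, a primary alignment with CIGAR `60S30M61S` has 30 aligned bases and is clipped on both sides, so it qualifies, while `151M` does not. Writing the PA tag is left out of the sketch.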

## Duplicate marking
Different centers can use different tools, as long as the same number of reads are marked duplicate and results are functionally equivalent. During the pipeline synchronization exercise we evaluated four tools: Picard MarkDuplicates, bamUtil, samblaster, and sambamba. After the exercise, centers are using Picard and bamUtil.

Standard:
* Match Picard’s current definition of duplicates for primary alignments where both reads of a pair align to the reference genome. Both samblaster and bamUtil already attempt to match Picard for this class of alignments.
* If a primary alignment is marked as duplicate, then all supplementary alignments for that same read should also be marked as duplicates. Both Picard and bamUtil have been modified to exhibit this behavior. For Picard, you must use >= version 2.4.1 and run on a queryname sorted input file. BamUtil must be version >=TODO. Samblaster supports this behavior, but Sambamba does not.
* Orphan alignments (where the mate paired read is unmapped) will be marked as duplicates if there’s another read with the same alignment (mated, or orphaned)
* The unmapped mate of duplicate orphan reads is required to also be marked as a duplicate.
* It is not a requirement for duplicate marking software to choose the best pair based on base quality sum, but results must be functionally equivalent. In practice we have moved away from using samblaster for this reason.
* If a primary alignment is marked as duplicate, then all secondary alignments for that read should also be marked as duplicates. However, given that no secondary alignments will exist using our proposed alignment strategy, it is optional for software to implement.
* There was a discussion about whether duplicate marking should be deterministic. We did not reach a decision on this.
* We have discussed the preferred behavior for marking duplicates in datasets with multiple sequencing libraries and have decided that this is a minor concern given that very few samples should have multiple libraries. Currently MarkDuplicates supports multiple libraries with the caveat that the term “Library” isn’t exactly defined (consider a technical replicate that starts somewhere in the middle of the LC process, how early must it be to be called a different library?)

## Indel realignment
This computationally expensive data processing step is dispensable given the state of current indel detection algorithms and will not be performed.

## Base quality score recalibration
There was discussion about dropping BQSR given evidence that the impact on variant calling performance is minimal. However, given that this project will involve combined analysis of data from multiple centers and numerous sequencers, generated over multiple years, and that we cannot ensure the consistency of Illumina base-calling software over time, we decided that it is preferable to perform BQSR. We evaluated two tools, GATK BaseRecalibrator (both GATK3 and GATK4) and bamUtil.

Standard:
* We will use the following files from the [GATK hg38 bundle](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/) for the site list:
    * Homo_sapiens_assembly38.dbsnp138.vcf
    * Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    * Homo_sapiens_assembly38.known_indels.vcf.gz
* The recalibration table may optionally be generated using only the autosomes (chr1-chr22)
* Downsampling of the reads is optional
* The per-base alignment qualities (BAQ) algorithm is optional

Command line:

For users of GATK, the following command line options should be utilized for the BaseRecalibrator tool:
```
-R ${ref_fasta} \
-I ${input_bam} \
-O ${recalibration_report_filename} \
-knownSites "Homo_sapiens_assembly38.dbsnp138.vcf" \
-knownSites "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz" \
-knownSites "Homo_sapiens_assembly38.known_indels.vcf.gz"
```

For users of GATK, the following command line options are optional for efficiency and can be utilized for the BaseRecalibrator tool:
Note: we have tested a downsampling fraction of 0.1. Lower fractions should be tested for functional equivalence.
```
--downsample_to_fraction .1 \
-L chr1 \
-L chr2 \
-L chr3 \
-L chr4 \
-L chr5 \
-L chr6 \
-L chr7 \
-L chr8 \
-L chr9 \
-L chr10 \
-L chr11 \
-L chr12 \
-L chr13 \
-L chr14 \
-L chr15 \
-L chr16 \
-L chr17 \
-L chr18 \
-L chr19 \
-L chr20 \
-L chr21 \
-L chr22
```

For users of GATK, the following command line options are optional:
* `-rf BadCigar`
* `--preserve_qscores_less_than 6`
* `--disable_auto_index_creation_and_locking_when_reading_rods`
* `--disable_bam_indexing`
* `-nct`
* `--useOriginalQualities`

## Base quality score binning scheme
Additional base quality score compression is required to reduce file size. It is possible to achieve this with minimal adverse impacts on variant calling.

Standard:
* 4-bin quality score compression. The 4-bin scheme is 2-6, 10, 20, 30. The 2-6 scores correspond to Illumina error codes and will be left as-is by recalibration.
* Bin base quality scores by rounding off to the nearest bin value, in probability space. This feature is already implemented in the current version of GATK.
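
To make "rounding in probability space" concrete, here is a minimal Python sketch. It is one plausible reading of the rule and is not GATK's implementation: the function names are hypothetical, and it assumes that scores of 6 or below are passed through unchanged and that all higher scores map to whichever of 10, 20 or 30 has the closest error probability.

```python
# Assumptions (not taken from GATK source): the bins are the three values named
# in the standard, and scores of 6 or below are left untouched.
BINS = (10, 20, 30)
PRESERVE_AT_OR_BELOW = 6


def error_probability(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def bin_quality(q):
    """Map a quality score to the bin whose error probability is closest."""
    if q <= PRESERVE_AT_OR_BELOW:
        return q
    p = error_probability(q)
    return min(BINS, key=lambda b: abs(error_probability(b) - p))
```

Under these assumptions Q13 maps to 20 rather than 10: its error probability (about 0.050) is closer to 0.01 than to 0.1, whereas rounding in Phred space would have chosen 10.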

Command line:

For users of GATK, the following command line options should be utilized for the PrintReads (GATK3) or ApplyBQSR (GATK4) tool:
```
-R ${ref_fasta} \
-I ${input_bam} \
-O ${output_bam_basename}.bam \
-bqsr ${recalibration_report} \
-SQQ 10 -SQQ 20 -SQQ 30 \
--disable_indel_quals
```

For users of GATK, the following command line options are optional:
* `--globalQScorePrior -1.0`
* `--preserve_qscores_less_than 6`
* `--useOriginalQualities`
* `-nct`
* `-rf BadCigar`
* `--createOutputBamMD5`
* `--addOutputSAMProgramRecord`

## File format
Each center should use the same file format, while retaining flexibility to include additional information for specific centers or projects.

Standard:
* Lossless CRAM. Upon conversion to BAM, the BAM file should be valid according to Picard’s ValidateSamFile.
* Read group (@RG) tags should be present for all reads.
    * The header for the RG should contain minimally the ID tag, PL tag, PU tag, SM tag, and LB tag.
    * The CN tag is recommended.
    * Other tags are optional.
    * The ID tag must be unique within the CRAM. ID tags may be freely renamed to maintain uniqueness when merging CRAMs. No assumptions should be made about the permanence of RG IDs.
    * The PL tag should indicate the instrument vendor name according to the SAM spec (CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO). PL values are case insensitive.
    * The PU tag is used for grouping reads for BQSR and should uniquely identify reads as belonging to a sample-library-flowcell-lane (or other appropriate recalibration unit) within the CRAM file. PU is not required to contain values for fields that are uniform across the CRAM (e.g., single sample CRAM or single library CRAM).
      The PU tag is not guaranteed to be sufficiently informative after merging with other CRAMs, and anyone performing a merge should consider modifying PU values appropriately.
    * SM should contain the individual identifier for the sample (e.g., NA12878) without any other process or aliquot-specific information.
    * The LB tag should uniquely identify the library for the sample; it must be present even if there is only a single library per sample or CRAM file.
    * If the PM tag is used, values should conform to one of the following (for Illumina instruments): “HiSeq-X”, “HiSeq-4000”, “HiSeq-2500”, “HiSeq-2000”, “NextSeq-500”, or “MiSeq”.
* Retain original query names.
* Retain @PG records for bwa, duplicate marking, quality recalibration, and any other tools that were run on the data.
* Retain the minimal set of tags (RG, MQ, MC and SA). NOTE: an additional tool may be needed to add the MQ and MC tags if none of the tools add these tags otherwise. One option is to pipe the alignment through [samblaster](https://github.com/GregoryFaust/samblaster) with the options `-a --addMateTags` as it comes out of BWA.
* Groups can add custom tags as needed.
* Do not retain the original base quality scores (OQ tag).
* It is recommended that users use samtools version >=1.3.1 to convert from BAM/SAM to CRAM (the use of htsjdk/Picard/GATK for converting BAM to CRAM is not currently condoned). Users who would like to convert back from CRAM to BAM (and want to avoid ending up with an invalid BAM) need to either convert to SAM and then to BAM (piping works) or compile samtools with HTSLib version >=1.3.2. To enable this, configure the build of samtools with the parameter `--with-htslib=/path/to/htslib-1.3.2`.

# Functional equivalence evaluation
All pipelines used for this effort need to be validated as functionally equivalent. The validation methodology will be published alongside a test data set.

# Pathway for updates to this standard
Pipelines will need to be updated during the project, but this should be a tightly controlled process given the need to reprocess vast amounts of data each time substantial pipeline modifications occur.

Draft plan:
* Initial pipeline versions should serve for project years 1-2.
* Efficiency updates passing functional equivalence tests are always allowed.
* Propose to start a review process in late 2017: invite proposals for pipeline updates that incorporate new aligners, reference genomes and data processing steps.
* If substantial improvements are achievable, implement new pipelines for project years 3-4.
* There will need to be a decision about how large the potential variant calling improvements should be to warrant pipeline modification and data reprocessing.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Functional equivalent whole genome sequencing analysis pipelines

### This pipeline standard was developed to aid in coordination of the Centers for Common Disease Genomics project. It was tested with HiSeq X data on pipeline implementations from five centers.

For detailed description, please see the following article:
__[Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects.](https://www.ncbi.nlm.nih.gov/pubmed/30279509)__ _Nat Commun.
2018 Oct 2;9(1):4038_

--------------------------------------------------------------------------------
/scripts/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Allison Regier

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/scripts/classify_mie.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

import re
import sys

class Uninformative(RuntimeError):
    pass

class MIE(object):
    def __init__(self):
        pass

    def __call__(self, fields):
        #assume last 3 columns are the genotypes in KID DAD MOM order
        kid, dad, mom = [ re.split('[/|]', x) for x in fields[-3:] ]
        dad_alleles, mom_alleles, kid_alleles = map(set, (dad, mom, kid))
        if len(dad_alleles) == 2 and len(mom_alleles) == 2 and len(dad_alleles.union(mom_alleles).union(kid_alleles)) == 2:
            # both are het and not a multi-allelic site, this site is UNINFORMATIVE assuming input includes no missing genotypes
            raise Uninformative
        a1, a2 = kid
        classification = 'MIE'
        if (a1 in dad_alleles and a2 in mom_alleles) or (a2 in dad_alleles and a1 in mom_alleles):
            # NOT AN MIE
            classification = 'NONE'
        return fields[:-3] + [ classification ]

if __name__ == '__main__':
    mie_classify = MIE()
    for line in sys.stdin:
        fields = line.rstrip().split('\t')
        try:
            # print() with a single argument behaves the same under Python 2 and 3
            print('\t'.join(mie_classify(fields)))
        except Uninformative:
            pass

--------------------------------------------------------------------------------
/scripts/compare_based_on_strand_output_bedpe.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
from __future__ import division

import sys
import argparse

#TODO what do we do with missing data? Right now is counted as discordance or match depending on if the other sample has data or not

class Bedpe(object):
    def __init__(self, bed_list):
        self.c1 = bed_list[0]
        self.s1 = int(bed_list[1])
        self.e1 = int(bed_list[2])
        self.c2 = bed_list[3]
        self.s2 = int(bed_list[4])
        self.e2 = int(bed_list[5])
        self.name = bed_list[6]
        self.score = bed_list[7]
        self.o1 = bed_list[8]
        self.o2 = bed_list[9]
        self.svtype = bed_list[10]
        self.filter = bed_list[11]
        self.orig_name1 = bed_list[12]
        self.orig_ref1 = bed_list[13]
        self.orig_alt1 = bed_list[14]
        self.orig_name2 = bed_list[15]
        self.orig_ref2 = bed_list[16]
        self.orig_alt2 = bed_list[17]
        self.info1 = bed_list[18]
        self.info2 = bed_list[19]
        # remainder will be entries then second bedpe entry will follow
        self.format = bed_list[20]
        self.misc = bed_list[21:]

    def toString(self, status, sample, region):
        return '\t'.join(str(v) for v in (self.c1, self.s1, self.e1, self.c2, self.s2, self.e2, self.name, self.score, self.o1, self.o2, self.svtype, self.filter, self.orig_name1, self.orig_ref1, self.orig_alt1, self.orig_name2, self.orig_ref2, self.orig_alt2, self.info1+";Regions="+region, self.info2, self.format+":RE", self.misc[sample]+":"+status))+"\n"

class PairToPair(object):
    def __init__(self, sample_header, out_dir, region):
        self._load_samples(sample_header)
        self.outFiles = [ open(out_dir+"/"+x+".bedpe", 'a') for x in self.samples ]
        self.region = region
        self.deferred = dict()

    def _load_samples(self, sample_header):
        with open(sample_header) as fh:
            for line in fh:
                fields = line.rstrip().split('\t')
                self.samples = fields[21:]
                self.num_samples = len(self.samples)
                self.index = 21 + self.num_samples

    def process(self, fields_list):
        bedpe1 = Bedpe(fields_list[0:self.index])
        if len(fields_list) == self.index:
            # this is unique to the first file
            # process the line and record variants as unique
            for index, x in enumerate(bedpe1.misc):
                fields = x.rstrip().split(':', 1)
                #self.results[index].unique += 1
                if fields[0] in ('0/1', '1/1'):
                    #should be a variant
                    self.outFiles[index].write(bedpe1.toString("0-only", index, self.region))
        else:
            # this MAY be a match
            # check strands and then determine class
            # for each sample record results
            bedpe2 = Bedpe(fields_list[self.index:])
            if strands_match(bedpe1.o1, bedpe1.o2, bedpe2.o1, bedpe2.o2):
                # matched by strand
                if bedpe1.name in self.deferred:
                    del self.deferred[bedpe1.name]
                for index in range(self.num_samples):
                    discordant_type = not types_match(bedpe1.svtype, bedpe2.svtype)
                    gt1 = gt_for_index(bedpe1, index)
                    gt2 = gt_for_index(bedpe2, index)
                    if gt1 in ('0/1', '1/1') or gt2 in ('0/1', '1/1'):
                        if gt1 == gt2:
                            self.outFiles[index].write(bedpe1.toString("match", index, self.region))
                        else:
                            self.outFiles[index].write(bedpe1.toString("discordant", index, self.region))
            else:
                self.deferred[bedpe1.name] = bedpe1

    def process_shadow_unique(self):
        # count up things that overlapped but ended up being unique
        for bedpe in self.deferred.values():
            for index, x in enumerate(bedpe.misc):
                gt = gt_for_index(bedpe, index)
                if gt in ('0/1', '1/1'):
                    self.outFiles[index].write(bedpe.toString("0-only", index, self.region))

    def finalize(self):
        for index, f in enumerate(self.outFiles):
            f.close()

def gt_for_index(bedpe, index):
    format_fields = bedpe.misc[index].rstrip().split(':', 1)
    return format_fields[0]


def strands_match(eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b):
    if '.' in [eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b]:
        return False
    if eval_strand_a != eval_strand_b:
        return (eval_strand_a == compare_strand_a
                and eval_strand_b == compare_strand_b)
    else:
        return compare_strand_a == compare_strand_b

def types_match(type1, type2):
    return type1 == type2

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Filter output of pairtopair based on strand.')
    parser.add_argument('-s', '--sample-header', metavar='', type=str, help='Filename of file containing sample header of the bedpe')
    parser.add_argument('-o', '--outDir', metavar='', type=str, help='Directory where output files should be written')
    parser.add_argument('-r', '--region', metavar='', type=str, help='Region to record in INFO field')
    args = parser.parse_args()

    p = PairToPair(args.sample_header, args.outDir, args.region)
    try:
        for line in sys.stdin:
            fields = line.rstrip().split('\t')
            p.process(fields)
        p.process_shadow_unique()
        p.finalize()
    except ValueError as e:
        sys.exit(e)

--------------------------------------------------------------------------------
/scripts/compare_round3_by_region.sh:
--------------------------------------------------------------------------------
#!/bin/bash

set -eo pipefail
BEDTOOLS=$1 #path to bedtools executable
# Pad both intervals of each BEDPE record by 1 bp and write the result to <file>.padded.
# Note: this function is not invoked in this script; the .padded inputs used below are assumed to exist already.
pad1bp () {
    cat "$1" | perl -ape '$F[1] -= 1; $F[2]+=1; $F[4] -= 1; $F[5] += 1; $_ = join("\t", @F)."\n"' > $1.padded
}

OUTPUT_DIR=$2 #output directory
# was $OUTPUT/results.txt, which expanded an unset variable
OUTPUT=$OUTPUT_DIR/results.txt
DIFFICULT=$3 #BED file of difficult regions
REST=$4 #BED file of remaining regions
SAMPLE_HEADER=$5 #BEDPE header line listing all samples in correct order
echo "Pipeline1 Pipeline2 Sample Class Count" > $OUTPUT

#file containing list of pipelines and paths to final BEDPE files from each pipeline,
#separated by tabs
BEDPE_LIST=$6
while read pipeline1 pipeline1_file
do
    while read pipeline2 pipeline2_file
    do
        if [ $pipeline1 == $pipeline2 ]; then
            continue
        fi
        # was $OUTDIR, which is never defined; OUTPUT_DIR is the directory passed as $2
        out_dir=$OUTPUT_DIR/${pipeline1}_${pipeline2}
        mkdir -p $out_dir/logs

        for sample in $( cut -f 22- $SAMPLE_HEADER ); do
            grep "^##" $pipeline1_file > $out_dir/$sample.bedpe
            echo "##FORMAT=" >> $out_dir/$sample.bedpe
            echo "##INFO=" >> $out_dir/$sample.bedpe
            grep "^#CHROM_A" $pipeline1_file | cut -f 1-21 | sed "s/$/\t$sample/" >> $out_dir/$sample.bedpe
        done
        $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type both -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r hard
        $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type both -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r medium
        $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type both -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type neither | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r easy
        $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type notboth -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r hard
        $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type notboth -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type either |
            rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r medium
        $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type notboth -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type neither | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r easy

    done < $BEDPE_LIST
done < $BEDPE_LIST

--------------------------------------------------------------------------------
/scripts/compare_single_sample_based_on_strand.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
from __future__ import division

import sys
import argparse

#TODO what do we do with missing data? Right now is counted as discordance or match depending on if the other sample has data or not

class Result(object):
    def __init__(self):
        self.unique = 0
        self.match = 0
        self.discordant = 0
        self.match_discordant_type = 0
        self.discordant_discordant_type = 0
        self.unmatched = dict()

class Bedpe(object):
    def __init__(self, bed_list):
        self.c1 = bed_list[0]
        self.s1 = int(bed_list[1])
        self.e1 = int(bed_list[2])
        self.c2 = bed_list[3]
        self.s2 = int(bed_list[4])
        self.e2 = int(bed_list[5])
        self.name = bed_list[6]
        self.score = bed_list[7]
        self.o1 = bed_list[8]
        self.o2 = bed_list[9]
        self.svtype = bed_list[10]
        self.filter = bed_list[11]
        self.orig_name1 = bed_list[12]
        self.orig_ref1 = bed_list[13]
        self.orig_alt1 = bed_list[14]
        self.orig_name2 = bed_list[15]
        self.orig_ref2 = bed_list[16]
        self.orig_alt2 = bed_list[17]
        self.info1 = bed_list[18]
        self.info2 = bed_list[19]
        # remainder will be entries then second bedpe entry will follow
        self.format = bed_list[20]
        self.misc = bed_list[21:]

class PairToPair(object):
    def __init__(self, label):
        self.label = label
        self.results = Result()
        self.deferred = dict()

    def process(self, fields_list):
        # a single-sample BEDPE record has 22 columns (21 fixed columns plus one sample)
        bedpe1 = Bedpe(fields_list[0:22])
        if len(fields_list) == 22:
            # this is unique to the first file
            # process the line and record variants as unique
            for index, x in enumerate(bedpe1.misc):
                fields = x.rstrip().split(':', 1)
                if fields[0] in ('0/1', '1/1'):
                    #should be a variant
                    self.results.unique += 1
        else:
            # this MAY be a match
            # check strands and then determine class
            # for each sample record results
            bedpe2 = Bedpe(fields_list[22:])
            if strands_match(bedpe1.o1, bedpe1.o2, bedpe2.o1, bedpe2.o2):
                # matched by strand
                if bedpe1.name in self.deferred:
                    del self.deferred[bedpe1.name]
                for index in range(1):
                    discordant_type = not types_match(bedpe1.svtype, bedpe2.svtype)
                    gt1 = gt_for_index(bedpe1, index)
                    gt2 = gt_for_index(bedpe2, index)
                    if gt1 in ('0/1', '1/1') or gt2 in ('0/1', '1/1'):
                        if gt1 == gt2:
                            self.results.match += 1
                            if discordant_type:
                                self.results.match_discordant_type += 1
                        else:
                            self.results.discordant += 1
                            if discordant_type:
                                self.results.discordant_discordant_type += 1
            else:
                self.deferred[bedpe1.name] = bedpe1

    def process_shadow_unique(self):
        # count up things that overlapped but ended up being unique
        for bedpe in self.deferred.values():
            for index, x in enumerate(bedpe.misc):
                gt = gt_for_index(bedpe, index)
                if gt in ('0/1', '1/1'):
                    self.results.unique += 1

    def summarize(self):
        # print() with a single argument behaves the same under Python 2 and 3
        unique_label = '-'.join((self.label, 'only'))
        print('\t'.join((unique_label, str(self.results.unique))))
        print('\t'.join(('match', str(self.results.match))))
        print('\t'.join(('discordant', str(self.results.discordant))))
        print('\t'.join(('match_discordant_type', str(self.results.match_discordant_type))))
        print('\t'.join(('discordant_discordant_type', str(self.results.discordant_discordant_type))))

def gt_for_index(bedpe, index):
    format_fields = bedpe.misc[index].rstrip().split(':', 1)
    return format_fields[0]


def strands_match(eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b):
    if '.' in [eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b]:
        return False
    if eval_strand_a != eval_strand_b:
        return (eval_strand_a == compare_strand_a
                and eval_strand_b == compare_strand_b)
    else:
        return compare_strand_a == compare_strand_b

def types_match(type1, type2):
    return type1 == type2

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Filter output of pairtopair based on strand.')
    parser.add_argument('-l', '--label', metavar='', type=str, help='Label for a file')
    args = parser.parse_args()

    p = PairToPair(args.label)
    try:
        for line in sys.stdin:
            fields = line.rstrip().split('\t')
            p.process(fields)
        p.process_shadow_unique()
        p.summarize()
    except ValueError as e:
        sys.exit(e)

--------------------------------------------------------------------------------