├── PipelineStandard.md
├── README.md
└── scripts
    ├── LICENSE
    ├── classify_mie.py
    ├── compare_based_on_strand_output_bedpe.py
    ├── compare_round3_by_region.sh
    └── compare_single_sample_based_on_strand.py


/PipelineStandard.md:
--------------------------------------------------------------------------------
  1 | This pipeline standard was developed to aid in coordination of the Centers for Common Disease Genomics project.  It was tested with HiSeq X data on pipeline implementations from five centers.
  2 | 
  3 | * [Alignment pipeline standards](#alignment-pipeline-standards)
  4 |    1. [Reference genome version](#reference-genome-version)
  5 |    2. [Alignment](#alignment)
  6 |    3. [Duplicate marking](#duplicate-marking)
  7 |    4. [Indel realignment](#indel-realignment)
  8 |    5. [Base quality score recalibration](#base-quality-score-recalibration)
  9 |    6. [Base quality score binning scheme](#base-quality-score-binning-scheme)
 10 |    7. [File format](#file-format)
 11 | * [Functional equivalence evaluation](#functional-equivalence-evaluation)
 12 | * [Pathway for updates to this standard](#pathway-for-updates-to-this-standard)
 13 | 
 14 | # Alignment pipeline standards
 15 | 
 16 | ## Reference genome version
 17 | Each center should use exactly the same reference genome.
 18 | 
 19 | Standard:
 20 | * GRCh38DH, [1000 Genomes Project version](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/)
 21 | * Includes the standard set of chromosomes and alternate sequences from GRCh38
 22 | * Includes the decoy sequences
 23 | * Includes additional alternate versions of the HLA locus
 24 | 
 25 | ## Alignment
 26 | Each center should use exactly the same alignment strategy.
 27 | 
 28 | Standard:
 29 | * Aligner: BWA-MEM
 30 | * Version: We will use 0.7.15 (https://github.com/lh3/bwa/releases/tag/v0.7.15)
 31 | * Standardized parameters:
 32 |     * Do not use `-M` since it causes split-read alignments to be marked as "secondary" rather than "supplementary" alignments, violating the BAM specification
 33 |     * Use `-K 100000000` to achieve deterministic alignment results (Note: this is a hidden option)
 34 |     * Use `-Y` to force soft-clipping rather than default hard-clipping of supplementary alignments
 35 |     * Include a `.alt` file for consumption by BWA-MEM; do not perform post-processing of alternate alignments
 36 | * Optional parameters (may be useful for convenience and not expected to alter results):
 37 |     * `-p` (for interleaved fastq)
 38 |     * `-C` (append FASTA/FASTQ comment to SAM output)
 39 |     * `-v` (logging verbosity)
 40 |     * `-t` (threading)
 41 |     * `-R` (read group header line)
 42 | * Post-alignment modification:
 43 |     * To reduce false positive calls caused by bacterial contamination randomly aligning to the human genome, reads and their mates may be marked as unmapped by setting the 0x4 bit in the SAM flag if the following conditions apply:
 44 |         1. The primary alignment has fewer than 32 aligned bases
 45 |         2. The primary alignment is soft clipped on both sides
 46 |     * This filtering is optional
 47 |     * The original mapping information will be encoded in a Previous Alignment (PA) tag on the marked reads using the same format as the SA tag in the BAM specification.
 48 |     * Modification of other flags after alignment will not be performed.
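
A minimal sketch of how the standardized parameters in this section could be assembled into a BWA-MEM command line. The helper name `bwa_mem_command`, the file names, and the thread count are illustrative assumptions and not part of the standard; only the flags come from the list above. The sketch also assumes the `.alt` file sits next to the index under the same prefix as the reference, so it is not passed as a flag.

```python
def bwa_mem_command(ref_fasta, fastq1, fastq2, read_group, threads=4):
    """Build an argument list for BWA-MEM using the standardized flags.

    -M is deliberately absent, as required by the standard.
    """
    return [
        "bwa", "mem",
        "-K", "100000000",   # fixed input batch size for deterministic results
        "-Y",                # soft-clip supplementary alignments
        "-t", str(threads),  # optional: threading
        "-R", read_group,    # optional: read group header line
        ref_fasta, fastq1, fastq2,
    ]


# Hypothetical inputs, for illustration only.
cmd = bwa_mem_command(
    "reference.fa",
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "@RG\\tID:rg1\\tPL:ILLUMINA\\tPU:unit1\\tSM:NA12878\\tLB:lib1",
)
print(" ".join(cmd))
```

The list form can be handed to `subprocess` without shell quoting concerns; whether to pipe the output directly into a downstream tool is left to each center.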
 49 | 
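The optional post-alignment marking rule described in the Alignment section can be sketched as a CIGAR check. This is an illustration under stated assumptions: both conditions are taken to be required together, "aligned bases" is interpreted as the read bases covered by `M`, `=`, or `X` operations, and the function names are hypothetical. Writing the PA tag and marking the mate are not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_bases(cigar):
    """Count read bases in alignment-match operations (M, =, X)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def clipped_both_sides(cigar):
    """True if the alignment starts and ends with a soft clip."""
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    return len(ops) >= 3 and ops[0] == "S" and ops[-1] == "S"


def should_mark_unmapped(cigar, min_aligned=32):
    """Apply both conditions to a primary alignment's CIGAR string."""
    return aligned_bases(cigar) < min_aligned and clipped_both_sides(cigar)


def mark_unmapped(flag):
    """Set the 0x4 (segment unmapped) bit in a SAM flag."""
    return flag | 0x4
```

For example, a primary alignment with CIGAR `60S30M60S` would qualify, while `60S40M50S` (40 aligned bases) and `30M120S` (clipped on one side only) would not.
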
 50 | ## Duplicate marking
 51 | Different centers can use different tools, as long as the same number of reads are marked duplicate and results are functionally equivalent.  During the pipeline synchronization exercise we evaluated four tools: Picard MarkDuplicates, bamUtil, samblaster, and sambamba.  After the exercise, centers are using Picard and bamUtil.
 52 | 
 53 | Standard:
 54 | * Match Picard’s current definition of duplicates for primary alignments where both reads of a pair align to the reference genome. Both samblaster and bamUtil already attempt to match Picard for this class of alignments.
 55 | * If a primary alignment is marked as duplicate, then all supplementary alignments for that same read should also be marked as duplicates. Both Picard and bamUtil have been modified to exhibit this behavior.  For Picard, you must use version >= 2.4.1 and run on a queryname sorted input file.  BamUtil must be version >=TODO.  Samblaster supports this behavior, but Sambamba does not.
 56 | * Orphan alignments (where the mate is unmapped) will be marked as duplicates if there’s another read with the same alignment (mated or orphaned).
 57 | * The unmapped mate of duplicate orphan reads is required to also be marked as a duplicate.
 58 | * It is not a requirement for duplicate marking software to choose the best pair based on base quality sum, but results must be functionally equivalent.  In practice we have moved away from using samblaster for this reason.
 59 | * If a primary alignment is marked as duplicate, then all secondary alignments for that read should also be marked as duplicates. However, given that no secondary alignments will exist using our proposed alignment strategy, it is optional for software to implement.
 60 | * There was a discussion about whether duplicate marking should be deterministic. We did not reach a decision on this.
 61 | * We have discussed the preferred behavior for marking duplicates in datasets with multiple sequencing libraries and have decided that this is a minor concern given that very few samples should have multiple libraries. Currently MarkDuplicates supports multiple libraries with the caveat that the term “Library” isn’t exactly defined (consider a technical replicate that starts somewhere in the middle of the LC process, how early must it be to be called a different library?)
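
The rule that supplementary alignments inherit the duplicate status of their primary alignment can be sketched as follows. The `Alignment` record and `propagate_duplicate_flag` are hypothetical names used only for illustration; this is not how Picard or bamUtil implement it, and duplicate detection itself is assumed to have already happened.

```python
from collections import namedtuple

# Hypothetical minimal record; a real pipeline would use full SAM records.
Alignment = namedtuple(
    "Alignment", "qname is_read1 is_supplementary is_duplicate"
)


def propagate_duplicate_flag(alignments):
    """Mark every alignment of a read whose primary is a duplicate."""
    # A read is identified by its query name plus read1/read2 status.
    duplicate_reads = {
        (a.qname, a.is_read1)
        for a in alignments
        if a.is_duplicate and not a.is_supplementary
    }
    return [
        a._replace(is_duplicate=True)
        if (a.qname, a.is_read1) in duplicate_reads
        else a
        for a in alignments
    ]


records = [
    Alignment("r1", True, False, True),   # primary, already marked duplicate
    Alignment("r1", True, True, False),   # supplementary of the same read
    Alignment("r2", True, True, False),   # supplementary of a non-duplicate
]
marked = propagate_duplicate_flag(records)
```

Because the lookup is keyed on query name, a queryname sorted input (as required for Picard above) would let a streaming version do the same thing one read group at a time.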
 62 | 
 63 | ## Indel realignment
 64 | This computationally expensive data processing step is dispensable given the state of current indel detection algorithms and will not be performed.
 65 | 
 66 | ## Base quality score recalibration
 67 | There was discussion about dropping BQSR given evidence that the impact on variant calling performance is minimal. However, given that this project will involve combined analysis of data from multiple centers and numerous sequencers, generated over multiple years, and that we cannot ensure the consistency of Illumina base-calling software over time, we decided that it is preferable to perform BQSR.  We evaluated two tools, GATK BaseRecalibrator (both GATK3 and GATK4) and bamUtil.
 68 | 
 69 | Standard:
 70 | * We will use the following files from the [GATK hg38 bundle](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/) for the site list:
 71 |     * Homo_sapiens_assembly38.dbsnp138.vcf
 72 |     * Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
 73 |     * Homo_sapiens_assembly38.known_indels.vcf.gz
 74 | * The recalibration table may optionally be generated using only the autosomes (chr1-chr22)
 75 | * Downsampling of the reads is optional
 76 | * Use of the per-base alignment quality (BAQ) algorithm is optional
 77 | 
 78 | Command line:
 79 | 
 80 | For users of GATK, the following command line options should be utilized for the BaseRecalibrator tool:
 81 | ```
 82 | -R ${ref_fasta} \
 83 | -I ${input_bam} \
 84 | -O ${recalibration_report_filename} \
 85 | -knownSites "Homo_sapiens_assembly38.dbsnp138.vcf" \
 86 | -knownSites "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz" \
 87 | -knownSites "Homo_sapiens_assembly38.known_indels.vcf.gz"
 88 | ```
 89 | 
 90 | For users of GATK, the following BaseRecalibrator command line options are optional and can be used to improve efficiency:
 91 | Note: we have tested a downsampling fraction of 0.1.  Lower fractions should be tested for functional equivalence before use.
 92 | ```
 93 | --downsample_to_fraction .1 \
 94 |     -L chr1 \
 95 |     -L chr2 \
 96 |     -L chr3 \
 97 |     -L chr4 \
 98 |     -L chr5 \
 99 |     -L chr6 \
100 |     -L chr7 \
101 |     -L chr8 \
102 |     -L chr9 \
103 |     -L chr10 \
104 |     -L chr11 \
105 |     -L chr12 \
106 |     -L chr13 \
107 |     -L chr14 \
108 |     -L chr15 \
109 |     -L chr16 \
110 |     -L chr17 \
111 |     -L chr18 \
112 |     -L chr19 \
113 |     -L chr20 \
114 |     -L chr21 \
115 |     -L chr22
116 | ```
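
Rather than typing the 22 autosome intervals by hand, the `-L` arguments shown above could be generated. A small sketch, assuming the `chr`-prefixed contig names of the reference used here; the function name is illustrative.

```python
def autosome_interval_args():
    """Return ['-L', 'chr1', '-L', 'chr2', ..., '-L', 'chr22']."""
    args = []
    for n in range(1, 23):
        args += ["-L", "chr%d" % n]
    return args


interval_args = autosome_interval_args()
print(" ".join(interval_args))
```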
117 | 
118 | For users of GATK, the following command line options are optional:
119 | * `-rf BadCigar`
120 | * `--preserve_qscores_less_than 6`
121 | * `--disable_auto_index_creation_and_locking_when_reading_rods`
122 | * `--disable_bam_indexing`
123 | * `-nct`
124 | * `--useOriginalQualities`
125 | 
126 | ## Base quality score binning scheme
127 | Additional base quality score compression is required to reduce file size.  It is possible to achieve this with minimal adverse impacts on variant calling.
128 | 
129 | Standard:
130 | * 4-bin quality score compression. The 4-bin scheme is 2-6, 10, 20, 30. The 2-6 scores correspond to Illumina error codes and will be left as-is by recalibration.
131 | * Bin base quality scores by rounding off to the nearest bin value, in probability space. This feature is already implemented in the current version of GATK.
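
To make "rounding off to the nearest bin value, in probability space" concrete, here is a minimal sketch. It is not GATK's implementation: the function names are hypothetical, and the `preserve_below=6` threshold is an assumption that mirrors the optional `--preserve_qscores_less_than 6` setting.

```python
def phred_to_error(q):
    """Convert a Phred-scaled quality to an error probability."""
    return 10.0 ** (-q / 10.0)


def bin_quality(q, bins=(10, 20, 30), preserve_below=6):
    """Map a quality score to the bin whose error probability is closest."""
    if q < preserve_below:
        return q  # low scores are left as-is
    p = phred_to_error(q)
    return min(bins, key=lambda b: abs(phred_to_error(b) - p))
```

Working in probability space shifts the boundaries relative to rounding on the Phred scale: Q12 (error 0.063) is closer to Q10 (0.1) than to Q20 (0.01), but Q13 (error 0.050) already rounds to Q20.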
132 | 
133 | Command line:
134 | 
135 | For users of GATK, the following command line options should be utilized for the PrintReads (GATK3) or ApplyBQSR (GATK4) tool:
136 | ```
137 | -R ${ref_fasta} \
138 | -I ${input_bam} \
139 | -O ${output_bam_basename}.bam \
140 | -bqsr ${recalibration_report} \
141 | -SQQ 10 -SQQ 20 -SQQ 30 \
142 | --disable_indel_quals
143 | ```
144 | 
145 | For users of GATK, the following command line options are optional:
146 | * `--globalQScorePrior -1.0`
147 | * `--preserve_qscores_less_than 6`
148 | * `--useOriginalQualities`
149 | * `-nct`
150 | * `-rf BadCigar`
151 | * `--createOutputBamMD5`
152 | * `--addOutputSAMProgramRecord`
153 | 
154 | ## File format
155 | Each center should use the same file format, while retaining flexibility to include additional information for specific centers or projects.
156 | 
157 | Standard:
158 | * Lossless CRAM. Upon conversion to BAM, the BAM file should be valid according to Picard’s ValidateSamFile.
159 | * Read group (@RG) tags should be present for all reads.
160 |    * The header for the RG should contain minimally the ID tag, PL tag, PU tag, SM tag, and LB tag.
161 |    * The CN tag is recommended.
162 |    * Other tags are optional.
163 |    * The ID tag must be unique within the CRAM. ID tags may be freely renamed to maintain uniqueness when merging CRAMs. No assumptions should be made about the permanence of RG IDs. 
164 |    * The PL tag should indicate the instrument vendor name according to the SAM spec (CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO).  PL values are case insensitive.
165 |    * The PU tag is used for grouping reads for BQSR and should uniquely identify reads as belonging to a sample-library-flowcell-lane (or other appropriate recalibration unit) within the CRAM file. PU is not required to contain values for fields that are uniform across the CRAM (e.g., single sample CRAM or single library CRAM). The PU tag is not guaranteed to be sufficiently informative after merging with other CRAMs, and anyone performing a merge should consider modifying PU values appropriately.
166 |    * SM should contain the individual identifier for the sample (e.g., NA12878) without any other process or aliquot-specific information.
167 |    * The LB tag should uniquely identify the library for the sample; it must be present even if there is only a single library per sample or CRAM file.
168 |    * If the PM tag is used, values should conform to one of the following (for Illumina instruments): “HiSeq-X”, “HiSeq-4000”, “HiSeq-2500”, “HiSeq-2000”, “NextSeq-500”, or “MiSeq”.
169 | * Retain original query names.
170 | * Retain @PG records for bwa, duplicate marking, quality recalibration, and any other tools that were run on the data.
171 | * Retain the minimal set of tags (RG, MQ, MC and SA).  NOTE: an additional tool may be needed to add the MQ and MC tags if none of the tools add these tags otherwise.  One option is to pipe the alignment through [samblaster](https://github.com/GregoryFaust/samblaster) with the options `-a --addMateTags` as it comes out of BWA.
172 | * Groups can add custom tags as needed.
173 | * Do not retain the original base quality scores (OQ tag).
174 | * It is recommended that users use samtools version >=1.3.1 to convert from BAM/SAM to CRAM (the use of htsjdk/Picard/GATK for converting BAM to CRAM is not currently condoned). Users who would like to convert back from CRAM to BAM (and want to avoid ending up with an invalid BAM) need to either convert to SAM and then to BAM (piping works) or compile samtools with HTSLib version >=1.3.2. To do the latter, configure the samtools build with the parameter `--with-htslib=/path/to/htslib-1.3.2`.
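
The read group requirements in this section (ID, PL, PU, SM, and LB present; PL drawn from the SAM spec list and case insensitive) lend themselves to a simple header check. A sketch only: `parse_rg_line` and `rg_problems` are hypothetical helpers, and they inspect a single tab-separated `@RG` header line rather than a CRAM file.

```python
REQUIRED_RG_TAGS = ("ID", "PL", "PU", "SM", "LB")
VALID_PLATFORMS = {
    "CAPILLARY", "LS454", "ILLUMINA", "SOLID",
    "HELICOS", "IONTORRENT", "ONT", "PACBIO",
}


def parse_rg_line(line):
    """Split an @RG header line into a {tag: value} dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    return dict(f.split(":", 1) for f in fields[1:])


def rg_problems(line):
    """Return a list of human-readable problems; empty if none found."""
    tags = parse_rg_line(line)
    problems = [
        "missing %s tag" % t for t in REQUIRED_RG_TAGS if t not in tags
    ]
    if "PL" in tags and tags["PL"].upper() not in VALID_PLATFORMS:
        problems.append("unrecognized PL value: %s" % tags["PL"])
    return problems
```

For instance, `rg_problems("@RG\tID:rg1\tPL:illumina\tPU:unit1\tSM:NA12878")` would report only the missing LB tag, since PL values are compared case insensitively.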
175 | 
176 | # Functional equivalence evaluation
177 | All pipelines used for this effort need to be validated as functionally equivalent.  The validation methodology will be published alongside a test data set.
178 | 
179 | # Pathway for updates to this standard
180 | Pipelines will need to be updated during the project, but this should be a tightly controlled process given the need to reprocess vast amounts of data each time substantial pipeline modifications occur.
181 | 
182 | Draft plan:
183 | * Initial pipeline versions should serve for project years 1-2.
184 | * Efficiency updates passing functional equivalence tests are always allowed.
185 | * Propose to start a review process in late 2017: invite proposals for pipeline updates that incorporate new aligners, reference genomes and data processing steps.
186 | * If substantial improvements are achievable, implement new pipelines for project years 3-4.
187 | * There will need to be a decision about how large the potential variant calling improvements should be to warrant pipeline modification and data reprocessing.
188 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Functionally equivalent whole genome sequencing analysis pipelines
2 | 
3 | ### This pipeline standard was developed to aid in coordination of the Centers for Common Disease Genomics project. It was tested with HiSeq X data on pipeline implementations from five centers.
4 | 
5 | For detailed description, please see the following article:
6 | __[Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects.](https://www.ncbi.nlm.nih.gov/pubmed/30279509)__ _Nat Commun. 2018 Oct 2;9(1):4038_
7 | 


--------------------------------------------------------------------------------
/scripts/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Allison Regier
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/scripts/classify_mie.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import re
 4 | import sys
 5 | 
 6 | class Uninformative(RuntimeError):
 7 |     pass
 8 | 
 9 | class MIE(object):
10 |     def __init__(self):
11 |         pass
12 | 
13 |     def __call__(self, fields):
14 |         #assume last 3 columns are the genotypes in KID DAD MOM order
15 |         kid, dad, mom = [ re.split('[/|]', x) for x in fields[-3:] ]
16 |         dad_alleles, mom_alleles, kid_alleles = map(set, (dad, mom, kid))
17 |         if len(dad_alleles) == 2 and len(mom_alleles) == 2 and len(dad_alleles.union(mom_alleles).union(kid_alleles)) == 2:
18 |             # both are het and not a multi-allelic site, this site is UNINFORMATIVE assuming input includes no missing genotypes
19 |             raise Uninformative
20 |         a1, a2 = kid
21 |         classification = 'MIE'
22 |         if (a1 in dad_alleles and a2 in mom_alleles) or (a2 in dad_alleles and a1 in mom_alleles):
23 |             # NOT AN MIE
24 |             classification = 'NONE'
25 |         return fields[:-3] + [ classification ]
26 | 
27 | if __name__ == '__main__':
28 |     mie_classify = MIE()
29 |     for line in sys.stdin:
30 |         fields = line.rstrip().split('\t')
31 |         try:
32 |             print('\t'.join(mie_classify(fields)))
33 |         except Uninformative:
34 |             pass
35 | 
36 | 


--------------------------------------------------------------------------------
/scripts/compare_based_on_strand_output_bedpe.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | from __future__ import division
  3 | 
  4 | import sys
  5 | import argparse
  6 | 
  7 | #TODO what do we do with missing data? Right now is counted as discordance or match depending on if the other sample has data or not
  8 | 
  9 | class Bedpe(object):
 10 |     def __init__(self, bed_list):
 11 |         self.c1 = bed_list[0]
 12 |         self.s1 = int(bed_list[1])
 13 |         self.e1 = int(bed_list[2])
 14 |         self.c2 = bed_list[3]
 15 |         self.s2 = int(bed_list[4])
 16 |         self.e2 = int(bed_list[5])
 17 |         self.name = bed_list[6]
 18 |         self.score = bed_list[7]
 19 |         self.o1 = bed_list[8]
 20 |         self.o2 = bed_list[9]
 21 |         self.svtype = bed_list[10]
 22 |         self.filter = bed_list[11]
 23 |         self.orig_name1 = bed_list[12]
 24 |         self.orig_ref1 = bed_list[13]
 25 |         self.orig_alt1 = bed_list[14]
 26 |         self.orig_name2 = bed_list[15]
 27 |         self.orig_ref2 = bed_list[16]
 28 |         self.orig_alt2 = bed_list[17]
 29 |         self.info1 = bed_list[18]
 30 |         self.info2 = bed_list[19]
 31 |         # remainder will be entries then second bedpe entry will follow
 32 |         self.format = bed_list[20]
 33 |         self.misc = bed_list[21:]
 34 | 
 35 |     def toString(self, status, sample, region):
 36 |         return '\t'.join(str(v) for v in (self.c1, self.s1, self.e1, self.c2, self.s2, self.e2, self.name, self.score, self.o1, self.o2, self.svtype, self.filter, self.orig_name1, self.orig_ref1, self.orig_alt1, self.orig_name2, self.orig_ref2, self.orig_alt2, self.info1+";Regions="+region, self.info2, self.format+":RE", self.misc[sample]+":"+status))+"\n"
 37 | 
 38 | class PairToPair(object):
 39 |     def __init__(self, sample_header, out_dir, region):
 40 |         self._load_samples(sample_header)
 41 |         self.outFiles = [ open(out_dir+"/"+x+".bedpe", 'a') for x in self.samples ]
 42 |         self.region = region
 43 |         self.deferred = dict()
 44 | 
 45 |     def _load_samples(self, sample_header):
 46 |         with open(sample_header) as fh:
 47 |             for line in fh:
 48 |                 fields = line.rstrip().split('\t')
 49 |                 self.samples = fields[21:]
 50 |                 self.num_samples = len(self.samples)
 51 |                 self.index = 21 + self.num_samples
 52 | 
 53 |     def process(self, fields_list):
 54 |         bedpe1 = Bedpe(fields_list[0:self.index])
 55 |         if len(fields_list) == self.index:
 56 |             # this is unique to the first file
 57 |             # process the line and record variants as unique
 58 |             for index, x in enumerate(bedpe1.misc):
 59 |                 fields = x.rstrip().split(':', 1)
 60 |                 #self.results[index].unique += 1
 61 |                 if fields[0] in ('0/1', '1/1'):
 62 |                     #should be a variant
 63 |                     self.outFiles[index].write(bedpe1.toString("0-only", index, self.region))
 64 |         else:
 65 |             # this MAY be a match
 66 |             # check strands and then determine class
 67 |             # for each sample record results
 68 |             bedpe2 = Bedpe(fields_list[self.index:])
 69 |             if strands_match(bedpe1.o1, bedpe1.o2, bedpe2.o1, bedpe2.o2):
 70 |                 # matched by strand
 71 |                 if bedpe1.name in self.deferred:
 72 |                     del self.deferred[bedpe1.name]
 73 |                 for index in range(self.num_samples):
 74 |                     discordant_type = not types_match(bedpe1.svtype, bedpe2.svtype)
 75 |                     gt1 = gt_for_index(bedpe1, index)
 76 |                     gt2 = gt_for_index(bedpe2, index)
 77 |                     if gt1 in ('0/1', '1/1') or gt2 in ('0/1', '1/1'):
 78 |                         if gt1 == gt2:
 79 |                             self.outFiles[index].write(bedpe1.toString("match", index, self.region))
 80 |                         else:
 81 |                             self.outFiles[index].write(bedpe1.toString("discordant", index, self.region))
 82 |             else:
 83 |                 self.deferred[bedpe1.name] = bedpe1
 84 | 
 85 |     def process_shadow_unique(self):
 86 |         # count up things that overlapped but ended up being unique
 87 |         for bedpe in self.deferred.values():
 88 |             for index, x in enumerate(bedpe.misc):
 89 |                 gt = gt_for_index(bedpe, index)
 90 |                 if gt in ('0/1', '1/1'):
 91 |                     self.outFiles[index].write(bedpe.toString("0-only", index, self.region))
 92 | 
 93 |     def finalize(self):
 94 |         for index, f in enumerate(self.outFiles):
 95 |             f.close()
 96 | 
 97 | def gt_for_index(bedpe, index):
 98 |     format_fields = bedpe.misc[index].rstrip().split(':', 1)
 99 |     return format_fields[0]
100 | 
101 | 
102 | def strands_match(eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b):
103 |     if '.' in [eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b]:
104 |         return False
105 |     if eval_strand_a != eval_strand_b:
106 |         return (eval_strand_a == compare_strand_a 
107 |                 and eval_strand_b == compare_strand_b)
108 |     else:
109 |         return compare_strand_a == compare_strand_b
110 | 
111 | def types_match(type1, type2):
112 |     return type1 == type2
113 | 
114 | if __name__ == '__main__':
115 |     parser = argparse.ArgumentParser(description='Filter output of pairtopair based on strand.')
116 |     parser.add_argument('-s', '--sample-header', metavar='<FILE>', type=str, help='Filename of file containing sample header of the bedpe')
117 |     parser.add_argument('-o', '--outDir', metavar='<STR>', type=str, help='Directory where output files should be written')
118 |     parser.add_argument('-r', '--region', metavar='<STR>', type=str, help='Region to record in INFO field')
119 |     args = parser.parse_args()
120 |     
121 |     p = PairToPair(args.sample_header, args.outDir, args.region)
122 |     try:
123 |         for line in sys.stdin:
124 |             fields = line.rstrip().split('\t')
125 |             p.process(fields)
126 |         p.process_shadow_unique()
127 |         p.finalize()
128 |     except ValueError as e:
129 |         sys.exit(e)
130 | 
131 | 


--------------------------------------------------------------------------------
/scripts/compare_round3_by_region.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | set -eo pipefail
 4 | BEDTOOLS=$1 #path to bedtools executable
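 5 | pad1bp () {  # pad both BEDPE intervals by 1 bp and write <file>.padded (not invoked below, so the .padded inputs must already exist)
 6 |     perl -ape '$F[1] -= 1; $F[2]+=1; $F[4] -= 1; $F[5] += 1; $_ = join("\t", @F)."\n"' "$1" > "$1.padded"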
 7 | }
 8 | 
 9 | OUTPUT_DIR=$2 #output directory
10 | OUTPUT=$OUTPUT_DIR/results.txt
11 | DIFFICULT=$3 #BED file of difficult regions
12 | REST=$4 #BED file of remaining regions
13 | SAMPLE_HEADER=$5 #BEDPE header line listing all samples in correct order
14 | echo "Pipeline1	Pipeline2	Sample	Class	Count" > $OUTPUT
15 | 
16 | #file containing list of pipelines and paths to final BEDPE files from each pipeline,
17 | #separated by tabs
18 | BEDPE_LIST=$6
19 | while read pipeline1 pipeline1_file
20 | do
21 |     while read pipeline2 pipeline2_file
22 |     do
23 |         if [ "$pipeline1" == "$pipeline2" ]; then
24 |             continue
25 |         fi
26 |         out_dir=$OUTPUT_DIR/${pipeline1}_${pipeline2}
27 |         mkdir -p $out_dir/logs
28 | 
29 |         for sample in $( cut -f 22- $SAMPLE_HEADER ); do
30 |             grep "^##" $pipeline1_file > $out_dir/$sample.bedpe
31 |             echo "##FORMAT=<ID=RE,Number=1,Type=String,Description=\"Matching status in pairwise comparison\">" >> $out_dir/$sample.bedpe
32 |             echo "##INFO=<ID=Regions,Number=1,Type=String,Description=\"Genomic regions\">" >> $out_dir/$sample.bedpe
33 |             grep "^#CHROM_A" $pipeline1_file | cut -f 1-21 | sed "s/$/\t$sample/" >> $out_dir/$sample.bedpe
34 |         done
35 |         $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type both -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r hard
36 |         $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type both -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r medium
37 |         $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type both -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type neither | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r easy
38 |         $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type notboth -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r hard
39 |         $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type notboth -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type either | rev | cut -f 4- | rev | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r medium
40 |         $BEDTOOLS pairtopair -is -a $pipeline1_file.padded -b $pipeline2_file.padded -type notboth -slop 50 | $BEDTOOLS pairtobed -a - -b $DIFFICULT -type neither | $BEDTOOLS pairtobed -a - -b $REST -type neither | sort -u | python compare_based_on_strand_output_bedpe.py -s $SAMPLE_HEADER -o $out_dir -r easy
41 | 
42 |     done < $BEDPE_LIST
43 | done < $BEDPE_LIST
44 | 


--------------------------------------------------------------------------------
/scripts/compare_single_sample_based_on_strand.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | from __future__ import division
  3 | 
  4 | import sys
  5 | import argparse
  6 | 
  7 | #TODO what do we do with missing data? Right now is counted as discordance or match depending on if the other sample has data or not
  8 | 
  9 | class Result(object):
 10 |     def __init__(self):
 11 |         self.unique = 0
 12 |         self.match = 0
 13 |         self.discordant = 0
 14 |         self.match_discordant_type = 0
 15 |         self.discordant_discordant_type = 0
 16 |         self.unmatched = dict()
 17 | 
 18 | class Bedpe(object):
 19 |     def __init__(self, bed_list):
 20 |         self.c1 = bed_list[0]
 21 |         self.s1 = int(bed_list[1])
 22 |         self.e1 = int(bed_list[2])
 23 |         self.c2 = bed_list[3]
 24 |         self.s2 = int(bed_list[4])
 25 |         self.e2 = int(bed_list[5])
 26 |         self.name = bed_list[6]
 27 |         self.score = bed_list[7]
 28 |         self.o1 = bed_list[8]
 29 |         self.o2 = bed_list[9]
 30 |         self.svtype = bed_list[10]
 31 |         self.filter = bed_list[11]
 32 |         self.orig_name1 = bed_list[12]
 33 |         self.orig_ref1 = bed_list[13]
 34 |         self.orig_alt1 = bed_list[14]
 35 |         self.orig_name2 = bed_list[15]
 36 |         self.orig_ref2 = bed_list[16]
 37 |         self.orig_alt2 = bed_list[17]
 38 |         self.info1 = bed_list[18]
 39 |         self.info2 = bed_list[19]
 40 |         # remainder will be entries then second bedpe entry will follow
 41 |         self.format = bed_list[20]
 42 |         self.misc = bed_list[21:]
 43 | 
 44 | class PairToPair(object):
 45 |     def __init__(self, label):
 46 |         self.label = label
 47 |         self.results = Result()
 48 |         self.deferred = dict()
 49 | 
 50 |     def process(self, fields_list):
 51 |         bedpe1 = Bedpe(fields_list[0:22])
 52 |         if len(fields_list) == 22:
 53 |             # this is unique to the first file
 54 |             # process the line and record variants as unique
 55 |             for index, x in enumerate(bedpe1.misc):
 56 |                 fields = x.rstrip().split(':', 1)
 57 |                 if fields[0] in ('0/1', '1/1'):
 58 |                     #should be a variant
 59 |                     self.results.unique += 1
 60 |         else:
 61 |             # this MAY be a match
 62 |             # check strands and then determine class
 63 |             # for each sample record results
 64 |             bedpe2 = Bedpe(fields_list[22:])
 65 |             if strands_match(bedpe1.o1, bedpe1.o2, bedpe2.o1, bedpe2.o2):
 66 |                 # matched by strand
 67 |                 if bedpe1.name in self.deferred:
 68 |                     del self.deferred[bedpe1.name]
 69 |                 for index in range(1):
 70 |                     discordant_type = not types_match(bedpe1.svtype, bedpe2.svtype)
 71 |                     gt1 = gt_for_index(bedpe1, index)
 72 |                     gt2 = gt_for_index(bedpe2, index)
 73 |                     if gt1 in ('0/1', '1/1') or gt2 in ('0/1', '1/1'):
 74 |                         if gt1 == gt2:
 75 |                             self.results.match += 1
 76 |                             if discordant_type:
 77 |                                 self.results.match_discordant_type += 1
 78 |                         else:
 79 |                             self.results.discordant += 1
 80 |                             if discordant_type:
 81 |                                 self.results.discordant_discordant_type += 1
 82 |             else:
 83 |                 self.deferred[bedpe1.name] = bedpe1
 84 | 
 85 |     def process_shadow_unique(self):
 86 |         # count up things that overlapped but ended up being unique
 87 |         for bedpe in self.deferred.values():
 88 |             for index, x in enumerate(bedpe.misc):
 89 |                 gt = gt_for_index(bedpe, index)
 90 |                 if gt in ('0/1', '1/1'):
 91 |                     self.results.unique += 1
 92 | 
 93 |     def summarize(self):
 94 |         unique_label = '-'.join((self.label, 'only'))
 95 |         print('\t'.join((unique_label, str(self.results.unique))))
 96 |         print('\t'.join(('match', str(self.results.match))))
 97 |         print('\t'.join(('discordant', str(self.results.discordant))))
 98 |         print('\t'.join(('match_discordant_type', str(self.results.match_discordant_type))))
 99 |         print('\t'.join(('discordant_discordant_type', str(self.results.discordant_discordant_type))))
100 | 
101 | def gt_for_index(bedpe, index):
102 |     format_fields = bedpe.misc[index].rstrip().split(':', 1)
103 |     return format_fields[0]
104 | 
105 | 
106 | def strands_match(eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b):
107 |     if '.' in [eval_strand_a, eval_strand_b, compare_strand_a, compare_strand_b]:
108 |         return False
109 |     if eval_strand_a != eval_strand_b:
110 |         return (eval_strand_a == compare_strand_a 
111 |                 and eval_strand_b == compare_strand_b)
112 |     else:
113 |         return compare_strand_a == compare_strand_b
114 | 
115 | def types_match(type1, type2):
116 |     return type1 == type2
117 | 
118 | if __name__ == '__main__':
119 |     parser = argparse.ArgumentParser(description='Filter output of pairtopair based on strand.')
120 |     parser.add_argument('-l', '--label', metavar='<STR>', type=str, help='Label for a file')
121 |     args = parser.parse_args()
122 |     
123 |     p = PairToPair(args.label)
124 |     try:
125 |         for line in sys.stdin:
126 |             fields = line.rstrip().split('\t')
127 |             p.process(fields)
128 |         p.process_shadow_unique()
129 |         p.summarize()
130 |     except ValueError as e:
131 |         sys.exit(e)
132 | 
133 | 


--------------------------------------------------------------------------------