├── .gitignore ├── LICENSE ├── README.md ├── figures └── chr1-934098-934868-HG03687-DEL.png └── workflows ├── conf ├── create_config.sh ├── samplot-ml-predict.yaml └── test.txt ├── config_utils.py ├── envs ├── samplot.yaml └── tensorflow.yaml ├── samplot-ml-predict.smk ├── saved_models └── samplot-ml.h5 └── scripts ├── annotate.py ├── crop.sh ├── gen_img.sh ├── get_del_regions.sh ├── images_from_regions.sh ├── install_gargs.sh ├── predict.py ├── test.sh └── utils ├── datasets.py └── models.py /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | *.swp 3 | __pycache__ 4 | .pylintrc 5 | .snakemake 6 | *.org 7 | *.backup 8 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Murad Chowdhury 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Samplot-ML 2 | [![DOI](https://zenodo.org/badge/191653284.svg)](https://zenodo.org/badge/latestdoi/191653284) 3 | 4 | [Samplot](https://github.com/ryanlayer/samplot) is a command line tool for rapid, multi-sample structural variant visualization. Samplot takes SV coordinates and bam files and produces high-quality images that highlight any alignment and depth signals that substantiate the SV. 5 | 6 | Samplot-ML is a convolutional neural network trained to identify false positive deletion SVs using Samplot images. The workflow for Samplot-ML is simple: given a whole-genome sequenced sample (BAM or CRAM) as well as a set of putative deletions (VCF), Samplot-ML re-genotypes each putative deletion using the Samplot-generated image. The result is a call set where most false positives are flagged. 7 | 8 | This repository provides a snakemake workflow for annotating an SV callset with Samplot-ML's predictions. 9 | 10 | To cite Samplot-ML, please use: 11 | >Belyeu, J.R., Chowdhury, M., Brown, J. et al. Samplot: a platform for structural variant visual validation and automated filtering. Genome Biol 22, 161 (2021). https://doi.org/10.1186/s13059-021-02380-5 12 | 13 | ## Dependencies 14 | * `bz2` and `zlib` devel libraries. 
On Ubuntu systems, you can use: 15 | ``` 16 | $ sudo apt install libbz2-dev zlib1g-dev 17 | ``` 18 | * `conda` - for a minimal install of conda, check out [miniconda](https://docs.conda.io/en/latest/miniconda.html) 19 | 20 | * `mamba` - [drop-in replacement](https://github.com/mamba-org/mamba) for conda's package manager. Used to install `snakemake` 21 | ``` 22 | $ conda install -n base -c conda-forge mamba 23 | ``` 24 | 25 | * `snakemake` 26 | ``` 27 | $ mamba create -c conda-forge -c bioconda -n snakemake snakemake 28 | ``` 29 | 30 | * `aws cli` (optional) - needed only if you plan on using data sources from s3 buckets in the samplot-ml workflow. 31 | 32 | * Any other dependencies will be handled by snakemake through conda. 33 | 34 | ## Example Usage 35 | To demonstrate how to use Samplot-ML, let's work through a simple example that takes us from calling SVs to executing the Samplot-ML snakefile using some data from the 1000 Genomes Project (1kg). 36 | 37 | ### Calling SVs with [smoove](https://github.com/brentp/smoove) 38 | 1. Download a CRAM file from the 1kg FTP server 39 | ``` 40 | $ wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3242876/HG03687.final.cram 41 | $ wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3242876/HG03687.final.cram.crai 42 | ``` 43 | 2. Get the reference genome 44 | ``` 45 | $ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa 46 | $ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai 47 | ``` 48 | 49 | 3. Get a set of exclude regions in bed format for use during SV calling 50 | ``` 51 | $ wget https://raw.githubusercontent.com/hall-lab/speedseq/master/annotations/exclude.cnvnator_100bp.GRCh38.20170403.bed 52 | ``` 53 | 54 | 4. Install smoove with conda 55 | ``` 56 | $ conda create -c bioconda -n smoove smoove 57 | ``` 58 | 5. 
Call SVs 59 | ``` 60 | $ conda activate smoove 61 | $ smoove call -x \ 62 | --name HG03687 \ 63 | --exclude $exclude_bed \ 64 | --fasta $reference \ 65 | --genotype \ 66 | --outdir $outdir \ 67 | $cram 68 | ``` 69 | The resulting vcf will be `$outdir/HG03687-smoove.genotyped.vcf.gz`. 70 | 71 | 6. Clone the Samplot-ML git repo 72 | 73 | 7. Next, let's edit the config file located at `samplot-ml/workflows/conf/samplot-ml-predict.yaml` 74 | ``` 75 | samples: 76 | HG03687: "/path/to/cram" # or you can use "s3://bucket/bam_or_cram" if you've got alignments in an s3 bucket. 77 | fasta: 78 | data_source: "local" # either local or s3 79 | file: "/path/to/reference" # or "s3://bucket/reference_file" 80 | fai: 81 | data_source: "local" 82 | file: "/path/to/reference_index" 83 | vcf: 84 | data_source: "local" 85 | file: "/path/to/vcf" 86 | 87 | # generated images will have filename: ${contig}-${start}-${end}-DEL.png 88 | # we give the choice of delimiter since contigs can sometimes contain 89 | # characters like hyphens, underscores, etc. 90 | image_filename_delimiter: "-" 91 | outdir: "/path/to/output_directory" 92 | ``` 93 | 94 | 8. Run the prediction snakefile located at `samplot-ml/workflows/samplot-ml-predict.smk` 95 | ``` 96 | conda activate snakemake 97 | snakemake -s samplot-ml-predict.smk \ 98 | -j $num_threads \ 99 | --use-conda --conda-frontend mamba # -j: number of parallel threads used to execute jobs; --use-conda: lets snakemake handle dependencies 100 | ``` 101 | 102 | Generated images of DEL regions from the input VCF will be located at `$outdir/img/`; each image will be named `${contig}-${start}-${end}-DEL.png`. 
An annotated vcf containing the Samplot-ML predictions will be located at `$outdir/samplot-ml-results/HG03687-samplot-ml.vcf.gz` 103 | 104 | ### VCF annotations 105 | The resulting prediction vcf will contain the following format fields: 106 | * PREF, PHET, and PALT: the prediction score assigned by the model which corresponds to a prediction of a 0/0, 0/1, or 1/1 genotype, respectively. If the region in the input vcf was originally a 0/0, then these fields will contain 'nan' values 107 | * OLDGT: The original genotype of the region from the input SV callset. 108 | * If the predicted genotype differed from the input genotype, then the model will replace the GT field with the predicted genotype. 109 | 110 | ### Back to our example 111 | Now that we've got our annotated VCF, let's inspect one of the predictions and compare it with the samplot image. 112 | ``` 113 | # get the first DEL region from the vcf. Print out the filename of the 114 | # samplot image, the original genotype and the samplot-ml predicted genotype. 115 | $ bcftools query -i 'SVTYPE="DEL"' -f '%CHROM-%POS-%INFO/END-[%SAMPLE].png\t[%OLDGT\t%GT]\n' HG03687-samplot-ml.vcf.gz | head -1 116 | 117 | output: 118 | chr1-934098-934868-HG03687.png 0/1 1/1 119 | ``` 120 | 121 | It seems that in the very first deletion we came across, there was a difference between SVTyper's prediction and Samplot-ML's prediction. We went from a heterozygous deletion (0/1) to a homozygous alternate deletion (1/1). Let's take a look at the image in question. 
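Before turning to the image, here is a minimal sketch of how a genotype call follows from the three scores. It mirrors the argmax step in `workflows/scripts/annotate.py` but drops pysam and numpy so it runs on its own; the function name `predicted_genotype` and the example scores are illustrative assumptions, not part of the workflow or taken from a real run.

```python
# Sketch only: reproduces the score-to-genotype mapping used when the
# workflow rewrites the GT field. Names and numbers here are hypothetical.
GENOTYPES = {0: "0/0", 1: "0/1", 2: "1/1"}


def predicted_genotype(pref, phet, palt):
    """Return the genotype string whose prediction score is largest."""
    scores = [pref, phet, palt]
    # Index of the largest score; ties resolve to the first maximum.
    best = max(range(len(scores)), key=scores.__getitem__)
    return GENOTYPES[best]


# Made-up scores for a call that would flip from 0/1 to 1/1.
print(predicted_genotype(0.02, 0.18, 0.80))  # prints 1/1
```

With scores like these, a record whose OLDGT is 0/1 would end up with GT 1/1, which is the kind of change seen in the query output above.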
122 | ![samplot-image](https://github.com/mchowdh200/samplot-ml/raw/master/figures/chr1-934098-934868-HG03687-DEL.png) 123 | -------------------------------------------------------------------------------- /figures/chr1-934098-934868-HG03687-DEL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchowdh200/samplot-ml/ec5dab5d19c094f03070fba25d3b8db780a4e5d3/figures/chr1-934098-934868-HG03687-DEL.png -------------------------------------------------------------------------------- /workflows/conf/create_config.sh: -------------------------------------------------------------------------------- 1 | #!/bin/env bash 2 | set -eu 3 | 4 | while (( "$#" )); do 5 | case "$1" in 6 | -b|--bam-bucket) 7 | bam_bucket=$2 8 | shift 2;; 9 | -f|--fasta) 10 | fasta=$2 11 | shift 2;; 12 | -i|--fai) 13 | fai=$2 14 | shift 2;; 15 | -v|--vcf) 16 | vcf=$2 17 | shift 2;; 18 | -d|--delimiter) 19 | delimiter=$2 20 | shift 2;; 21 | -o|--outdir) 22 | outdir=$2 23 | shift 2;; 24 | --) # end argument parsing 25 | shift 26 | break;; 27 | -*|--*=) # unsupported flags 28 | echo "Error: Unsupported flag $1" >&2 29 | exit 1;; 30 | esac 31 | done 32 | 33 | echo "samples:" 34 | 35 | # filenames 36 | bams=$(aws s3 ls $bam_bucket/ | grep -E '*.bam$' | sed -E 's/\s+/\t/g' | cut -f4) 37 | # OR do the same thing with a local directory of bams 38 | # TODO for now I'm just going to download bams locally and create this config 39 | # bams=$(ls $bam_bucket | grep -E '*.bam$') 40 | 41 | ### Sample bam pairs 42 | # get the sample name from the bam header 43 | for bam in $bams; do 44 | sample=$(samtools view -H $bam_bucket/$bam | grep SM -m1 | 45 | awk 'BEGIN{RS="\t"; FS=":"} /SM/ {print $2}') 46 | printf " $sample: \"$bam_bucket/$bam\"\n" 47 | done 48 | 49 | ### reference fasta and index 50 | ### TODO fix the s3 or local choice 51 | echo "fasta:" 52 | printf " data_source: \"s3\"\n" 53 | printf " file: \"$fasta\"\n" 54 | 55 | echo "fai:" 56 | 
printf " data_source: \"s3\"\n" 57 | printf " file: \"$fai\"\n" 58 | 59 | echo "vcf:" 60 | printf " data_source: \"s3\"\n" 61 | printf " file: \"$vcf\"\n" 62 | 63 | echo "image_filename_delimiter: \"$delimiter\"" 64 | echo "outdir: \"$outdir\"" 65 | -------------------------------------------------------------------------------- /workflows/conf/samplot-ml-predict.yaml: -------------------------------------------------------------------------------- 1 | samples: 2 | # TODO script to generate this config: 3 | # * Especially if there are going to be a lot of samples 4 | # eg SAMPLE: "PATH/TO/BAM_OR_CRAM" 5 | # or SAMPLE: "s3://BUCKET/BAM_OR_CRAM" 6 | # or SAMPLE: "http://URL/BAM_OR_CRAM" 7 | Yanbian: "s3://layerlabcu/cow/bams/Yanbian_merged_marked.bam" 8 | 9 | fasta: 10 | data_source: "s3" # local or s3 11 | file: "s3://layerlabcu/cow/ARS-UCD1.2_Btau5.0.1Y.prepend_chr.fa" 12 | fai: 13 | data_source: "s3" # local or s3 14 | file: "s3://layerlabcu/cow/ARS-UCD1.2_Btau5.0.1Y.prepend_chr.fa.fai" 15 | 16 | vcf: 17 | data_source: "s3" # local or s3 18 | file: "s3://layerlabcu/cow/VCF/sites.smoove.square.vcf.gz" 19 | 20 | image_filename_delimiter: "-" 21 | outdir: "/home/murad/Repositories/samplot-ml/workflows/temp" 22 | -------------------------------------------------------------------------------- /workflows/conf/test.txt: -------------------------------------------------------------------------------- 1 | abcdefg 2 | -------------------------------------------------------------------------------- /workflows/config_utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | from abc import ABC, abstractmethod 3 | 4 | class Data(ABC): 5 | """ 6 | Abstract class that describes a simple interface for how local/remote 7 | data will be obtained in snakemake rules. 
8 | """ 9 | def __init__(self, input, output): 10 | self.input = input 11 | self.output = output 12 | 13 | @abstractmethod 14 | def get_cmd(self) -> str: 15 | """ 16 | returns a shell directive action (e.g. downloading, or creating symlink) 17 | """ 18 | pass 19 | 20 | 21 | class LocalData(Data): 22 | def __init__(self, input, output): 23 | super().__init__(input, output) 24 | 25 | def get_cmd(self) -> str: 26 | """ 27 | returns symlink command to be run in snakemake shell directive 28 | """ 29 | return f"ln -sr {self.input} {self.output}" 30 | 31 | 32 | class S3Data(Data): 33 | def __init__(self, input, output): 34 | super().__init__(input, output) 35 | 36 | def get_cmd(self) -> str: 37 | """ 38 | returns aws s3 copy command to be run in snakemake shell directive 39 | """ 40 | return f"aws s3 cp {self.input} {self.output}" 41 | 42 | 43 | def data_factory(config, data_type): 44 | """ 45 | Create a Data object whose behavior is determined by the config. 46 | INPUTS: 47 | * config - snakemake config dictionary 48 | * data_type - type of data file (eg 'fasta' or 'vcf') 49 | """ 50 | data_source = config[data_type]['data_source'] 51 | input_file = config[data_type]['file'] 52 | outdir = config['outdir'] 53 | output_file = f'{outdir}/{os.path.basename(input_file)}' 54 | 55 | if data_source == 's3': 56 | return S3Data(input_file, output_file) 57 | elif data_source == 'local': 58 | return LocalData(input_file, output_file) 59 | else: 60 | raise ValueError( 61 | f'Unknown data_source: "{data_source}".' 62 | ' data_source must be either "s3" or "local".') 63 | 64 | 65 | class Conf: 66 | """ 67 | Use the snakemake config dict to set variables and configure the Data 68 | objects which will control the behavior of data acquisition rules. 
69 | """ 70 | def __init__(self, config): 71 | self.samples = config['samples'].keys() # list of sample names 72 | self.alignments = config['samples'] # dict of bam/cram indexed by sample 73 | self.outdir = config['outdir'] 74 | self.delimiter = config['image_filename_delimiter'] 75 | self.fasta = data_factory(config, 'fasta') 76 | self.fai = data_factory(config, 'fai') 77 | self.vcf = data_factory(config, 'vcf') 78 | 79 | -------------------------------------------------------------------------------- /workflows/envs/samplot.yaml: -------------------------------------------------------------------------------- 1 | name: samplot 2 | channels: 3 | - bioconda 4 | - conda-forge 5 | dependencies: 6 | - python=3.6 7 | - matplotlib==3.1.0 8 | - numpy 9 | - cython 10 | - bcftools 11 | - htslib 12 | - imagemagick 13 | - joblib 14 | - samplot 15 | - pip 16 | - pip: 17 | - "git+https://github.com/pysam-developers/pysam.git" 18 | - "git+https://github.com/mchowdh200/samplot.git" 19 | -------------------------------------------------------------------------------- /workflows/envs/tensorflow.yaml: -------------------------------------------------------------------------------- 1 | name: tensorflow 2 | dependencies: 3 | - python=3.7 4 | - h5py=2.10 5 | - joblib 6 | - pip 7 | - pip: 8 | - tensorflow==2.0.0 9 | - tensorflow-addons==0.6.0 10 | 11 | -------------------------------------------------------------------------------- /workflows/samplot-ml-predict.smk: -------------------------------------------------------------------------------- 1 | ## General TODOs 2 | # * provide example scripts on how to programmatically create sample: file pairs 3 | # in the config yaml. 
4 | # * or make a rule that looks at the RG tags to get sample names from list of bams 5 | 6 | import os 7 | import functools 8 | from glob import glob 9 | from config_utils import Conf 10 | 11 | 12 | ## Setup 13 | ################################################################################ 14 | configfile: 'conf/samplot-ml-predict.yaml' 15 | conf = Conf(config) 16 | 17 | ## Rules 18 | ################################################################################ 19 | rule All: 20 | input: 21 | expand(f'{conf.outdir}/samplot-ml-results/{{sample}}-samplot-ml.vcf.gz', 22 | sample=conf.samples) 23 | 24 | 25 | gargs = f'{conf.outdir}/bin/gargs' 26 | rule InstallGargs: 27 | """ 28 | Install system appropriate binary of gargs for use in image generation rules. 29 | """ 30 | output: 31 | gargs 32 | shell: 33 | f'bash scripts/install_gargs.sh {gargs}' 34 | 35 | 36 | rule GetReference: 37 | output: 38 | fasta = conf.fasta.output, 39 | fai = conf.fai.output 40 | run: 41 | shell(conf.fasta.get_cmd()) 42 | shell(conf.fai.get_cmd()) 43 | 44 | 45 | rule GetBaseVCF: 46 | output: 47 | conf.vcf.output 48 | run: 49 | shell(conf.vcf.get_cmd()) 50 | 51 | 52 | rule GetDelRegions: 53 | """ 54 | Get a sample's del regions from the vcf in bed format 55 | """ 56 | input: 57 | conf.vcf.output 58 | output: 59 | f'{conf.outdir}/bed/{{sample}}-del-regions.bed' 60 | conda: 61 | 'envs/samplot.yaml' 62 | shell: 63 | f""" 64 | [[ ! 
-d {conf.outdir}/bed ]] && mkdir {conf.outdir}/bed 65 | bash scripts/get_del_regions.sh {{input}} {{wildcards.sample}} > {{output}} 66 | """ 67 | 68 | 69 | def get_images(rule, wildcards): 70 | """ 71 | Return list of output images from the GenerateImages/CropImages checkpoints 72 | """ 73 | # gets output of the checkpoint (a directory) and re-evals workflow DAG 74 | if rule == "GenerateImages": 75 | image_dir = checkpoints.GenerateImages.get(sample=wildcards.sample).output[0] 76 | elif rule == "CropImages": 77 | image_dir = checkpoints.CropImages.get(sample=wildcards.sample).output[0] 78 | else: 79 | raise ValueError(f'Unknown argument for rule: {rule}. ' 80 | 'Must be "GenerateImages" or "CropImages"') 81 | return glob(f'{image_dir}/*.png') 82 | 83 | 84 | rule GenerateImages: 85 | """ 86 | Images from del regions for a given sample. 87 | """ 88 | threads: workflow.cores 89 | input: 90 | # bam/ file: from config. could be a url 91 | # therefore will not be tracked by snakemake 92 | gargs_bin = gargs, 93 | fasta = conf.fasta.output, 94 | fai = conf.fai.output, 95 | regions = rules.GetDelRegions.output 96 | output: 97 | directory(f'{conf.outdir}/img/{{sample}}') 98 | params: 99 | bam = lambda wildcards: conf.alignments[wildcards.sample] 100 | conda: 101 | 'envs/samplot.yaml' 102 | shell: 103 | # TODO put the gen_img.sh script into a function in images_from_regions.sh 104 | f""" 105 | mkdir -p {conf.outdir}/img 106 | bash scripts/images_from_regions.sh \\ 107 | --gargs-bin {{input.gargs_bin}} \\ 108 | --fasta {{input.fasta}} \\ 109 | --regions {{input.regions}} \\ 110 | --bam {{params.bam}} \\ 111 | --outdir {conf.outdir}/img/{{wildcards.sample}} \\ 112 | --delimiter {conf.delimiter} \\ 113 | --processes {{threads}} 114 | """ 115 | 116 | 117 | rule CropImages: 118 | """ 119 | Crop axes and text from images to prepare for samplot-ml input 120 | """ 121 | threads: workflow.cores 122 | input: 123 | gargs_bin = gargs, 124 | imgs = rules.GenerateImages.output 125 | # imgs 
= functools.partial(get_images, 'GenerateImages') 126 | output: 127 | directory(f'{conf.outdir}/crop/{{sample}}') 128 | conda: 129 | 'envs/samplot.yaml' 130 | shell: 131 | f""" 132 | [[ ! -d {conf.outdir}/crop ]] && mkdir {conf.outdir}/crop 133 | bash scripts/crop.sh -i {{input.imgs}} \\ 134 | -o {{output}} \\ 135 | -p {{threads}} \\ 136 | -g {{input.gargs_bin}} 137 | """ 138 | 139 | rule CreateImageList: 140 | """ 141 | Samplot-ml needs list of input images. This rule takes the list 142 | of a sample's cropped images and puts them in a text file. 143 | """ 144 | input: 145 | #functools.partial(get_images, 'CropImages') 146 | rules.CropImages.output 147 | output: 148 | temp(f'{conf.outdir}/{{sample}}-cropped-imgs.txt') 149 | run: 150 | with open(output[0], 'w') as out: 151 | # for image_file in input: 152 | for image_file in glob(f'{conf.outdir}/crop/{wildcards.sample}/*.png'): 153 | out.write(f'{image_file}\n') 154 | 155 | 156 | rule PredictImages: 157 | """ 158 | Feed images into samplot-ml to get a bed file of predictions. 159 | Prediction format (tab separated): 160 | - chrm start end p_ref p_het p_alt 161 | """ 162 | threads: workflow.cores 163 | input: 164 | f'{conf.outdir}/{{sample}}-cropped-imgs.txt' 165 | output: 166 | f'{conf.outdir}/{{sample}}-predictions.bed' 167 | conda: 168 | 'envs/tensorflow.yaml' 169 | shell: 170 | """ 171 | python scripts/predict.py \\ 172 | --image-list {input} \\ 173 | --delimiter {conf.delimiter} \\ 174 | --processes {threads} \\ 175 | --batch-size {threads} \\ 176 | --model-path saved_models/samplot-ml.h5 \\ 177 | > {output} 178 | """ 179 | 180 | 181 | rule AnnotateVCF: 182 | input: 183 | vcf = conf.vcf.output, 184 | bed = f'{conf.outdir}/{{sample}}-predictions.bed' 185 | output: 186 | f'{conf.outdir}/samplot-ml-results/{{sample}}-samplot-ml.vcf.gz' 187 | conda: 188 | 'envs/samplot.yaml' 189 | shell: 190 | """ 191 | [[ ! 
-d {conf.outdir}/samplot-ml-results ]] && mkdir {conf.outdir}/samplot-ml-results 192 | bcftools view -s {wildcards.sample} {input.vcf} | 193 | python scripts/annotate.py {input.bed} {wildcards.sample} | 194 | bgzip -c > {output} 195 | """ 196 | 197 | 198 | 199 | 200 | 201 | -------------------------------------------------------------------------------- /workflows/saved_models/samplot-ml.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mchowdh200/samplot-ml/ec5dab5d19c094f03070fba25d3b8db780a4e5d3/workflows/saved_models/samplot-ml.h5 -------------------------------------------------------------------------------- /workflows/scripts/annotate.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import numpy as np 3 | import pysam 4 | 5 | bed = sys.argv[1] 6 | sample = sys.argv[2] 7 | 8 | predictions = {} 9 | genotypes = {0: (0, 0), 1: (0, 1), 2: (1, 1)} 10 | 11 | 12 | # go through the predictions bed file 13 | # and get the predictions keyed by region 14 | for l in open(bed, 'r'): 15 | A = l.rstrip().split() 16 | region = '\t'.join(A[:3]) 17 | predictions[region] = [float(x) for x in A[3:]] # prediction score 18 | 19 | # Pipe the vcf from stdin. 
20 | # iterate over each record and replace the genotype 21 | # with the predicted genotypes 22 | with pysam.VariantFile('-', 'rb') as VCF: 23 | # get the old genotype and prediction scores into the vcf FORMAT fields 24 | VCF.header.add_meta('FORMAT', items=[('ID', 'OLDGT'), 25 | ('Number', '1'), 26 | ('Type', 'String'), 27 | ('Description', 'Genotype before samplot-ml')]) 28 | VCF.header.add_meta('FORMAT', items=[('ID', 'PREF'), 29 | ('Number', '1'), 30 | ('Type', 'Float'), 31 | ('Description', 'Samplot-ml P(0/0) prediction score')]) 32 | VCF.header.add_meta('FORMAT', items=[('ID', 'PHET'), 33 | ('Number', '1'), 34 | ('Type', 'Float'), 35 | ('Description', 'Samplot-ml P(0/1) prediction score')]) 36 | VCF.header.add_meta('FORMAT', items=[('ID', 'PALT'), 37 | ('Number', '1'), 38 | ('Type', 'Float'), 39 | ('Description', 'Samplot-ml P(1/1) prediction score')]) 40 | print(str(VCF.header).strip()) 41 | for variant in VCF: 42 | region = '\t'.join([str(x) for x in [variant.contig, 43 | variant.pos, 44 | variant.stop]]) 45 | oldgt = variant.samples[sample].allele_indices 46 | variant.samples[sample]["OLDGT"] = f"{oldgt[0]}/{oldgt[1]}" 47 | if region in predictions: 48 | pref, phet, palt = predictions[region] 49 | variant.samples[sample]["PREF"] = pref 50 | variant.samples[sample]["PHET"] = phet 51 | variant.samples[sample]["PALT"] = palt 52 | 53 | # set new GT 54 | variant.samples[sample].allele_indices = genotypes[ 55 | np.argmax(predictions[region])] 56 | else: 57 | variant.samples[sample]["PREF"] = float("nan") 58 | variant.samples[sample]["PHET"] = float("nan") 59 | variant.samples[sample]["PALT"] = float("nan") 60 | print(str(variant).rstrip()) 61 | -------------------------------------------------------------------------------- /workflows/scripts/crop.sh: -------------------------------------------------------------------------------- 1 | #!/bin/env bash 2 | 3 | function crop 4 | { 5 | input_img=$1 6 | outdir=$2 7 | output_img=$outdir/$(basename $input_img) 8 | 9 | 
convert \ 10 | -crop 2090x575+175+200 \ 11 | -fill white \ 12 | -draw "rectangle 0,30 500,50" \ 13 | -draw "rectangle 600,0 700,50" \ 14 | $input_img $output_img 15 | echo $output_img 16 | } 17 | export -f crop 18 | 19 | while (( "$#" )); do 20 | case "$1" in 21 | -i|--imgdir) 22 | imgdir=$2 23 | shift 2;; 24 | -o|--outdir) 25 | outdir=$2 26 | shift 2;; 27 | -p|--processes) 28 | processes=$2 29 | shift 2;; 30 | -g|--gargs-bin) 31 | gargs_bin=$2 32 | shift 2;; 33 | --) # end argument parsing 34 | shift 35 | break;; 36 | -*|--*=) # unsupported flags 37 | echo "Error: Unsupported flag $1" >&2 38 | exit 1;; 39 | esac 40 | done 41 | 42 | mkdir -p $outdir 43 | find $imgdir -name '*.png' | $gargs_bin -p $processes "crop {0} $outdir" 44 | -------------------------------------------------------------------------------- /workflows/scripts/gen_img.sh: -------------------------------------------------------------------------------- 1 | #!/bin/env bash 2 | set -eu 3 | 4 | while (( "$#" )); do 5 | case "$1" in 6 | -c|--chrom) 7 | chrom=$2 8 | shift 2;; 9 | -s|--start) 10 | start=$2 11 | shift 2;; 12 | -e|--end) 13 | end=$2 14 | shift 2;; 15 | -n|--sample) 16 | sample=$2 17 | shift 2;; 18 | -g|--genotype) 19 | genotype=$2 20 | shift 2;; 21 | -m|--min-mqual) 22 | min_mq=$2 23 | shift 2;; 24 | -f|--fasta) 25 | fasta=$2 26 | shift 2;; 27 | -b|--bam) 28 | bam=$2 29 | shift 2;; 30 | -d|--delimiter) 31 | delimiter=$2 32 | shift 2;; 33 | -o|--outdir) 34 | outdir=$2 35 | shift 2;; 36 | --) # end argument parsing 37 | shift 38 | break;; 39 | -*|--*=) # unsupported flags 40 | echo "Error: Unsupported flag $1" >&2 41 | exit 1;; 42 | esac 43 | done 44 | 45 | # out=$outdir/${chrom}_${end}_${sample}_${genotype}.png 46 | out=$outdir/$(echo "$chrom $start $end $sample $genotype.png" | tr ' ' $delimiter) 47 | echo $out 48 | svlen=$(($end-$start)) 49 | # window=$(python -c "print(int($svlen * 0.5))") 50 | 51 | if [[ $svlen -gt 5000 ]]; then 52 | samplot.py \ 53 | --zoom 1000 \ 54 | --chrom $chrom 
--start $start --end $end \ 55 | --min_mqual $min_mq \ 56 | --sv_type DEL \ 57 | --bams $bam \ 58 | --reference $fasta \ 59 | --output_file $out 60 | else 61 | samplot.py \ 62 | --chrom $chrom --start $start --end $end \ 63 | --min_mqual $min_mq \ 64 | --sv_type DEL \ 65 | --bams $bam \ 66 | --reference $fasta \ 67 | --output_file $out 68 | fi 69 | -------------------------------------------------------------------------------- /workflows/scripts/get_del_regions.sh: -------------------------------------------------------------------------------- 1 | #!/bin/env bash 2 | # given: 3 | # * input vcf 4 | # * sample name 5 | # output: 6 | # * sample's DEL regions in bed format to stdout 7 | 8 | # 1. get sample's part of vcf 9 | # 2. get SVTYPE = DEL (only het/alt genotypes) 10 | # 3. bcftools query into bed format with format: 11 | # %CHROM\t%POS\t%INFO/END\t%SVTYPE\t[%SAMPLE]\n 12 | 13 | vcf=$1 14 | sample=$2 15 | 16 | # note use single & for within sample logic 17 | bcftools view -s $sample -i 'SVTYPE="DEL"' $vcf | 18 | bcftools query -i 'GT!="0/0" & GT!="./."' \ 19 | -f '%CHROM\t%POS\t%INFO/END\t%SVTYPE\t[%SAMPLE]\n' 20 | -------------------------------------------------------------------------------- /workflows/scripts/images_from_regions.sh: -------------------------------------------------------------------------------- 1 | #!/bin/env bash 2 | # given a set of bed regions, reference fasta, and bam/cram 3 | # this script will output the regions to gargs which will generate 4 | # individual samplot images using gen_img.sh 5 | set -eu 6 | while (( "$#" )); do 7 | case "$1" in 8 | -g|--gargs-bin) 9 | gargs_bin=$2 10 | shift 2;; 11 | -f|--fasta) 12 | fasta=$2 13 | shift 2;; 14 | -r|--regions) 15 | regions=$2 16 | shift 2;; 17 | -b|--bam) 18 | bam=$2 19 | shift 2;; 20 | -d|--delimiter) # delimiter between chrm, start, end, etc in filename 21 | delimiter=$2 22 | shift 2;; 23 | -o|--outdir) 24 | outdir=$2 25 | shift 2;; 26 | -p|--processes) 27 | processes=$2 28 | shift 2;; 
29 | --) # end argument parsing 30 | shift 31 | break;; 32 | -*|--*=) # unsupported flags 33 | echo "Error: Unsupported flag $1" >&2 34 | exit 1;; 35 | esac 36 | done 37 | 38 | [[ ! -d $outdir ]] && mkdir $outdir 39 | 40 | # format of regions bed is: 41 | # chrom start end svtype sample 42 | echo $PWD 43 | cat $regions | $gargs_bin -e -p $processes "bash scripts/gen_img.sh \\ 44 | --chrom {0} --start {1} --end {2} --genotype {3} --sample {4} \\ 45 | --min-mqual 10 --fasta $fasta --bam $bam \\ 46 | --delimiter $delimiter --outdir $outdir" 47 | -------------------------------------------------------------------------------- /workflows/scripts/install_gargs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/env bash 2 | gargs=$1 3 | install_dir=$(dirname $gargs) 4 | [[ ! -d $install_dir ]] && mkdir $install_dir 5 | case "$(uname -s)" in 6 | Darwin) 7 | wget https://github.com/brentp/gargs/releases/download/v0.3.9/gargs_darwin \ 8 | -O $gargs 9 | ;; 10 | Linux) 11 | wget https://github.com/brentp/gargs/releases/download/v0.3.9/gargs_linux \ 12 | -O $gargs 13 | ;; 14 | esac 15 | chmod +x $gargs 16 | -------------------------------------------------------------------------------- /workflows/scripts/predict.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import numpy as np 4 | import tensorflow as tf 5 | import tensorflow_addons as tfa 6 | 7 | # data loading, CNN models 8 | from utils import datasets, models 9 | 10 | def main(args): 11 | model = tf.keras.models.load_model(args.model_path) 12 | dataset = datasets.DataWriter.get_basic_dataset( 13 | args.image_list, args.processes) 14 | dataset = dataset.batch(args.batch_size, drop_remainder=False) 15 | dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE) 16 | 17 | for filenames, images in dataset: 18 | predictions = model(images) 19 | for file, pred in zip(filenames, predictions): 20 | f = 
os.path.splitext(os.path.basename(file.numpy()))[0] 21 | region = f.decode().split(args.delimiter) 22 | print(*region[:3], sep='\t', end='\t') 23 | print(*pred.numpy(), sep='\t') 24 | 25 | if __name__ == '__main__': 26 | parser = argparse.ArgumentParser() 27 | parser.add_argument( 28 | '--model-path', dest='model_path', type=str, required=True, 29 | help='Path of trained model') 30 | parser.add_argument( 31 | '--image-list', dest='image_list', type=str, required=True, 32 | help='list of image file paths.') 33 | parser.add_argument( 34 | '--processes', dest='processes', type=int, required=True, 35 | help='number of simultaneous processes.') 36 | parser.add_argument( 37 | '--batch-size', dest='batch_size', type=int, required=True, 38 | help='number of images per batch.') 39 | parser.add_argument( 40 | '--delimiter', dest='delimiter', type=str, required=True, 41 | help='delimiter within image file name.') 42 | args = parser.parse_args() 43 | main(args) 44 | -------------------------------------------------------------------------------- /workflows/scripts/test.sh: -------------------------------------------------------------------------------- 1 | x=1 2 | y=2 3 | 4 | svlen=$((y-x)) 5 | echo $svlen 6 | -------------------------------------------------------------------------------- /workflows/scripts/utils/datasets.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import functools 4 | from itertools import zip_longest, takewhile 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | from joblib import Parallel, delayed 9 | 10 | 11 | # Original images are 2090 x 575 12 | ORIG_SHAPE = [575, 2090, 3] 13 | 14 | # we down scale each dimension by constant factor 15 | SCALE_FACTOR = 8 16 | IMAGE_SHAPE = np.array([np.ceil(ORIG_SHAPE[0]/SCALE_FACTOR).astype(int), 17 | np.ceil(ORIG_SHAPE[1]/SCALE_FACTOR).astype(int), 18 | 3]) 19 | 20 | class DataWriter: 21 | def __init__(self, data_list, out_dir, training, 
num_classes=3): 22 | self.out_dir = out_dir 23 | self.training = training 24 | self.num_classes = num_classes 25 | self.filenames = [fname.rstrip() for fname in open(data_list)] 26 | self.labels = DataWriter._get_labels(self.filenames, num_classes) 27 | assert len(self.filenames) == len(self.labels) 28 | 29 | @staticmethod 30 | def _grouper(iterable, n, fillvalue=None): 31 | """ 32 | Collect data into fixed-length chunks or blocks. 33 | Taken from the Python itertools docs. 34 | """ 35 | # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx 36 | args = [iter(iterable)] * n 37 | return zip_longest(fillvalue=fillvalue, *args) 38 | 39 | @staticmethod 40 | def _int64_feature(value): 41 | if not isinstance(value, list): 42 | value = [value] 43 | return tf.train.Feature(int64_list=tf.train.Int64List(value=value)) 44 | 45 | @staticmethod 46 | def _bytes_feature(value): 47 | # If the value is an eager tensor BytesList 48 | # won't unpack a string from an EagerTensor. 49 | if isinstance(value, tf.Tensor): 50 | value = value.numpy() 51 | elif not isinstance(value, bytes): 52 | value = value.encode() 53 | return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value])) 54 | 55 | @staticmethod 56 | def _get_labels(filenames, num_classes=3): 57 | """ 58 | ## TODO incorporate variable delimiter instead of _ 59 | Filenames are in the following format: 60 | ____.png 61 | We will use the genotypes as the labels. 62 | Additionally, we apply label smoothing to the categorical labels 63 | controlled by the parameter eps. 
64 | """ 65 | labels = [f.split('_')[-1].split('.')[0].lower() for f in filenames] 66 | 67 | if num_classes == 3: 68 | label_to_index = { 69 | 'ref': 0, 'ref-tn': 0, 'ref-fp': 0, 70 | 'het': 1, 'alt': 2 71 | } 72 | elif num_classes == 4: 73 | label_to_index = { 74 | 'ref-tn': 0, 'ref-fp': 1, 75 | 'het': 2, 'alt': 3 76 | } 77 | else: 78 | label_to_index = {'ref': 0, 'del': 1} 79 | return [label_to_index[l] for l in labels] 80 | 81 | @staticmethod 82 | def _load_image(path): 83 | """ 84 | Used to load image from provided filepath for use with a tensorflow dataset. 85 | Most images we have are 3 channel, but there are some that are 1/4 channels, 86 | so we just make all 3 channel then normalize to 0-1 range. 87 | """ 88 | image = tf.io.read_file(path) 89 | image = tf.image.decode_png(image, channels=3) 90 | image = tf.image.convert_image_dtype(image, tf.float32) 91 | image = tf.image.resize(image, IMAGE_SHAPE[:2]) 92 | return image 93 | 94 | @staticmethod 95 | def get_basic_dataset(image_list, num_processes): 96 | """ 97 | return a tensorflow dataset consisting of just filenames 98 | and images (no labels) 99 | """ 100 | filename_ds = tf.data.Dataset.from_tensor_slices( 101 | [filename.rstrip() for filename in open(image_list, 'r')]) 102 | image_ds = filename_ds.map( 103 | DataWriter._load_image, 104 | num_parallel_calls=num_processes).map( 105 | tf.image.per_image_standardization, 106 | num_parallel_calls=num_processes 107 | ) 108 | 109 | return tf.data.Dataset.zip((filename_ds, image_ds)) 110 | 111 | 112 | @staticmethod 113 | def _serialize_example(filename, label): 114 | """ 115 | given a filepath to an image (label contained in filename), 116 | serialize the (image, label) 117 | """ 118 | example = tf.train.Example( 119 | features=tf.train.Features( 120 | feature={ 121 | 'filename': DataWriter._bytes_feature(filename), 122 | 'image': DataWriter._bytes_feature( 123 | tf.io.serialize_tensor(DataWriter._load_image(filename))), 124 | 'label': 
DataWriter._int64_feature(label), 125 | } 126 | ) 127 | ) 128 | return example.SerializeToString() 129 | 130 | @staticmethod 131 | def _write_batch(out_dir, batch_index, file_label_pairs, training): 132 | """ 133 | Write a single batch of images to TFRecord format 134 | """ 135 | with tf.io.TFRecordWriter( 136 | f"{out_dir}/{training}/{training}_{batch_index:05d}.tfrec") as writer: 137 | for file_label in file_label_pairs: 138 | filename, label = file_label 139 | serialized_example = DataWriter._serialize_example(filename, label) 140 | writer.write(serialized_example) 141 | 142 | def to_tfrecords(self, imgs_per_record=1000): 143 | """ 144 | Write train and/or val set to a set of TFRecords 145 | """ 146 | Parallel(n_jobs=-1)( 147 | delayed(self._write_batch)(self.out_dir, 148 | batch_index=i, 149 | file_label_pairs=takewhile( 150 | lambda x: x is not None, 151 | file_label_pairs), 152 | training=self.training) 153 | for i, file_label_pairs in enumerate( 154 | DataWriter._grouper(zip(self.filenames, self.labels), imgs_per_record)) 155 | ) 156 | 157 | 158 | class DataReader: 159 | def __init__(self, 160 | data_list, # list of original images in the dataset 161 | tfrec_list, # list of tfrecords in the dataset (s3 or local) 162 | num_processes, 163 | batch_size): 164 | 165 | self.data_list = data_list 166 | self.tfrec_list = tfrec_list 167 | self.num_processes = num_processes 168 | self.batch_size = batch_size 169 | 170 | @staticmethod 171 | def _parse_image(x): 172 | """ 173 | Used to reformat serialized images to original shape 174 | """ 175 | result = tf.io.parse_tensor(x, out_type=tf.float32) 176 | result = tf.reshape(result, IMAGE_SHAPE) 177 | return result 178 | 179 | @staticmethod 180 | def _parse_serialized_example(serialized_example): 181 | """ 182 | Given a serialized example with the below format, extract/deserialize 183 | the image and (one-hot) label 184 | """ 185 | features = { 186 | 'filename': tf.io.FixedLenFeature((), tf.string), 187 | 'image': 
tf.io.FixedLenFeature((), tf.string), 188 | 'label': tf.io.FixedLenFeature((), tf.int64) 189 | } 190 | example = tf.io.parse_single_example(serialized_example, features) 191 | 192 | # read image and perform transformations 193 | image = DataReader._parse_image(example['image']) 194 | image = tf.image.per_image_standardization(image) 195 | 196 | label = example['label'] 197 | return image, tf.one_hot(label, depth=3, dtype=tf.int64) 198 | 199 | def get_dataset(self): 200 | n_images = len(open(self.data_list).readlines()) 201 | 202 | # we don't need examples to be loaded in order (better speed) 203 | options = tf.data.Options() 204 | options.experimental_deterministic = False 205 | 206 | with open(self.tfrec_list) as f: 207 | files = [filename.rstrip() for filename in f] 208 | dataset = tf.data.Dataset.from_tensor_slices(files) \ 209 | .shuffle(len(files)) \ 210 | .with_options(options) 211 | 212 | 213 | dataset = dataset.interleave( 214 | tf.data.TFRecordDataset, 215 | cycle_length=self.num_processes, 216 | num_parallel_calls=self.num_processes) 217 | 218 | dataset = dataset.map( 219 | functools.partial(DataReader._parse_serialized_example), 220 | num_parallel_calls=self.num_processes) \ 221 | .repeat() \ 222 | .shuffle(buffer_size=1000) \ 223 | .batch(self.batch_size, drop_remainder=False) \ 224 | .prefetch(buffer_size=tf.data.experimental.AUTOTUNE) 225 | 226 | return dataset, n_images 227 | -------------------------------------------------------------------------------- /workflows/scripts/utils/models.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import tensorflow_addons as tfa 3 | 4 | class Conv2DBlock: 5 | """ 6 | Composition of 2D convolutions 7 | * Why isn't this a keras Layer? 
I found that it just makes 8 | whole model saving + loading a real pain so I just wrote this 9 | as if it was a keras Layer but did not inherit from Layer 10 | """ 11 | def __init__(self, n_channels=32, n_layers=1, kernel_regularizer=None, 12 | kernel_size=(3, 3), dilation_rate=(1, 1)): 13 | # layer properties 14 | self.n_channels = n_channels 15 | self.n_layers = n_layers 16 | self.kernel_regularizer = kernel_regularizer 17 | self.kernel_size = kernel_size 18 | self.dilation_rate = dilation_rate 19 | 20 | # sub layers 21 | self.conv_layers = [ 22 | tf.keras.layers.Conv2D( 23 | filters=self.n_channels, kernel_size=self.kernel_size, 24 | dilation_rate=self.dilation_rate, 25 | kernel_regularizer=self.kernel_regularizer, padding='same') 26 | for i in range(self.n_layers)] 27 | self.bnorm_layers = [ 28 | tf.keras.layers.BatchNormalization() 29 | for i in range(self.n_layers)] 30 | self.leaky_relu_layers = [ 31 | # tf.keras.layers.LeakyReLU() 32 | tfa.layers.GeLU() 33 | for i in range(self.n_layers)] 34 | 35 | def __call__(self, x): 36 | for conv, bnorm, leaky_relu in zip( 37 | self.conv_layers, self.bnorm_layers, self.leaky_relu_layers): 38 | x = conv(x) 39 | x = bnorm(x) 40 | x = leaky_relu(x) 41 | return x 42 | 43 | 44 | class ResidualBlock(Conv2DBlock): 45 | def __init__(self, **kwargs): 46 | super().__init__(**kwargs) 47 | self.add = tf.keras.layers.Add() 48 | # self.leaky_relu_out = tf.keras.layers.LeakyReLU() 49 | self.leaky_relu_out = tfa.layers.GeLU() 50 | 51 | def __call__(self, x): 52 | temp = x 53 | x = super().__call__(x) 54 | x = self.add([temp, x]) 55 | return self.leaky_relu_out(x) 56 | 57 | 58 | def CNN(num_classes=3): 59 | """ 60 | Construct and return an (uncompiled) conv2d model out of Conv2DBlocks. 
61 | """ 62 | inp = tf.keras.Input(shape=(None, None, 3)) 63 | x = inp 64 | x = tf.keras.layers.Conv2D( 65 | filters=32, kernel_size=(7, 7), strides=(1, 1), 66 | dilation_rate=(2, 2), padding='valid')(x) 67 | x = tf.keras.layers.BatchNormalization()(x) 68 | # x = tf.keras.layers.LeakyReLU()(x) 69 | x = tfa.layers.GeLU()(x) 70 | x = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2))(x) 71 | 72 | for i in range(4): 73 | x = Conv2DBlock( 74 | n_channels=32*(i+1), n_layers=1, kernel_size=(1, 1))(x) 75 | x = ResidualBlock( 76 | n_channels=32*(i+1), n_layers=3, kernel_size=(3, 3))(x) 77 | x = ResidualBlock( 78 | n_channels=32*(i+1), n_layers=3, kernel_size=(3, 3))(x) 79 | x = ResidualBlock( 80 | n_channels=32*(i+1), n_layers=3, kernel_size=(3, 3))(x) 81 | x = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2))(x) 82 | 83 | x = tf.keras.layers.GlobalAveragePooling2D()(x) 84 | x = tf.keras.layers.Dense(1024)(x) 85 | # x = tf.keras.layers.LeakyReLU()(x) 86 | x = tfa.layers.GeLU()(x) 87 | x = tf.keras.layers.Dropout(0.5)(x) 88 | 89 | x = tf.keras.layers.Dense(num_classes)(x) 90 | out = tf.keras.layers.Softmax()(x) 91 | return tf.keras.Model(inputs=inp, outputs=out) 92 | --------------------------------------------------------------------------------
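The plain-Python parts of `utils/datasets.py` above can be tried without TensorFlow. The sketch below is an illustration only, not part of the repository: it recomputes the downscaled image shape from `ORIG_SHAPE` and `SCALE_FACTOR`, repeats the 3-class filename-to-label lookup from `DataWriter._get_labels`, and shows the chunking that `_grouper` plus `takewhile` perform in `to_tfrecords`. The function names `downscaled_shape`, `labels_from_filenames`, and `chunks`, the `delimiter` parameter, and the example filenames are hypothetical and were chosen for this sketch.

```python
import math
from itertools import takewhile, zip_longest

# Constants copied from utils/datasets.py.
ORIG_SHAPE = (575, 2090, 3)
SCALE_FACTOR = 8

# 3-class mapping copied from DataWriter._get_labels.
LABEL_TO_INDEX = {'ref': 0, 'ref-tn': 0, 'ref-fp': 0, 'het': 1, 'alt': 2}


def downscaled_shape(orig_shape=ORIG_SHAPE, factor=SCALE_FACTOR):
    """Height and width are divided by `factor` and rounded up."""
    height, width, channels = orig_shape
    return (math.ceil(height / factor), math.ceil(width / factor), channels)


def labels_from_filenames(filenames, delimiter='_'):
    """Use the last delimiter-separated field (minus extension) as the label.

    The `delimiter` argument is an assumption of this sketch; the original
    helper hard-codes '_'.
    """
    labels = [f.split(delimiter)[-1].split('.')[0].lower() for f in filenames]
    return [LABEL_TO_INDEX[label] for label in labels]


def chunks(items, n):
    """Yield lists of at most n items, dropping the None padding
    that zip_longest adds to the final group."""
    args = [iter(items)] * n
    for group in zip_longest(*args, fillvalue=None):
        yield list(takewhile(lambda x: x is not None, group))


if __name__ == '__main__':
    # Hypothetical filenames; only the last field matters for the label.
    names = ['chr1_100_200_sampleA_REF.png',
             'chr1_300_400_sampleA_HET.png',
             'chr2_500_600_sampleB_ALT.png']
    print(downscaled_shape())
    print(labels_from_filenames(names))
    print(list(chunks(names, 2)))
```

Note that the padding trick relies on no real item being `None`, which holds for the `(filename, label)` tuples passed in `to_tfrecords`.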