├── .gitignore
├── LICENSE
├── README.md
├── figures
    └── chr1-934098-934868-HG03687-DEL.png
└── workflows
    ├── conf
        ├── create_config.sh
        ├── samplot-ml-predict.yaml
        └── test.txt
    ├── config_utils.py
    ├── envs
        ├── samplot.yaml
        └── tensorflow.yaml
    ├── samplot-ml-predict.smk
    ├── saved_models
        └── samplot-ml.h5
    └── scripts
        ├── annotate.py
        ├── crop.sh
        ├── gen_img.sh
        ├── get_del_regions.sh
        ├── images_from_regions.sh
        ├── install_gargs.sh
        ├── predict.py
        ├── test.sh
        └── utils
            ├── datasets.py
            └── models.py


/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | *.swp
3 | __pycache__
4 | .pylintrc
5 | .snakemake
6 | *.org
7 | *.backup
8 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2021 Murad Chowdhury
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Samplot-ML
  2 | [![DOI](https://zenodo.org/badge/191653284.svg)](https://zenodo.org/badge/latestdoi/191653284)
  3 | 
  4 | [Samplot](https://github.com/ryanlayer/samplot) is a command line tool for rapid, multi-sample structural variant visualization. samplot takes SV coordinates and bam files and produces high-quality images that highlight any alignment and depth signals that substantiate the SV.
  5 | 
  6 | Samplot-ML is a convolutional neural network trained to identify false positive deletion SVs using Samplot images. The workflow for Samplot-ML is simple: given a whole-genome sequenced sample (BAM or CRAM) as well as a set of putative deletions (VCF), Samplot-ML re-genotypes each putative deletion using the Samplot-generated image. The result is a call set where most false positives are flagged.
  7 | 
  8 | This repository provides a snakemake workflow for annotating an SV callset with Samplot-ML's predictions.
  9 | 
 10 | To cite Samplot-ML, please use
 11 | >Belyeu, J.R., Chowdhury, M., Brown, J. et al. Samplot: a platform for structural variant visual validation and automated filtering. Genome Biol 22, 161 (2021). https://doi.org/10.1186/s13059-021-02380-5
 12 | 
 13 | ## Dependencies
 14 | * `bz2` and `zlib` devel libraries.  On Ubuntu systems, you can use: 
 15 | ```
 16 | $ sudo apt install libbz2-dev zlib1g-dev
 17 | ```
 18 | * `conda` - for a minimal install of conda, check out [miniconda](https://docs.conda.io/en/latest/miniconda.html)	
 19 | 	
 20 | * `mamba` - [drop-in replacement](https://github.com/mamba-org/mamba) for conda's package manager.  Used to install `snakemake`
 21 | ```
 22 | $ conda install -n base -c conda-forge mamba
 23 | ```
 24 | 
 25 | * `snakemake` 
 26 | ```
 27 | $ mamba create -c conda-forge -c bioconda -n snakemake snakemake
 28 | ```
 29 | 
 30 | * `aws cli` (optional) - If you plan on using data sources from s3 buckets in the samplot-ml workflow.
 31 | 
 32 | * Any other dependencies will be handled  by snakemake through conda. 
 33 | 
 34 | ## Example Usage
 35 | To demonstrate how to use Samplot-ML, let's work through a simple example that takes us from calling SVs to executing the Samplot-ML snakefile using some data from the 1000 Genomes Project (1kg).
 36 | 
 37 | ### Calling SVs with [smoove](https://github.com/brentp/smoove)
 38 | 1. Download a CRAM file from 1kg's ftp
 39 | ```
 40 | $ wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3242876/HG03687.final.cram
 41 | $ wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3242876/HG03687.final.cram.crai
 42 | ```
 43 | 2. Get the reference genome
 44 | ```
 45 | $ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
 46 | $ wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
 47 | ```
 48 | 
 49 | 3. Get a set of exclude regions in bed format for use during SV calling
 50 | ```
 51 | $ wget https://raw.githubusercontent.com/hall-lab/speedseq/master/annotations/exclude.cnvnator_100bp.GRCh38.20170403.bed
 52 | ```
 53 | 
 54 | 4. Install smoove with conda
 55 | ```
 56 | $ conda create -c bioconda -n smoove smoove
 57 | ```
 58 | 5. Call SVs
 59 | ```
 60 | $ conda activate smoove
 61 | $ smoove call -x \
 62 | 	--name HG03687 \
 63 | 	--exclude $exclude_bed \
 64 | 	--fasta $reference \
 65 | 	--genotype \
 66 | 	--outdir $outdir \
 67 | 	$cram
 68 | ```
 69 | The resulting vcf will be `$outdir/HG03687-smoove.genotyped.vcf.gz`.
 70 | 
 71 | 6. Clone the Samplot-ML git repo
 72 | 
 73 | 7. Next, let's edit the config file located at `samplot-ml/workflows/conf/samplot-ml-predict.yaml`
 74 | ```
 75 | samples:
 76 |   HG03687: "/path/to/cram" # or you can use "s3://bucket/bam_or_cram" if you've got alignments in an s3 bucket.
 77 | fasta:
 78 |   data_source: "local" # either local or s3
 79 |   file: "/path/to/reference" # or "s3://bucket/reference_file"
 80 | fai:
 81 |   data_source: "local"
 82 |   file: "/path/to/reference_index"
 83 | vcf:
 84 |   data_source: "local"
 85 |   file: "/path/to/vcf"
 86 | 
 87 | # generated images will have filename: ${contig}-${start}-${end}-${sample}-DEL.png
 88 | # we give the choice of delimiter since contigs can sometimes contain
 89 | # characters like hyphens, underscores, etc.
 90 | image_filename_delimiter: "-"
 91 | outdir: "/path/to/output_directory"
 92 | ```
 93 | 
 94 | 8. From the `samplot-ml/workflows` directory, run the prediction snakefile `samplot-ml-predict.smk` (the snakefile refers to `conf/` and `scripts/` by relative path)
 95 | ```
 96 | conda activate snakemake
 97 | # -j: number of threads used to run jobs in parallel; --use-conda: let snakemake handle dependencies
 98 | snakemake -s samplot-ml-predict.smk -j $num_threads \
 99 |           --use-conda --conda-frontend mamba
100 | ```
101 | 
102 | Generated images of DEL regions from the input VCF will be located in `$outdir/img/${sample}/`. Each image will be named `${contig}-${start}-${end}-${sample}-DEL.png`. An annotated vcf containing the Samplot-ML predictions will be located at `$outdir/samplot-ml-results/HG03687-samplot-ml.vcf.gz`.
103 | 
104 | ### VCF annotations
105 | The resulting prediction vcf will contain the following format fields:
106 | * PREF, PHET, and PALT: the prediction score assigned by the model which corresponds to a prediction of a 0/0, 0/1, or 1/1 genotype, respectively.  If the region in the input vcf was originally a 0/0, then these fields will contain 'nan' values
107 | * OLDGT: The original genotype of the region from the input SV callset.
108 | * If the predicted genotype differed from the input genotype, then the model will replace the GT field with the predicted genotype.
109 | 
110 | ### Back to our example
111 | Now that we've got our annotated VCF, let's inspect one of the predictions and compare it with the samplot image.
112 | ```
113 | # get the first DEL region from the vcf.  Print out the filename of the
114 | # samplot image, the original genotype and the samplot-ml predicted genotype.
115 | $ bcftools query -i 'SVTYPE="DEL"' -f '%CHROM-%POS-%INFO/END-[%SAMPLE]-DEL.png\t[%OLDGT\t%GT]\n' HG03687-samplot-ml.vcf.gz | head -1
116 | 
117 | output:
118 | chr1-934098-934868-HG03687-DEL.png	0/1	1/1
119 | ```
120 | 
121 | It seems that in the very first deletion we came across, there was a difference between SVTyper's prediction and Samplot-ML's prediction.  We went from a heterozygous deletion (0/1) to a homozygous alternate deletion (1/1).  Let's take a look at the image in question.
122 | ![samplot-image](https://github.com/mchowdh200/samplot-ml/raw/master/figures/chr1-934098-934868-HG03687-DEL.png) 
123 | 
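
The image naming convention described in the README (fields joined by `image_filename_delimiter`) can be inverted to recover the region and sample from a filename. The following is a minimal sketch, not part of the repository: `ImageName` and `parse_image_name` are hypothetical names, and it assumes the field order `contig, start, end, sample, svtype` used by `scripts/gen_img.sh`. It splits from the right so that a delimiter inside the contig name is preserved; a sample name containing the delimiter would still break it.

```python
from typing import NamedTuple


class ImageName(NamedTuple):
    """Fields encoded in a generated image filename (hypothetical helper type)."""
    contig: str
    start: int
    end: int
    sample: str
    svtype: str


def parse_image_name(filename: str, delimiter: str = "-") -> ImageName:
    """Split '<contig><d><start><d><end><d><sample><d><svtype>.png' into fields."""
    # Drop any leading directory and the .png extension.
    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".png"):
        stem = stem[: -len(".png")]
    # Split from the right: the last four fields are fixed, so whatever
    # remains on the left is the contig, even if it contains the delimiter.
    contig, start, end, sample, svtype = stem.rsplit(delimiter, 4)
    return ImageName(contig, int(start), int(end), sample, svtype)


if __name__ == "__main__":
    print(parse_image_name("chr1-934098-934868-HG03687-DEL.png"))
```

Choosing a delimiter that appears in neither contig nor sample names avoids the ambiguity entirely, which is why the config exposes it.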


--------------------------------------------------------------------------------
/figures/chr1-934098-934868-HG03687-DEL.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mchowdh200/samplot-ml/ec5dab5d19c094f03070fba25d3b8db780a4e5d3/figures/chr1-934098-934868-HG03687-DEL.png


--------------------------------------------------------------------------------
/workflows/conf/create_config.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | set -eu
 3 | 
 4 | while (( "$#" )); do
 5 |     case "$1" in
 6 |         -b|--bam-bucket)
 7 |             bam_bucket=$2
 8 |             shift 2;;
 9 |         -f|--fasta)
10 |             fasta=$2
11 |             shift 2;;
12 |         -i|--fai)
13 |             fai=$2
14 |             shift 2;;
15 |         -v|--vcf)
16 |             vcf=$2
17 |             shift 2;;
18 |         -d|--delimiter)
19 |             delimiter=$2
20 |             shift 2;;
21 |         -o|--outdir)
22 |             outdir=$2
23 |             shift 2;;
24 |         --) # end argument parsing
25 |             shift
26 |             break;;
27 |         -*|--*=) # unsupported flags
28 |             echo "Error: Unsupported flag $1" >&2
29 |             exit 1;;
30 |     esac
31 | done
32 | 
33 | echo "samples:"
34 | 
35 | # filenames
36 | bams=$(aws s3 ls $bam_bucket/ | grep -E '\.bam$' | sed -E 's/\s+/\t/g' | cut -f4)
37 | # OR do the same thing with a local directory of bams
38 | # TODO for now I'm just going to download bams locally and create this config
39 | # bams=$(ls $bam_bucket | grep -E '\.bam$')
40 | 
41 | ### Sample bam pairs
42 | # get the sample name from the bam header
43 | for bam in $bams; do
44 |     sample=$(samtools view -H $bam_bucket/$bam | grep SM -m1 |
45 |              awk 'BEGIN{RS="\t"; FS=":"} /SM/ {print $2}')
46 |     printf '  %s: "%s"\n' "$sample" "$bam_bucket/$bam"
47 | done
48 | 
49 | ### reference fasta and index
50 | ### TODO fix the s3 or local choice
51 | echo "fasta:"
52 | printf "  data_source: \"s3\"\n"
53 | printf "  file: \"$fasta\"\n"
54 | 
55 | echo "fai:"
56 | printf "  data_source: \"s3\"\n"
57 | printf "  file: \"$fai\"\n"
58 | 
59 | echo "vcf:"
60 | printf "  data_source: \"s3\"\n"
61 | printf "  file: \"$vcf\"\n"
62 | 
63 | echo "image_filename_delimiter: \"$delimiter\""
64 | echo "outdir: \"$outdir\""
65 | 


--------------------------------------------------------------------------------
/workflows/conf/samplot-ml-predict.yaml:
--------------------------------------------------------------------------------
 1 | samples:
 2 |   # TODO script to generate this config:
 3 |   #    * Especially if there are going to be a lot of samples
 4 |   # eg SAMPLE: "PATH/TO/BAM_OR_CRAM"
 5 |   # or SAMPLE: "s3://BUCKET/BAM_OR_CRAM"
 6 |   # or SAMPLE: "http://URL/BAM_OR_CRAM"
 7 |   Yanbian: "s3://layerlabcu/cow/bams/Yanbian_merged_marked.bam"
 8 | 
 9 | fasta:
10 |   data_source: "s3" # local or s3
11 |   file: "s3://layerlabcu/cow/ARS-UCD1.2_Btau5.0.1Y.prepend_chr.fa"
12 | fai:
13 |   data_source: "s3" # local or s3
14 |   file: "s3://layerlabcu/cow/ARS-UCD1.2_Btau5.0.1Y.prepend_chr.fa.fai"
15 | 
16 | vcf:
17 |   data_source: "s3" # local or s3
18 |   file: "s3://layerlabcu/cow/VCF/sites.smoove.square.vcf.gz"
19 | 
20 | image_filename_delimiter: "-"
21 | outdir: "/home/murad/Repositories/samplot-ml/workflows/temp"
22 | 


--------------------------------------------------------------------------------
/workflows/conf/test.txt:
--------------------------------------------------------------------------------
1 | abcdefg
2 | 


--------------------------------------------------------------------------------
/workflows/config_utils.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | from abc import ABC, abstractmethod
 3 | 
 4 | class Data(ABC):
 5 |     """
 6 |     Abstract class that describes a simple interface for how local/remote
 7 |     data will be obtained in snakemake rules.
 8 |     """
 9 |     def __init__(self, input, output):
10 |         self.input = input
11 |         self.output = output
12 | 
13 |     @abstractmethod
14 |     def get_cmd(self) -> str:
15 |         """
16 |         returns a shell directive action (eg. downloading, or creating symlink)
17 |         """
18 |         pass
19 | 
20 | 
21 | class LocalData(Data):
22 |     def __init__(self, input, output):
23 |         super().__init__(input, output)
24 | 
25 |     def get_cmd(self) -> str:
26 |         """
27 |         returns symlink command to be run in snakemake shell directive
28 |         """
29 |         return f"ln -sr {self.input} {self.output}"
30 | 
31 | 
32 | class S3Data(Data):
33 |     def __init__(self, input, output):
34 |         super().__init__(input, output)
35 | 
36 |     def get_cmd(self) -> str:
37 |         """
38 |         returns aws s3 copy command to be run in snakemake shell directive
39 |         """
40 |         return f"aws s3 cp {self.input} {self.output}"
41 | 
42 | 
43 | def data_factory(config, data_type):
44 |     """
45 |     Create a Data object whose behavior is determined by the config.
46 |     INPUTS:
47 |         * config - snakemake config dictionary
48 |         * data_type - type of data file (eg 'fasta' or 'vcf')
49 |     """
50 |     data_source = config[data_type]['data_source']
51 |     input_file = config[data_type]['file']
52 |     outdir = config['outdir']
53 |     output_file = f'{outdir}/{os.path.basename(input_file)}'
54 | 
55 |     if data_source == 's3':
56 |         return S3Data(input_file, output_file)
57 |     elif data_source == 'local':
58 |         return LocalData(input_file, output_file)
59 |     else:
60 |         raise ValueError(
61 |             f'Unknown data_source: "{data_source}".'
62 |             ' data_source must be either "s3" or "local".')
63 | 
64 |         
65 | class Conf:
66 |     """
67 |     Use the snakemake config dict to set variables and configure the Data
68 |     objects which will control the behavior of data acquisition rules.
69 |     """
70 |     def __init__(self, config):
71 |         self.samples = list(config['samples']) # list of sample names
72 |         self.alignments = config['samples'] # dict of bam/cram indexed by sample
73 |         self.outdir = config['outdir']
74 |         self.delimiter = config['image_filename_delimiter']
75 |         self.fasta = data_factory(config, 'fasta')
76 |         self.fai = data_factory(config, 'fai')
77 |         self.vcf = data_factory(config, 'vcf')
78 | 
79 | 
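
`data_factory` raises only when it meets an unknown `data_source`, and a missing key surfaces as a bare `KeyError`. A minimal sketch of an up-front check is shown below. It is not part of the repository: `find_config_problems` is a hypothetical helper, and the key names are taken from `conf/samplot-ml-predict.yaml`.

```python
REQUIRED_DATA_KEYS = ("fasta", "fai", "vcf")
VALID_SOURCES = ("s3", "local")


def find_config_problems(config: dict) -> list:
    """Return human-readable problems found in a snakemake config dict."""
    problems = []
    # At least one sample -> alignment mapping is needed to build targets.
    if not config.get("samples"):
        problems.append("samples: at least one sample is required")
    for key in ("outdir", "image_filename_delimiter"):
        if key not in config:
            problems.append(f"{key}: missing")
    # Each data entry needs a known data_source and a file path.
    for key in REQUIRED_DATA_KEYS:
        entry = config.get(key)
        if not isinstance(entry, dict):
            problems.append(f"{key}: missing")
            continue
        if entry.get("data_source") not in VALID_SOURCES:
            problems.append(f'{key}.data_source: must be "s3" or "local"')
        if not entry.get("file"):
            problems.append(f"{key}.file: missing")
    return problems


if __name__ == "__main__":
    example = {
        "samples": {"HG03687": "/path/to/cram"},
        "fasta": {"data_source": "local", "file": "/path/to/reference"},
        "fai": {"data_source": "local", "file": "/path/to/reference_index"},
        "vcf": {"data_source": "ftp", "file": "/path/to/vcf"},
        "image_filename_delimiter": "-",
        "outdir": "/path/to/output_directory",
    }
    for problem in find_config_problems(example):
        print(problem)
```

Collecting all problems into a list, rather than raising on the first, lets a user fix the whole config in one pass.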


--------------------------------------------------------------------------------
/workflows/envs/samplot.yaml:
--------------------------------------------------------------------------------
 1 | name: samplot
 2 | channels:
 3 |   - bioconda
 4 |   - conda-forge
 5 | dependencies:
 6 |   - python=3.6
 7 |   - matplotlib==3.1.0
 8 |   - numpy
 9 |   - cython
10 |   - bcftools
11 |   - htslib
12 |   - imagemagick
13 |   - joblib
14 |   - samplot
15 |   - pip
16 |   - pip:
17 |       - "git+https://github.com/pysam-developers/pysam.git"
18 |       - "git+https://github.com/mchowdh200/samplot.git"
19 | 


--------------------------------------------------------------------------------
/workflows/envs/tensorflow.yaml:
--------------------------------------------------------------------------------
 1 | name: tensorflow
 2 | dependencies:
 3 |   - python=3.7
 4 |   - h5py=2.10
 5 |   - joblib
 6 |   - pip
 7 |   - pip:
 8 |       - tensorflow==2.0.0
 9 |       - tensorflow-addons==0.6.0
10 |   
11 | 


--------------------------------------------------------------------------------
/workflows/samplot-ml-predict.smk:
--------------------------------------------------------------------------------
  1 | ## General TODOs
  2 | # * provide example scripts on how to programmatically create sample: file pairs
  3 | #   in the config yaml.
  4 | # * or make a rule that looks at the RG tags to get sample names from list of bams
  5 | 
  6 | import os
  7 | import functools
  8 | from glob import glob
  9 | from config_utils import Conf
 10 | 
 11 | 
 12 | ## Setup
 13 | ################################################################################
 14 | configfile: 'conf/samplot-ml-predict.yaml'
 15 | conf = Conf(config)
 16 | 
 17 | ## Rules
 18 | ################################################################################
 19 | rule All:
 20 |     input:
 21 |         expand(f'{conf.outdir}/samplot-ml-results/{{sample}}-samplot-ml.vcf.gz',
 22 |                sample=conf.samples)
 23 | 
 24 | 
 25 | gargs = f'{conf.outdir}/bin/gargs'
 26 | rule InstallGargs:
 27 |     """
 28 |     Install system appropriate binary of gargs for use in image generation rules.
 29 |     """
 30 |     output:
 31 |         gargs
 32 |     shell:
 33 |         f'bash scripts/install_gargs.sh {gargs}'
 34 | 
 35 | 
 36 | rule GetReference:
 37 |     output:
 38 |         fasta = conf.fasta.output,
 39 |         fai = conf.fai.output
 40 |     run:
 41 |         shell(conf.fasta.get_cmd())
 42 |         shell(conf.fai.get_cmd())
 43 | 
 44 | 
 45 | rule GetBaseVCF:
 46 |     output:
 47 |         conf.vcf.output
 48 |     run:
 49 |         shell(conf.vcf.get_cmd())
 50 | 
 51 | 
 52 | rule GetDelRegions:
 53 |     """
 54 |     Get a sample's del regions from the vcf in bed format
 55 |     """
 56 |     input:
 57 |         conf.vcf.output
 58 |     output:
 59 |         f'{conf.outdir}/bed/{{sample}}-del-regions.bed'
 60 |     conda:
 61 |         'envs/samplot.yaml'
 62 |     shell:
 63 |         f"""
 64 |         [[ ! -d {conf.outdir}/bed ]] && mkdir {conf.outdir}/bed
 65 |         bash scripts/get_del_regions.sh {{input}} {{wildcards.sample}} > {{output}}
 66 |         """
 67 | 
 68 | 
 69 | def get_images(rule, wildcards):
 70 |     """
 71 |     Return list of output images from the GenerateImages/CropImages checkpoints
 72 |     """
 73 |     # gets output of the checkpoint (a directory) and re-evals workflow DAG 
 74 |     if rule == "GenerateImages":
 75 |         image_dir = checkpoints.GenerateImages.get(sample=wildcards.sample).output[0]
 76 |     elif rule == "CropImages":
 77 |         image_dir = checkpoints.CropImages.get(sample=wildcards.sample).output[0]
 78 |     else:
 79 |         raise ValueError(f'Unknown argument for rule: {rule}. '
 80 |                          'Must be "GenerateImages" or "CropImages"')
 81 |     return glob(f'{image_dir}/*.png')
 82 | 
 83 | 
 84 | rule GenerateImages:
 85 |     """
 86 |     Images from del regions for a given sample.
 87 |     """
 88 |     threads: workflow.cores
 89 |     input:
 90 |         # bam/cram file: from config. could be a url
 91 |         # therefore will not be tracked by snakemake
 92 |         gargs_bin = gargs,
 93 |         fasta = conf.fasta.output,
 94 |         fai = conf.fai.output,
 95 |         regions = rules.GetDelRegions.output
 96 |     output:
 97 |         directory(f'{conf.outdir}/img/{{sample}}')
 98 |     params:
 99 |         bam = lambda wildcards: conf.alignments[wildcards.sample]
100 |     conda:
101 |         'envs/samplot.yaml'
102 |     shell:
103 |         # TODO put the gen_img.sh script into a function in images_from_regions.sh
104 |         f"""
105 |         mkdir -p {conf.outdir}/img
106 |         bash scripts/images_from_regions.sh \\
107 |             --gargs-bin {{input.gargs_bin}} \\
108 |             --fasta {{input.fasta}} \\
109 |             --regions {{input.regions}} \\
110 |             --bam {{params.bam}} \\
111 |             --outdir {conf.outdir}/img/{{wildcards.sample}} \\
112 |             --delimiter {conf.delimiter} \\
113 |             --processes {{threads}}
114 |         """
115 | 
116 | 
117 | rule CropImages:
118 |     """
119 |     Crop axes and text from images to prepare for samplot-ml input
120 |     """
121 |     threads: workflow.cores
122 |     input:
123 |         gargs_bin = gargs,
124 |         imgs = rules.GenerateImages.output
125 |         # imgs = functools.partial(get_images, 'GenerateImages')
126 |     output:
127 |         directory(f'{conf.outdir}/crop/{{sample}}')
128 |     conda:
129 |         'envs/samplot.yaml'
130 |     shell:
131 |         f"""
132 |         [[ ! -d {conf.outdir}/crop ]] && mkdir {conf.outdir}/crop
133 |         bash scripts/crop.sh -i {{input.imgs}} \\
134 |                              -o {{output}} \\
135 |                              -p {{threads}} \\
136 |                              -g {{input.gargs_bin}}
137 |         """
138 | 
139 | rule CreateImageList:
140 |     """
141 |     Samplot-ml needs list of input images. This rule takes the list
142 |     of a sample's cropped images and puts them in a text file.
143 |     """
144 |     input:
145 |         #functools.partial(get_images, 'CropImages')
146 |         rules.CropImages.output
147 |     output:
148 |         temp(f'{conf.outdir}/{{sample}}-cropped-imgs.txt')
149 |     run:
150 |         with open(output[0], 'w') as out:
151 |             # for image_file in input:
152 |             for image_file in glob(f'{conf.outdir}/crop/{wildcards.sample}/*.png'):
153 |                 out.write(f'{image_file}\n')
154 | 
155 | 
156 | rule PredictImages:
157 |     """
158 |     Feed images into samplot-ml to get a bed file of predictions.
159 |     Prediction format (tab separated):
160 |         - chrm start end p_ref p_het p_alt
161 |     """
162 |     threads: workflow.cores
163 |     input:
164 |         f'{conf.outdir}/{{sample}}-cropped-imgs.txt'
165 |     output:
166 |         f'{conf.outdir}/{{sample}}-predictions.bed'
167 |     conda:
168 |         'envs/tensorflow.yaml'
169 |     shell:
170 |         """
171 |         python scripts/predict.py \\
172 |             --image-list {input} \\
173 |             --delimiter {conf.delimiter} \\
174 |             --processes {threads} \\
175 |             --batch-size {threads} \\
176 |             --model-path saved_models/samplot-ml.h5 \\
177 |         > {output}
178 |         """
179 | 
180 | 
181 | rule AnnotateVCF:
182 |     input:
183 |         vcf = conf.vcf.output,
184 |         bed = f'{conf.outdir}/{{sample}}-predictions.bed'
185 |     output:
186 |         f'{conf.outdir}/samplot-ml-results/{{sample}}-samplot-ml.vcf.gz'
187 |     conda:
188 |         'envs/samplot.yaml'
189 |     shell:
190 |         """
191 |         [[ ! -d {conf.outdir}/samplot-ml-results ]] && mkdir {conf.outdir}/samplot-ml-results
192 |         bcftools view -s {wildcards.sample} {input.vcf} |
193 |         python scripts/annotate.py {input.bed} {wildcards.sample} |
194 |         bgzip -c > {output}
195 |         """
196 |         
197 | 
198 | 
199 | 
200 | 
201 | 


--------------------------------------------------------------------------------
/workflows/saved_models/samplot-ml.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mchowdh200/samplot-ml/ec5dab5d19c094f03070fba25d3b8db780a4e5d3/workflows/saved_models/samplot-ml.h5


--------------------------------------------------------------------------------
/workflows/scripts/annotate.py:
--------------------------------------------------------------------------------
 1 | import sys
 2 | import numpy as np
 3 | import pysam
 4 | 
 5 | bed = sys.argv[1]
 6 | sample = sys.argv[2]
 7 | 
 8 | predictions = {}
 9 | genotypes = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
10 | 
11 | 
12 | # go through the predictions bed file
13 | # and get the predictions keyed by region
14 | for l in open(bed, 'r'):
15 |     A = l.rstrip().split()
16 |     region = '\t'.join(A[:3])
17 |     predictions[region] = [float(x) for x in A[3:]] # prediction score
18 | 
19 | # Pipe the vcf from stdin.
20 | # iterate over each record and replace the genotype
21 | # with the predicted genotypes
22 | with pysam.VariantFile('-', 'rb') as VCF:
23 |     # get the old genotype and prediction scores into the vcf FORMAT fields
24 |     VCF.header.add_meta('FORMAT', items=[('ID', 'OLDGT'),
25 |                                          ('Number', '1'),
26 |                                          ('Type', 'String'),
27 |                                          ('Description', 'Genotype before samplot-ml')])
28 |     VCF.header.add_meta('FORMAT', items=[('ID', 'PREF'),
29 |                                          ('Number', '1'),
30 |                                          ('Type', 'Float'),
31 |                                          ('Description', 'Samplot-ml P(0/0) prediction score')])
32 |     VCF.header.add_meta('FORMAT', items=[('ID', 'PHET'),
33 |                                          ('Number', '1'),
34 |                                          ('Type', 'Float'),
35 |                                          ('Description', 'Samplot-ml P(0/1) prediction score')])
36 |     VCF.header.add_meta('FORMAT', items=[('ID', 'PALT'),
37 |                                          ('Number', '1'),
38 |                                          ('Type', 'Float'),
39 |                                          ('Description', 'Samplot-ml P(1/1) prediction score')])
40 |     print(str(VCF.header).strip())
41 |     for variant in VCF:
42 |         region = '\t'.join([str(x) for x in [variant.contig,
43 |                                           variant.pos,
44 |                                           variant.stop]]) 
45 |         oldgt = variant.samples[sample].allele_indices  # None marks a missing allele
46 |         variant.samples[sample]["OLDGT"] = "/".join("." if a is None else str(a) for a in oldgt)
47 |         if region in predictions:
48 |             pref, phet, palt = predictions[region]
49 |             variant.samples[sample]["PREF"] = pref
50 |             variant.samples[sample]["PHET"] = phet
51 |             variant.samples[sample]["PALT"] = palt
52 | 
53 |             # set new GT
54 |             variant.samples[sample].allele_indices = genotypes[
55 |                 np.argmax(predictions[region])]
56 |         else:
57 |             variant.samples[sample]["PREF"] = float("nan")
58 |             variant.samples[sample]["PHET"] = float("nan")
59 |             variant.samples[sample]["PALT"] = float("nan")
60 |         print(str(variant).rstrip())
61 | 
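
The genotype replacement in `annotate.py` reduces to picking the class with the highest prediction score. The sketch below restates that step in plain Python without `numpy` or `pysam`. `call_genotype` is a hypothetical name used only for illustration; the score order (`PREF`, `PHET`, `PALT`) and the index-to-genotype mapping follow the script above.

```python
# Index of the highest score -> genotype string, as in annotate.py's
# `genotypes` mapping: 0 -> 0/0, 1 -> 0/1, 2 -> 1/1.
GENOTYPES = ("0/0", "0/1", "1/1")


def call_genotype(pref: float, phet: float, palt: float) -> str:
    """Return the genotype whose prediction score is largest."""
    scores = (pref, phet, palt)
    # max() returns the first index on ties, matching numpy.argmax.
    best = max(range(len(scores)), key=scores.__getitem__)
    return GENOTYPES[best]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(call_genotype(0.02, 0.18, 0.80))
```

A caller that wants to flag likely false positives rather than re-genotype could instead threshold on `pref` directly; that choice is left to the user and is not something the workflow does.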


--------------------------------------------------------------------------------
/workflows/scripts/crop.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | 
 3 | function crop
 4 | {
 5 |     input_img=$1
 6 |     outdir=$2
 7 |     output_img=$outdir/$(basename $input_img)
 8 | 
 9 |     convert \
10 |         -crop 2090x575+175+200 \
11 |         -fill white \
12 |         -draw "rectangle 0,30 500,50" \
13 |         -draw "rectangle 600,0 700,50" \
14 |         $input_img $output_img
15 |     echo $output_img
16 | }
17 | export -f crop
18 | 
19 | while (( "$#" )); do
20 |     case "$1" in
21 |         -i|--imgdir)
22 |             imgdir=$2
23 |             shift 2;;
24 |         -o|--outdir)
25 |             outdir=$2
26 |             shift 2;;
27 |         -p|--processes)
28 |             processes=$2
29 |             shift 2;;
30 |         -g|--gargs-bin)
31 |             gargs_bin=$2
32 |             shift 2;;
33 |         --) # end argument parsing
34 |             shift
35 |             break;;
36 |         -*|--*=) # unsupported flags
37 |             echo "Error: Unsupported flag $1" >&2
38 |             exit 1;;
39 |     esac
40 | done
41 | 
42 | mkdir -p $outdir
43 | find $imgdir -name '*.png' | $gargs_bin -p $processes "crop {0} $outdir"
44 | 


--------------------------------------------------------------------------------
/workflows/scripts/gen_img.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | set -eu
 3 | 
 4 | while (( "$#" )); do
 5 |     case "$1" in
 6 |         -c|--chrom)
 7 |             chrom=$2
 8 |             shift 2;;
 9 |         -s|--start)
10 |             start=$2
11 |             shift 2;;
12 |         -e|--end)
13 |             end=$2
14 |             shift 2;;
15 |         -n|--sample)
16 |             sample=$2
17 |             shift 2;;
18 |         -g|--genotype)
19 |             genotype=$2
20 |             shift 2;;
21 |         -m|--min-mqual)
22 |             min_mq=$2
23 |             shift 2;;
24 |         -f|--fasta)
25 |             fasta=$2
26 |             shift 2;;
27 |         -b|--bam)
28 |             bam=$2
29 |             shift 2;;
30 |         -d|--delimiter)
31 |             delimiter=$2
32 |             shift 2;;
33 |         -o|--outdir)
34 |             outdir=$2
35 |             shift 2;;
36 |         --) # end argument parsing
37 |             shift
38 |             break;;
39 |         -*|--*=) # unsupported flags
40 |             echo "Error: Unsupported flag $1" >&2
41 |             exit 1;;
42 |     esac
43 | done
44 | 
45 | # out=$outdir/${chrom}_${end}_${sample}_${genotype}.png
46 | out=$outdir/$(echo "$chrom $start $end $sample $genotype.png" | tr ' ' $delimiter)
47 | echo $out
48 | svlen=$(($end-$start))
49 | # window=$(python -c "print(int($svlen * 0.5))")
50 | 
51 | if [[ $svlen -gt 5000 ]]; then
52 |     samplot.py \
53 |         --zoom 1000 \
54 |         --chrom $chrom --start $start --end $end \
55 |         --min_mqual $min_mq \
56 |         --sv_type DEL \
57 |         --bams $bam \
58 |         --reference $fasta \
59 |         --output_file $out
60 | else
61 |     samplot.py \
62 |         --chrom $chrom --start $start --end $end \
63 |         --min_mqual $min_mq \
64 |         --sv_type DEL \
65 |         --bams $bam \
66 |         --reference $fasta \
67 |         --output_file $out
68 | fi
69 | 
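
The two `samplot.py` invocations in `gen_img.sh` differ only in the `--zoom 1000` flag, which is added when the deletion is longer than 5000 bp. The sketch below expresses that decision as a single argument-list builder. It is an illustration only: `samplot_args` and its parameter names are hypothetical, and the flags are copied from the script above rather than from samplot's documentation.

```python
def samplot_args(chrom, start, end, bam, fasta, out,
                 min_mqual=10, zoom_threshold=5000):
    """Build the samplot.py argument list used for one DEL region."""
    args = ["samplot.py"]
    # Long deletions are zoomed so that only the breakpoints are plotted.
    if end - start > zoom_threshold:
        args += ["--zoom", "1000"]
    args += [
        "--chrom", chrom, "--start", str(start), "--end", str(end),
        "--min_mqual", str(min_mqual),
        "--sv_type", "DEL",
        "--bams", bam,
        "--reference", fasta,
        "--output_file", out,
    ]
    return args


if __name__ == "__main__":
    print(" ".join(samplot_args("chr1", 934098, 934868,
                                "HG03687.final.cram", "ref.fa",
                                "chr1-934098-934868-HG03687-DEL.png")))
```

Building the list once and adding the optional flag conditionally keeps the shared arguments in one place, which is the same refactor the duplicated shell branches would allow.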


--------------------------------------------------------------------------------
/workflows/scripts/get_del_regions.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | # given:
 3 | #     * input vcf
 4 | #     * sample name
 5 | # output:
 6 | #     * sample's DEL regions in bed format to stdout
 7 | 
 8 | # 1. get sample's part of vcf
 9 | # 2. get SVTYPE = DEL (only het/alt genotypes)
10 | # 3. bcftools query into bed format with format:
11 | #    %CHROM\t%POS\t%INFO/END\t%SVTYPE\t[%SAMPLE]\n
12 | 
13 | vcf=$1
14 | sample=$2
15 | 
16 | # note use single & for within sample logic
17 | bcftools view -s $sample -i 'SVTYPE="DEL"' $vcf |
18 |     bcftools query -i 'GT!="0/0" & GT!="./."' \
19 |         -f '%CHROM\t%POS\t%INFO/END\t%SVTYPE\t[%SAMPLE]\n'
20 | 


--------------------------------------------------------------------------------
/workflows/scripts/images_from_regions.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | # given a set of bed regions, reference fasta, and bam/cram
 3 | # this script will output the regions to gargs which will generate
 4 | # individual samplot images using gen_img.sh
 5 | set -eu
 6 | while (( "$#" )); do
 7 |     case "$1" in
 8 |         -g|--gargs-bin)
 9 |             gargs_bin=$2
10 |             shift 2;;
11 |         -f|--fasta)
12 |             fasta=$2
13 |             shift 2;;
14 |         -r|--regions)
15 |             regions=$2
16 |             shift 2;;
17 |         -b|--bam)
18 |             bam=$2
19 |             shift 2;;
20 |         -d|--delimiter) # delimiter between chrm, start, end, etc in filename
21 |             delimiter=$2
22 |             shift 2;;
23 |         -o|--outdir)
24 |             outdir=$2
25 |             shift 2;;
26 |         -p|--processes)
27 |             processes=$2
28 |             shift 2;;
29 |         --) # end argument parsing
30 |             shift
31 |             break;;
32 |         *) # unsupported flags or stray positional arguments
33 |             echo "Error: Unsupported argument $1" >&2
34 |             exit 1;;
35 |     esac
36 | done
37 | 
38 | mkdir -p "$outdir"
39 | 
40 | # format of regions bed is:
41 | # chrom start end svtype sample
42 | echo $PWD
43 | cat $regions | $gargs_bin -e -p $processes "bash scripts/gen_img.sh \\
44 |     --chrom {0} --start {1} --end {2} --genotype {3} --sample {4} \\
45 |     --min-mqual 10 --fasta $fasta --bam $bam \\
46 |     --delimiter $delimiter --outdir $outdir"
47 | 


--------------------------------------------------------------------------------
/workflows/scripts/install_gargs.sh:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env bash
 2 | gargs=$1
 3 | install_dir=$(dirname $gargs)
 4 | mkdir -p "$install_dir"
 5 | case "$(uname -s)" in
 6 |     Darwin)
 7 |         wget https://github.com/brentp/gargs/releases/download/v0.3.9/gargs_darwin \
 8 |              -O $gargs
 9 |         ;;
10 |     Linux)
11 |         wget https://github.com/brentp/gargs/releases/download/v0.3.9/gargs_linux \
12 |              -O $gargs
13 |         ;;
14 | esac
15 | chmod +x $gargs
16 | 


--------------------------------------------------------------------------------
/workflows/scripts/predict.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import argparse
 3 | import numpy as np
 4 | import tensorflow as tf
 5 | import tensorflow_addons as tfa
 6 | 
 7 | # data loading, CNN models
 8 | from utils import datasets, models
 9 | 
10 | def main(args):
11 |     model = tf.keras.models.load_model(args.model_path)
12 |     dataset = datasets.DataWriter.get_basic_dataset(
13 |         args.image_list, args.processes)
14 |     dataset = dataset.batch(args.batch_size, drop_remainder=False)
15 |     dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
16 |     
17 |     for filenames, images in dataset:
18 |         predictions = model(images)
19 |         for file, pred in zip(filenames, predictions):
20 |             f = os.path.splitext(os.path.basename(file.numpy()))[0]
21 |             region = f.decode().split(args.delimiter)
22 |             print(*region[:3], sep='\t', end='\t')
23 |             print(*pred.numpy(), sep='\t')
24 | 
25 | if __name__ == '__main__':
26 |     parser = argparse.ArgumentParser()
27 |     parser.add_argument(
28 |         '--model-path', dest='model_path', type=str, required=True,
29 |         help='Path of trained model')
30 |     parser.add_argument(
31 |         '--image-list', dest='image_list', type=str, required=True,
32 |         help='list of image file paths.')
33 |     parser.add_argument(
34 |         '--processes', dest='processes', type=int, required=True,
35 |         help='number of simultaneous processes.')
36 |     parser.add_argument(
37 |         '--batch-size', dest='batch_size', type=int, required=True,
38 |         help='number of images per batch.')
39 |     parser.add_argument(
40 |         '--delimiter', dest='delimiter', type=str, required=True,
41 |         help='delimiter within image file name.')
42 |     args = parser.parse_args()
43 |     main(args)
44 | 


--------------------------------------------------------------------------------
/workflows/scripts/test.sh:
--------------------------------------------------------------------------------
1 | x=1
2 | y=2
3 | 
4 | svlen=$((y-x))
5 | echo $svlen
6 | 


--------------------------------------------------------------------------------
/workflows/scripts/utils/datasets.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import sys
  3 | import functools
  4 | from itertools import zip_longest, takewhile
  5 | 
  6 | import numpy as np
  7 | import tensorflow as tf
  8 | from joblib import Parallel, delayed
  9 | 
 10 | 
 11 | # Original images are 2090 x 575
 12 | ORIG_SHAPE = [575, 2090, 3]
 13 | 
 14 | # we down scale each dimension by constant factor
 15 | SCALE_FACTOR = 8
 16 | IMAGE_SHAPE = np.array([np.ceil(ORIG_SHAPE[0]/SCALE_FACTOR).astype(int), 
 17 |                         np.ceil(ORIG_SHAPE[1]/SCALE_FACTOR).astype(int), 
 18 |                         3])
 19 | 
 20 | class DataWriter:
 21 |     def __init__(self, data_list, out_dir, training, num_classes=3):
 22 |         self.out_dir = out_dir
 23 |         self.training = training
 24 |         self.num_classes=num_classes
 25 |         self.filenames = [fname.rstrip() for fname in open(data_list)]
 26 |         self.labels = DataWriter._get_labels(self.filenames, num_classes)
 27 |         assert len(self.filenames) == len(self.labels)
 28 | 
 29 |     @staticmethod
 30 |     def _grouper(iterable, n, fillvalue=None):
 31 |         """
 32 |         Collect data into fixed-length chunks or blocks,
 33 |         Taken from python itertools docs
 34 |         """
 35 |         # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
 36 |         args = [iter(iterable)] * n
 37 |         return zip_longest(fillvalue=fillvalue, *args)
 38 | 
 39 |     @staticmethod
 40 |     def _int64_feature(value):
 41 |         if not isinstance(value, list):
 42 |             value = [value]
 43 |         return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
 44 | 
 45 |     @staticmethod
 46 |     def _bytes_feature(value):
 47 |         # If the value is an eager tensor BytesList 
 48 |         # won't unpack a string from an EagerTensor.
 49 |         if isinstance(value, tf.Tensor):
 50 |             value = value.numpy() 
 51 |         elif not isinstance(value, bytes):
 52 |             value = value.encode()
 53 |         return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
 54 | 
 55 |     @staticmethod
 56 |     def _get_labels(filenames, num_classes=3):
 57 |         """
 58 |         ## TODO incorporate variable delimiter instead of _
 59 |         Filenames are in the following format:
 60 |             <chrm>_<start>_<end>_<sample>_<genotype>.png
 61 |         We will use the genotypes as the labels.
 62 |         Label smoothing is not applied here; this function only maps
 63 |         genotype strings to integer class indices.
 64 |         """
 65 |         labels = [f.split('_')[-1].split('.')[0].lower() for f in filenames]
 66 | 
 67 |         if num_classes == 3:
 68 |             label_to_index = {
 69 |                 'ref': 0, 'ref-tn': 0, 'ref-fp': 0,
 70 |                 'het': 1, 'alt': 2
 71 |             }
 72 |         elif num_classes == 4:
 73 |             label_to_index = {
 74 |                 'ref-tn': 0, 'ref-fp': 1,
 75 |                 'het': 2, 'alt': 3
 76 |             }
 77 |         else:
 78 |             label_to_index = {'ref': 0, 'del': 1}
 79 |         return [label_to_index[l] for l in labels]
 80 | 
 81 |     @staticmethod
 82 |     def _load_image(path):
 83 |         """
 84 |         Load an image from the given filepath for use with a tensorflow dataset.
 85 |         Most images we have are 3 channel, but some have 1 or 4 channels, so we
 86 |         decode everything as 3 channel, scale to the 0-1 range, then resize.
 87 |         """
 88 |         image = tf.io.read_file(path)
 89 |         image = tf.image.decode_png(image, channels=3)
 90 |         image = tf.image.convert_image_dtype(image, tf.float32)
 91 |         image = tf.image.resize(image, IMAGE_SHAPE[:2])
 92 |         return image
 93 | 
 94 |     @staticmethod
 95 |     def get_basic_dataset(image_list, num_processes):
 96 |         """
 97 |         return a tensorflow dataset consisting of just filenames
 98 |         and images (no labels)
 99 |         """
100 |         filename_ds = tf.data.Dataset.from_tensor_slices(
101 |             [filename.rstrip() for filename in open(image_list, 'r')])
102 |         image_ds = filename_ds.map(
103 |             DataWriter._load_image,
104 |             num_parallel_calls=num_processes).map(
105 |                 tf.image.per_image_standardization,
106 |                 num_parallel_calls=num_processes
107 |             )
108 | 
109 |         return tf.data.Dataset.zip((filename_ds, image_ds))
110 | 
111 | 
112 |     @staticmethod
113 |     def _serialize_example(filename, label):
114 |         """
115 |         given a filepath to an image (label contained in filename),
116 |         serialize the (image, label)
117 |         """
118 |         example = tf.train.Example(
119 |             features=tf.train.Features(
120 |                 feature={
121 |                     'filename': DataWriter._bytes_feature(filename),
122 |                     'image': DataWriter._bytes_feature(
123 |                         tf.io.serialize_tensor(DataWriter._load_image(filename))),
124 |                     'label': DataWriter._int64_feature(label),
125 |                 }
126 |             )
127 |         )
128 |         return example.SerializeToString()
129 | 
130 |     @staticmethod
131 |     def _write_batch(out_dir, batch_index, file_label_pairs, training):
132 |         """
133 |         Write a single batch of images to TFRecord format
134 |         """
135 |         with tf.io.TFRecordWriter(
136 |             f"{out_dir}/{training}/{training}_{batch_index:05d}.tfrec") as writer:
137 |             for file_label in file_label_pairs:
138 |                 filename, label = file_label
139 |                 serialized_example = DataWriter._serialize_example(filename, label)
140 |                 writer.write(serialized_example)
141 | 
142 |     def to_tfrecords(self, imgs_per_record=1000):
143 |         """
144 |         Write train and/or val set to a set of TFRecords
145 |         """
146 |         Parallel(n_jobs=-1)(
147 |             delayed(self._write_batch)(self.out_dir,
148 |                                  batch_index=i,
149 |                                  file_label_pairs=takewhile(
150 |                                      lambda x: x is not None,
151 |                                      file_label_pairs),
152 |                                  training=self.training) 
153 |             for i, file_label_pairs in enumerate(
154 |                 DataWriter._grouper(zip(self.filenames, self.labels), imgs_per_record))
155 |         )
156 | 
157 | 
158 | class DataReader:
159 |     def __init__(self, 
160 |                  data_list,  # list of original images in the dataset
161 |                  tfrec_list, # list of tfrecords in the dataset (s3 or local)
162 |                  num_processes,
163 |                  batch_size):
164 | 
165 |         self.data_list = data_list
166 |         self.tfrec_list = tfrec_list
167 |         self.num_processes = num_processes
168 |         self.batch_size = batch_size
169 | 
170 |     @staticmethod
171 |     def _parse_image(x):
172 |         """
173 |         Used to reformat serialized images to original shape
174 |         """
175 |         result = tf.io.parse_tensor(x, out_type=tf.float32)
176 |         result = tf.reshape(result, IMAGE_SHAPE)
177 |         return result
178 | 
179 |     @staticmethod
180 |     def _parse_serialized_example(serialized_example):
181 |         """
182 |         Given a serialized example with the below format, extract/deserialize
183 |         the image and (one-hot) label
184 |         """
185 |         features = {
186 |             'filename': tf.io.FixedLenFeature((), tf.string),
187 |             'image': tf.io.FixedLenFeature((), tf.string),
188 |             'label': tf.io.FixedLenFeature((), tf.int64)
189 |         }
190 |         example = tf.io.parse_single_example(serialized_example, features)
191 | 
192 |         # read image and perform transformations
193 |         image = DataReader._parse_image(example['image'])
194 |         image = tf.image.per_image_standardization(image)
195 | 
196 |         label = example['label']
197 |         return image, tf.one_hot(label, depth=3, dtype=tf.int64)
198 |     
199 |     def get_dataset(self):
200 |         n_images = len(open(self.data_list).readlines())
201 | 
202 |         # we don't need examples to be loaded in order (better speed)
203 |         options = tf.data.Options()
204 |         options.experimental_deterministic = False
205 | 
206 |         with open(self.tfrec_list) as f:
207 |             files = [filename.rstrip() for filename in f]
208 |             dataset = tf.data.Dataset.from_tensor_slices(files) \
209 |                     .shuffle(len(files)) \
210 |                     .with_options(options)
211 | 
212 | 
213 |         dataset = dataset.interleave(
214 |             tf.data.TFRecordDataset,
215 |             cycle_length=self.num_processes,
216 |             num_parallel_calls=self.num_processes)
217 | 
218 |         dataset = dataset.map(
219 |             functools.partial(DataReader._parse_serialized_example),
220 |             num_parallel_calls=self.num_processes) \
221 |                 .repeat() \
222 |                 .shuffle(buffer_size=1000) \
223 |                 .batch(self.batch_size, drop_remainder=False) \
224 |                 .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
225 | 
226 |         return dataset, n_images
227 | 
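
The TODO in `_get_labels` asks for a variable delimiter in place of the hard-coded `_`. One possible shape for that is sketched below in plain Python. The names `parse_image_name` and `label_index` are hypothetical, the mapping is the 3-class one from `_get_labels`, and the sketch assumes the chromosome and sample names do not themselves contain the delimiter.

```python
import os

# 3-class mapping copied from _get_labels.
LABEL_TO_INDEX = {"ref": 0, "ref-tn": 0, "ref-fp": 0, "het": 1, "alt": 2}


def parse_image_name(path, delimiter="_"):
    """Split <chrm><d><start><d><end><d><sample><d><genotype>.png into fields."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # maxsplit=4 keeps genotypes such as 'ref-tn' intact when delimiter is '-'.
    chrom, start, end, sample, genotype = stem.split(delimiter, 4)
    return chrom, int(start), int(end), sample, genotype.lower()


def label_index(path, delimiter="_"):
    """Map the genotype field of an image file name to its class index."""
    return LABEL_TO_INDEX[parse_image_name(path, delimiter)[-1]]
```

Working on the basename also keeps a delimiter that appears in a directory name from affecting the parsed fields.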


--------------------------------------------------------------------------------
/workflows/scripts/utils/models.py:
--------------------------------------------------------------------------------
 1 | import tensorflow as tf
 2 | import tensorflow_addons as tfa
 3 | 
 4 | class Conv2DBlock:
 5 |     """
 6 |     Composition of 2D convolutions
 7 |     * Why isn't this a keras Layer?  I found that it just makes 
 8 |       whole model saving + loading a real pain so I just wrote this
 9 |       as if it was a keras Layer but did not inherit from Layer
10 |     """
11 |     def __init__(self, n_channels=32, n_layers=1, kernel_regularizer=None,
12 |                  kernel_size=(3, 3), dilation_rate=(1, 1)):
13 |         # layer properties
14 |         self.n_channels = n_channels
15 |         self.n_layers = n_layers
16 |         self.kernel_regularizer = kernel_regularizer
17 |         self.kernel_size = kernel_size
18 |         self.dilation_rate = dilation_rate
19 | 
20 |         # sub layers
21 |         self.conv_layers = [
22 |             tf.keras.layers.Conv2D(
23 |                 filters=self.n_channels, kernel_size=self.kernel_size,
24 |                 dilation_rate=self.dilation_rate,
25 |                 kernel_regularizer=self.kernel_regularizer, padding='same')
26 |             for i in range(self.n_layers)]
27 |         self.bnorm_layers = [
28 |             tf.keras.layers.BatchNormalization()
29 |             for i in range(self.n_layers)]
30 |         self.leaky_relu_layers = [
31 |             # tf.keras.layers.LeakyReLU()
32 |             tfa.layers.GeLU()
33 |             for i in range(self.n_layers)]
34 | 
35 |     def __call__(self, x):
36 |         for conv, bnorm, leaky_relu in zip(
37 |             self.conv_layers, self.bnorm_layers, self.leaky_relu_layers):
38 |             x = conv(x)
39 |             x = bnorm(x)
40 |             x = leaky_relu(x)
41 |         return x
42 | 
43 | 
44 | class ResidualBlock(Conv2DBlock):
45 |     def __init__(self, **kwargs):
46 |         super().__init__(**kwargs)
47 |         self.add = tf.keras.layers.Add()
48 |         # self.leaky_relu_out = tf.keras.layers.LeakyReLU()
49 |         self.leaky_relu_out = tfa.layers.GeLU()
50 | 
51 |     def __call__(self, x):
52 |         temp = x
53 |         x = super().__call__(x)
54 |         x = self.add([temp, x])
55 |         return self.leaky_relu_out(x)
56 | 
57 | 
58 | def CNN(num_classes=3):
59 |     """
60 |     Construct and return an (uncompiled) conv2d model out of Conv2DBlocks.
61 |     """
62 |     inp = tf.keras.Input(shape=(None, None, 3))
63 |     x = inp
64 |     x = tf.keras.layers.Conv2D(
65 |         filters=32, kernel_size=(7, 7), strides=(1, 1),
66 |         dilation_rate=(2, 2), padding='valid')(x)
67 |     x = tf.keras.layers.BatchNormalization()(x)
68 |     # x = tf.keras.layers.LeakyReLU()(x)
69 |     x = tfa.layers.GeLU()(x)
70 |     x = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2))(x)
71 | 
72 |     for i in range(4):
73 |         x = Conv2DBlock(
74 |             n_channels=32*(i+1), n_layers=1, kernel_size=(1, 1))(x)
75 |         x = ResidualBlock(
76 |             n_channels=32*(i+1), n_layers=3, kernel_size=(3, 3))(x)
77 |         x = ResidualBlock(
78 |             n_channels=32*(i+1), n_layers=3, kernel_size=(3, 3))(x)
79 |         x = ResidualBlock(
80 |             n_channels=32*(i+1), n_layers=3, kernel_size=(3, 3))(x)
81 |         x = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2))(x)
82 | 
83 |     x = tf.keras.layers.GlobalAveragePooling2D()(x)
84 |     x = tf.keras.layers.Dense(1024)(x)
85 |     # x = tf.keras.layers.LeakyReLU()(x)
86 |     x = tfa.layers.GeLU()(x)
87 |     x = tf.keras.layers.Dropout(0.5)(x)
88 | 
89 |     x = tf.keras.layers.Dense(num_classes)(x)
90 |     out = tf.keras.layers.Softmax()(x)
91 |     return tf.keras.Model(inputs=inp, outputs=out)
92 | 
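
To see how the pooling stages in `CNN` shrink an input, the spatial arithmetic can be traced in plain Python without TensorFlow. This is a sketch under stated assumptions: the helper names are hypothetical, the input size is the downscaled `IMAGE_SHAPE` from `datasets.py` (575 x 2090 divided by 8 and rounded up), and the formulas are the standard output-size rules for `'valid'` convolution and pooling.

```python
import math


def conv_valid(n, kernel, dilation=1, stride=1):
    """Output length of a 'valid' convolution along one axis."""
    effective = dilation * (kernel - 1) + 1  # a dilated 7-tap kernel spans 13
    return (n - effective) // stride + 1


def pool_valid(n, pool=2, stride=2):
    """Output length of a 'valid' max pool along one axis."""
    return (n - pool) // stride + 1


def trace_shapes(height, width):
    """Return (h, w, channels) after the stem and after each of the 4 stages."""
    h = pool_valid(conv_valid(height, 7, dilation=2))
    w = pool_valid(conv_valid(width, 7, dilation=2))
    shapes = [(h, w, 32)]
    for i in range(4):
        # The 'same'-padded blocks keep h and w; only the pool halves them.
        h, w = pool_valid(h), pool_valid(w)
        shapes.append((h, w, 32 * (i + 1)))
    return shapes


height = math.ceil(575 / 8)   # 72
width = math.ceil(2090 / 8)   # 262
shapes = trace_shapes(height, width)
```

For the 72 x 262 input the last stage comes out at 1 x 7 with 128 channels, which `GlobalAveragePooling2D` then reduces to a 128-vector.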


--------------------------------------------------------------------------------