├── LICENSE
├── README.md
└── notebooks
    ├── Day2-Germline
    │   ├── 1-gatk-germline-variant-discovery-tutorial.ipynb
    │   ├── 2-gatk-hard-filtering-tutorial-python.ipynb
    │   ├── 3-gatk-hard-filtering-tutorial-r-plotting.ipynb
    │   └── 4-gatk-cnn-tutorial-python.ipynb
    └── Day3-Somatic
        ├── 1-somatic-mutect2-tutorial.ipynb
        └── 2-somatic-cna-tutorial.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2019, GATK workflows
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | 1. Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | 3. Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # gatk4-jupyter-notebook-tutorials
2 |
3 | ### Purpose :
4 | This repository contains Jupyter Notebooks that walk users through GATK Best Practices Workflows.
5 |
6 | ### Notebooks :
7 | - Day2
8 | - 1-gatk-germline-variant-discovery-tutorial.ipynb
9 | - 2-gatk-hard-filtering-tutorial-python.ipynb
10 | - 3-gatk-hard-filtering-tutorial-r-plotting.ipynb
11 | - 4-gatk-cnn-tutorial-python.ipynb
12 | - Day3
13 | - 1-somatic-mutect2-tutorial.ipynb
14 | - 2-somatic-cna-tutorial.ipynb
15 |
16 | ### Software version notes :
17 | - GATK 4.1
18 |
19 | ### Important Note :
20 | - If you are executing these notebooks on a Terra workspace, be sure to use the following startup script when creating a cluster: [gs://gatk-tutorials/scripts/install_gatk_4100_with_condaenv.sh](https://storage.googleapis.com/gatk-tutorials/scripts/install_gatk_4100_with_condaenv.sh)
21 | - Relevant reference and resources bundles can be accessed in [gs://gatk-tutorials/workshop_1903](https://console.cloud.google.com/storage/browser/gatk-tutorials/workshop_1903/?project=broad-dsde-outreach&organizationId=548622027621).
22 | - The following material is provided by the GATK Team. Please post any questions or concerns to one of our forum sites: [GATK](https://gatkforums.broadinstitute.org/gatk/categories/ask-the-team/), [FireCloud](https://gatkforums.broadinstitute.org/firecloud/categories/ask-the-firecloud-team) or [Terra](https://broadinstitute.zendesk.com/hc/en-us/community/topics/360000500432-General-Discussion), and [WDL/Cromwell](https://gatkforums.broadinstitute.org/wdl/categories/ask-the-wdl-team).
23 | - Please visit the [User Guide](https://software.broadinstitute.org/gatk/documentation/) site for further documentation on our workflows and tools.
24 |
25 | ### LICENSING :
26 | Copyright Broad Institute, 2019 | BSD-3
27 | This repository is released under the WDL open source code license (BSD-3) (full license text at https://github.com/openwdl/wdl/blob/master/LICENSE). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these notebooks.
28 |
29 |
--------------------------------------------------------------------------------
/notebooks/Day2-Germline/1-gatk-germline-variant-discovery-tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "GATK Tutorial :: Germline SNPs & Indels :: Worksheet\n",
8 | "===================="
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "**March 2019** \n",
16 | "\n",
17 | "| . | . |\n",
18 | "|:---:|:---:|\n",
19 | "|
| The tutorial demonstrates an effective workflow for joint calling germline SNPs and indels in cohorts of multiple samples. The workflow applies to whole genome or exome data. Specifically, the tutorial uses a trio of WG sample snippets to demonstrate HaplotypeCaller's GVCF workflow for joint variant analysis. We use a GenomicsDB database structure, perform a genotype refinement based on family pedigree, and evaluate the effects of refinement. |\n",
20 | "\n",
21 | "The tutorial was last tested with the broadinstitute/gatk:4.1.0.0 docker and IGV v2.4.13.\n",
22 | "\n",
23 | "---\n",
24 | "**Table of Contents** \n",
25 | "1. HAPLOTYPECALLER BASICS\t\n",
26 | " 1.1 Call variants with HaplotypeCaller in default VCF mode\t\n",
27 | " 1.2 View realigned reads and assembled haplotypes\t\n",
28 | "2. GVCF WORKFLOW\t\n",
29 | " 2.1 Run HaplotypeCaller on a single bam file in GVCF mode\t\n",
30 | " 2.2 Consolidate GVCFs using GenomicsDBImport\t\n",
31 | " 2.3 Run joint genotyping on the trio to generate the VCF\t\n",
32 | "3. GENOTYPE REFINEMENT\t\n",
33 | " 3.1 Refine the genotype calls with CalculateGenotypePosteriors\t\n",
34 | " 3.2 Compare changes with CollectVariantCallingMetrics\t\n",
35 | "---"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### First, make sure the notebook is using a Python 3 kernel in the top right corner.\n",
43 | "A kernel is a _computational engine_ that executes the code in the notebook. We can execute GATK commands using _Python Magic_ (`!`)."
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "### How to run this notebook:\n",
51 | "- **Click to select a gray cell and then pressing SHIFT+ENTER to run the cell.**\n",
52 | "- **Write results to `/home/jupyter-user/2-germline-vd/sandbox/`. To access the directory, click on the upper-left jupyter icon.**\n",
53 | "- **Your output directory will be synced with your workspace bucket in order to view the results using IGV**"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": null,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# Create your sandbox directory\n",
63 | "! mkdir -p /home/jupyter-user/2-germline-vd/sandbox/\n",
64 | "\n",
65 | "# Set you workspace bucket name. **Replace with your bucket**\n",
66 | "%env BUCKET=fc-ea3b695a-7c46-4996-b1ef-7112c1ce5b27\n",
67 | "\n",
68 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
69 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "### Enable reading Google bucket data "
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# Check if data is accessible. The command should list several gs:// URLs.\n",
86 | "! gsutil ls gs://gatk-tutorials/workshop_1903/2-germline/"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {
93 | "scrolled": true
94 | },
95 | "outputs": [],
96 | "source": [
97 | "# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. \n",
98 | "# Afterwards, restart the kernel with Kernel > Restart.\n",
99 | "! pip install google-cloud-storage"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "### Download Data to the Notebook \n",
107 | "Some tools are not able to read directly from a googe bucket, we download their files locally."
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "! mkdir /home/jupyter-user/2-germline-vd/ref\n",
117 | "! mkdir /home/jupyter-user/2-germline-vd/resources\n",
118 | "! gsutil cp gs://gatk-tutorials/workshop_1903/2-germline/ref/* /home/jupyter-user/2-germline-vd/ref\n",
119 | "! gsutil cp gs://gatk-tutorials/workshop_1903/2-germline/trio.ped /home/jupyter-user/2-germline-vd/\n",
120 | "! gsutil cp gs://gatk-tutorials/workshop_1903/2-germline/resources/* /home/jupyter-user/2-germline-vd/resources/"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {},
126 | "source": [
127 | "### Setup IGV\n",
128 | "\n",
129 | "- Download IGV to your local machine if you haven't already done so.\n",
130 | "- Follow the instructions to setup your google account with IGV using this document: [Browse_Genomic_Data](https://googlegenomics.readthedocs.io/en/latest/use_cases/browse_genomic_data/igv.html).\n",
131 | " This allows you to access data from your workspace bucket.\n"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## Call variants with HaplotypeCaller in default VCF mode\n",
139 | "In this first step we run HaplotypeCaller in its simplest form on a single sample to get familiar with its operation and to learn some useful tips and tricks. \n"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "! gatk HaplotypeCaller \\\n",
149 | " -R gs://gatk-tutorials/workshop_1903/2-germline/ref/ref.fasta \\\n",
150 | " -I gs://gatk-tutorials/workshop_1903/2-germline/bams/mother.bam \\\n",
151 | " -O /home/jupyter-user/2-germline-vd/sandbox/motherHC.vcf \\\n",
152 | " -L 20:10,000,000-10,200,000"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
162 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": null,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "! echo gs://$BUCKET/sandbox/motherHC.vcf"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "Load the input BAM file as well as the output VCF (sandbox/motherHC.vcf) in IGV and go to the coordinates 20:10,002,294-10,002,623. Be sure the genome is set to b37.\n",
179 | "\n",
180 | "We see that HaplotypeCaller called a homozygous variant insertion of three T bases. How is this possible when so few reads seem to support an insertion at this position?\n",
181 | "\n",
182 | "| Tool Tip | . |\n",
183 | "| --- | :--- |\n",
184 | "|
| When you encounter indel-related weirdness, turn on the display of soft-clips, which IGV turns off by default. Go to View > Preferences > Alignments and select “Show soft-clipped bases” |\n",
185 | "\n",
186 | "With soft clip display turned on, the region lights up with mismatching bases. For these reads, the aligner (here, BWA MEM) found the penalty of soft-clipping mismatching bases less than the penalty of inserting bases or inserting a gap. \n",
187 | "\n",
188 | "
"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "1.2 View realigned reads and assembled haplotypes\n",
196 | "Let's take a peek under the hood of HaplotypeCaller. HaplotypeCaller has a parameter called -bamout, which allows you to ask for the realigned reads. These realigned reads are what HaplotypeCaller uses to make its variant calls, so you will be able to see if a realignment fixed the messy region in the original bam.\n",
197 | "\n",
198 | "Run the following command:"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "! gatk HaplotypeCaller \\\n",
208 | " -R gs://gatk-tutorials/workshop_1903/2-germline/ref/ref.fasta \\\n",
209 | " -I gs://gatk-tutorials/workshop_1903/2-germline/bams/mother.bam \\\n",
210 | " -O /home/jupyter-user/2-germline-vd/sandbox/motherHCdebug.vcf \\\n",
211 | " -bamout /home/jupyter-user/2-germline-vd/sandbox/motherHCdebug.bam \\\n",
212 | " -L 20:10,002,000-10,003,000"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {},
219 | "outputs": [],
220 | "source": [
221 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
222 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": [
231 | "! echo gs://$BUCKET/sandbox/motherHCdebug.bam"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "Since you are only interested in looking at that messy region, give the tool a narrowed interval with -L 20:10,002,000-10,003,000. \n",
239 | "\n",
240 | "Load the output BAM (sandbox/motherHCdebug.bam) in IGV, and switch to Collapsed view (right-click>Collapsed). You should still be zoomed in on the same coordinates (20:10,002,294-10,002,623), and have the mother.bam track loaded for comparison.\n",
241 | "\n",
242 | "
"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "After realignment by HaplotypeCaller (the bottom track), almost all the reads show the insertion, and the messy soft clips from the original bam are gone. HaplotypeCaller will utilize soft-clipped sequences towards realignment. Expand the reads in the output BAM (right-click>Expanded view), and you can see that all the insertions are in phase with the C/T SNP. \n",
250 | "\n",
251 | "This shows that HaplotypeCaller found a different alignment after performing its local graph assembly step. The reassembled region provided HaplotypeCaller with enough support to call the indel, which position-based callers like UnifiedGenotyper would have missed.\n",
252 | "\n",
253 | "
"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "➤ Focus on the insertion locus. How many different types of insertions do you see? Which one did HaplotypeCaller call in the VCF? What do you think of this choice?\n",
261 | "\n",
262 | "There is more to a BAM than meets the eye--or at least, what you can see in this view of IGV. Right-click on the motherHCdebug.bam track to bring up the view options menu. Select Color alignments by, and choose read group. Your gray reads should now be colored similar to the screenshot below.\n",
263 | "\n",
264 | "
"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {},
270 | "source": [
271 | "Some of the first reads, shown in red at the top of the pile, are not real reads. These represent artificial haplotypes that were constructed by HaplotypeCaller, and are tagged with a special read group identifier, RG:Z:ArtificialHaplotypeRG to differentiate them from actual reassembled reads. You can click on an artificial read to see this tag under Read Group. "
272 | ]
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "| . | . |\n",
279 | "| --- | --- |\n",
280 | "| ➤ How is each of the three artificial haplotypes different from the others? Let's separate these artificial reads to the top of the track. Select Group alignments by, and choose read group. |
|\n",
281 | "\n",
282 | "Now we will color the reads differently. Select Color alignments by, choose tag, and type in HC. HaplotypeCaller labels reassembled reads that have unequivocal support for a haplotype (based on likelihood calculations) with an HC tag value that matches the HC tag value of the corresponding haplotype. \n",
283 | "\n",
284 | "
\n",
285 | "\n",
286 | "➤ Again, what do you think of HaplotypeCaller's choice to call the three-base insertion instead of the two-base insertion? \n",
287 | "\n",
288 | "Zoom out to see the three active regions within the scope of the interval we provided. We can see that HaplotypeCaller considered twelve, three, and six putative haplotypes, respectively, for the regions. "
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "# GVCF workflow"
296 | ]
297 | },
298 | {
299 | "cell_type": "markdown",
300 | "metadata": {},
301 | "source": [
302 | "## Run HaplotypeCaller on a single bam file in GVCF mode\n",
303 | "\n",
304 | "It is possible to genotype a multi-sample cohort simultaneously with HaplotypeCaller. However, this scales poorly. For a scalable analysis, GATK offers the GVCF workflow, which separates BAM-level variant calling from genotyping. In the GVCF workflow, HaplotypeCaller is run with the -ERC GVCF option on each individual BAM file and produces a GVCF, which adheres to VCF format specifications while giving information about the data at every genomic position. GenotypeGVCFs then genotypes the samples in a cohort via the given GVCFs.\n",
305 | "\n",
306 | "Run HaplotypeCaller in GVCF mode on the mother’s bam. This will produce a GVCF file that contains likelihoods for each possible genotype for the variant alleles, including a symbolic allele. You'll see what this looks like soon."
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {},
313 | "outputs": [],
314 | "source": [
315 | "! gatk HaplotypeCaller \\\n",
316 | " -R gs://gatk-tutorials/workshop_1903/2-germline/ref/ref.fasta \\\n",
317 | " -I gs://gatk-tutorials/workshop_1903/2-germline/bams/mother.bam \\\n",
318 | " -O /home/jupyter-user/2-germline-vd/sandbox/mother.g.vcf \\\n",
319 | " -ERC GVCF \\\n",
320 | " -L 20:10,000,000-10,200,000"
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": [
329 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
330 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": [
339 | "! echo gs://$BUCKET/sandbox/mother.g.vcf"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "In the interest of time, we have supplied the other sample GVCFs in the bundle, but normally you would run them individually in the same way as the first. \n",
347 | "\n",
348 | "Let's take a look at a GVCF in IGV. Start a new session to clear your IGV screen (File>New Session), then load the GVCF for each family member (gvcfs/mother.g.vcf, gvcfs/father.g.vcf, gvcfs/son.g.vcf). Zoom in on 20:10,002,371-10,002,546. You should see this:\n",
349 | "\n",
350 | "
"
351 | ]
352 | },
353 | {
354 | "cell_type": "markdown",
355 | "metadata": {},
356 | "source": [
357 | "Notice anything different from the VCF? Along with the colorful variant sites, you see many gray blocks in the GVCF representing reference confidence intervals. The gray blocks represent the blocks where the sample appears to be homozygous reference or invariant. The likelihoods are evaluated against an abstract non-reference allele and so these are referred to somewhat counterintuitively as NON_REF blocks of the GVCF. Each belongs to different contiguous quality GVCFBlock blocks. \n",
358 | "\n",
359 | "If we peek into the GVCF file, we actually see in the ALT column a symbolic allele, which represents non-called but possible non-reference alleles. Using the likelihoods against the allele we assign likelihoods to alleles that weren’t seen in the current sample during joint genotyping. Additionally, for NON_REF blocks, the INFO field gives the end position of the homozygous-reference block. The FORMAT field gives Phred-scaled likelihoods (PL) for each potential genotype given the alleles including the NON_REF allele.\n",
360 | "\n",
361 | "Later, the genotyping step will retain only sites that are confidently variant against the reference. \n"
362 | ]
363 | },
364 | {
365 | "cell_type": "markdown",
366 | "metadata": {},
367 | "source": [
368 | "## Consolidate GVCFs using GenomicsDBImport\n",
369 | "For the next step, we need to consolidate the GVCFs into a GenomicsDB datastore. That might sound complicated but it's actually very straightforward."
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": null,
375 | "metadata": {},
376 | "outputs": [],
377 | "source": [
378 | "! gatk GenomicsDBImport \\\n",
379 | " -V gs://gatk-tutorials/workshop_1903/2-germline/gvcfs/mother.g.vcf.gz \\\n",
380 | " -V gs://gatk-tutorials/workshop_1903/2-germline/gvcfs/father.g.vcf.gz \\\n",
381 | " -V gs://gatk-tutorials/workshop_1903/2-germline/gvcfs/son.g.vcf.gz \\\n",
382 | " --genomicsdb-workspace-path /home/jupyter-user/2-germline-vd/sandbox/trio \\\n",
383 | " --intervals 20:10,000,000-10,200,000"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "_Note: older versions of GenomicsDBImport accept only one interval at a time. Each interval can be at most a contig. To run on a full genome, we would need to define a set of intervals, and execute this command on each interval by itself. See this WDL script for an example pipelining solution. In GATK v4.0.6.0+, GenomicsDB can import multiple intervals per command._\n",
391 | "\n",
392 | "For those who cannot use GenomicDBImport, the alternative is to consolidate GVCFs with CombineGVCFs. Keep in mind though that the GenomicsDB intermediate allows you to scale analyses to large cohort sizes efficiently. Because it's not trivial to examine the data within the database, we will extract the trio's combined data from the GenomicsDB database using SelectVariants. "
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "# Create a soft link to sandbox\n",
402 | "! rm -r sandbox\n",
403 | "! ln -s /home/jupyter-user/2-germline-vd/sandbox/ sandbox"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "metadata": {},
410 | "outputs": [],
411 | "source": [
412 | "! gatk SelectVariants \\\n",
413 | " -R /home/jupyter-user/2-germline-vd/ref/ref.fasta \\\n",
414 | " -V gendb://sandbox/trio \\\n",
415 | " -O /home/jupyter-user/2-germline-vd/sandbox/trio_selectvariants.g.vcf"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "➤ Take a look inside the combined GVCF. How many samples are represented? What is going on with the genotype field (GT)? What does this genotype notation mean?"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {},
429 | "outputs": [],
430 | "source": [
431 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
432 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "metadata": {},
439 | "outputs": [],
440 | "source": [
441 | "! echo gs://$BUCKET/sandbox/trio_selectvariants.g.vcf"
442 | ]
443 | },
444 | {
445 | "cell_type": "markdown",
446 | "metadata": {},
447 | "source": [
448 | "## Run joint genotyping on the trio to generate the VCF\n",
449 | "The last step is to joint genotype variant sites for the samples using GenotypeGVCFs. "
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": null,
455 | "metadata": {},
456 | "outputs": [],
457 | "source": [
458 | "! gatk GenotypeGVCFs \\\n",
459 | " -R /home/jupyter-user/2-germline-vd/ref/ref.fasta \\\n",
460 | " -V gendb://sandbox/trio \\\n",
461 | " -O /home/jupyter-user/2-germline-vd/sandbox/trioGGVCF.vcf \\\n",
462 | " -L 20:10,000,000-10,200,000"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
472 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "The calls made by GenotypeGVCFs and HaplotypeCaller run in multisample mode should mostly be equivalent, especially as cohort sizes increase. However, there can be some marginal differences in borderline calls, i.e. low-quality variant sites, in particular for small cohorts with low coverage. For such cases, joint genotyping directly with HaplotypeCaller and/or using the new quality score model with GenotypeGVCFs (turned on with -new-qual) may be preferable.\n",
480 | "\n",
481 | "➤ What would the command to run HaplotypeCaller jointly on the three samples look like? How about the command that also produces a reassembled BAM and uses the new quality score model?\n",
482 | "\n",
483 | "```\n",
484 | "gatk HaplotypeCaller \\\n",
485 | " -R ref/ref.fasta \\\n",
486 | " -I bams/mother.bam \\\n",
487 | " -I bams/father.bam \\\n",
488 | " -I bams/son.bam \\\n",
489 | " -O sandbox/trio_hcjoint_nq.vcf \\\n",
490 | " -L 20:10,000,000-10,200,000 \\\n",
491 | " -new-qual \\\n",
492 | " -bamout sandbox/trio_hcjoint_nq.bam\n",
493 | "```\n",
494 | "\n",
495 | "In the interest of time, we do not run the above command. Note the BAMOUT will contain reassembled reads for all the input samples. \n",
496 | "\n",
497 | "Let's circle back to the locus we examined at the start. Load sandbox/trioGGVCF.vcf into IGV and navigate to 20:10,002,376-10,002,550.\n",
498 | "\n",
499 | "
"
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": null,
505 | "metadata": {},
506 | "outputs": [],
507 | "source": [
508 | "! echo gs://$BUCKET/sandbox/trioGGVCF.vcf"
509 | ]
510 | },
511 | {
512 | "cell_type": "markdown",
513 | "metadata": {},
514 | "source": [
515 | "➤ Focus on NA12877's (father) genotype call at 20:10002458. Knowing the familial relationship for the three samples and the child's homozygous-variant genotype, what do you think about the father's HOM_REF call?\n",
516 | "\n",
517 | "_Results from GATK v4.0.1.0 also show HOM_REF as well but give PLs (phred-scaled likelihoods) of 0,0,460. Changes in v4.0.9.0 improve hom-ref GQs near indels in GVCFs. The table shows this is an ambiguous site for other callers as well._\n",
518 | "\n",
519 | "| . | . |\n",
520 | "| --- | --- |\n",
521 | "|
|
|"
522 | ]
523 | },
524 | {
525 | "cell_type": "markdown",
526 | "metadata": {},
527 | "source": [
528 | "# GENOTYPE REFINEMENT"
529 | ]
530 | },
531 | {
532 | "cell_type": "markdown",
533 | "metadata": {},
534 | "source": [
535 | "## Refine the genotype calls with CalculateGenotypePosteriors\n",
536 | "We can systematically refine our calls for the trio using CalculateGenotypePosteriors. For starters, we can use pedigree information, which the tutorial provides in the trio.ped file. Second, we can use population priors. For priors we use a population allele frequencies resource derived from gnomAD."
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": null,
542 | "metadata": {},
543 | "outputs": [],
544 | "source": [
545 | "! gatk CalculateGenotypePosteriors \\\n",
546 | " -V /home/jupyter-user/2-germline-vd/sandbox/trioGGVCF.vcf \\\n",
547 | " -ped /home/jupyter-user/2-germline-vd/trio.ped \\\n",
548 | " --skip-population-priors \\\n",
549 | " -O /home/jupyter-user/2-germline-vd/sandbox/trioCGP.vcf"
550 | ]
551 | },
552 | {
553 | "cell_type": "code",
554 | "execution_count": null,
555 | "metadata": {},
556 | "outputs": [],
557 | "source": [
558 | "! gatk CalculateGenotypePosteriors \\\n",
559 | " -V /home/jupyter-user/2-germline-vd/sandbox/trioGGVCF.vcf \\\n",
560 | " -ped /home/jupyter-user/2-germline-vd/trio.ped \\\n",
561 | " --supporting-callsets /home/jupyter-user/2-germline-vd/resources/af-only-gnomad.chr20subset.b37.vcf.gz \\\n",
562 | " -O /home/jupyter-user/2-germline-vd/sandbox/trioCGP_gnomad.vcf"
563 | ]
564 | },
565 | {
566 | "cell_type": "code",
567 | "execution_count": null,
568 | "metadata": {},
569 | "outputs": [],
570 | "source": [
571 | "# copy files from your notebook sandbox to your workspace bucket sandbox\n",
572 | "! gsutil cp -a public-read /home/jupyter-user/2-germline-vd/sandbox/* gs://$BUCKET/sandbox"
573 | ]
574 | },
575 | {
576 | "cell_type": "code",
577 | "execution_count": null,
578 | "metadata": {},
579 | "outputs": [],
580 | "source": [
581 | "! echo gs://$BUCKET/sandbox/trioCGP.vcf\n",
582 | "! echo gs://$BUCKET/sandbox/trioCGP_gnomad.vcf"
583 | ]
584 | },
585 | {
586 | "cell_type": "markdown",
587 | "metadata": {},
588 | "source": [
589 | "Add both sandbox/trioCGP.vcf and sandbox/trioCGP_gnomad.vcf to the IGV session. \n",
590 | "\n",
591 | "| . | . |\n",
592 | "| --- | --- |\n",
593 | "| ➤ What has changed? What has not changed? |
|"
594 | ]
595 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "CalculateGenotypePosteriors adds three new FORMAT annotations–-PP, JL and JP. \n",
601 | "\n",
602 | "- Phred-scaled Posterior Probability (PP) basically refines the PL values. It incorporates the prior expectations for the given pedigree and/or population allele frequencies. \n",
603 | "- Joint Trio Likelihood (JL) is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect.\n",
604 | "- Joint Trio Posterior (JP) is the Phred-scaled posterior probability of the posterior genotypes for the three samples being incorrect.\n",
605 | "\n"
606 | ]
607 | },
608 | {
609 | "cell_type": "markdown",
610 | "metadata": {},
611 | "source": [
612 | "| . | . |\n",
613 | "| --- | --- |\n",
614 | "|
|
|\n",
615 | "\n",
616 | "You can learn more about the Genotype Refinement workflow in Article#11074 at . "
617 | ]
618 | },
619 | {
620 | "cell_type": "markdown",
621 | "metadata": {},
622 | "source": [
623 | "## Compare changes with CollectVariantCallingMetrics \n",
624 | "There are a few different GATK/Picard tools to compare site-level and genotype-level concordance that the Callset Evaluation presentation goes over. Here we perform a quick sanity-check on the refinements by comparing the number of GQ0 variants. The commands for the original callset and for that refined with the pedigree are below."
625 | ]
626 | },
627 | {
628 | "cell_type": "code",
629 | "execution_count": null,
630 | "metadata": {},
631 | "outputs": [],
632 | "source": [
633 | "! gatk CollectVariantCallingMetrics \\\n",
634 | " -I /home/jupyter-user/2-germline-vd/sandbox/trioGGVCF.vcf \\\n",
635 | " --DBSNP /home/jupyter-user/2-germline-vd/resources/dbsnp.vcf \\\n",
636 | " -O /home/jupyter-user/2-germline-vd/sandbox/trioGGVCF_metrics"
637 | ]
638 | },
639 | {
640 | "cell_type": "code",
641 | "execution_count": null,
642 | "metadata": {},
643 | "outputs": [],
644 | "source": [
645 | "! cat /home/jupyter-user/2-germline-vd/sandbox/trioGGVCF_metrics.variant_calling_summary_metrics"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "execution_count": null,
651 | "metadata": {},
652 | "outputs": [],
653 | "source": [
654 | "! gatk CollectVariantCallingMetrics \\\n",
655 | " -I /home/jupyter-user/2-germline-vd/sandbox/trioCGP.vcf \\\n",
656 | " --DBSNP /home/jupyter-user/2-germline-vd/resources/dbsnp.vcf \\\n",
657 | " -O /home/jupyter-user/2-germline-vd/sandbox/trioCGP_metrics"
658 | ]
659 | },
660 | {
661 | "cell_type": "code",
662 | "execution_count": null,
663 | "metadata": {},
664 | "outputs": [],
665 | "source": [
666 | "! cat /home/jupyter-user/2-germline-vd/sandbox/trioCGP_metrics.variant_calling_summary_metrics"
667 | ]
668 | },
669 | {
670 | "cell_type": "markdown",
671 | "metadata": {},
672 | "source": [
673 | "CollectVariantCallingMetrics produces both summary and detail metrics. The summary metrics provide cohort-level variant metrics, while the detail metrics segment the variant metrics for each sample in the callset. The detail metrics give the same metrics as the summary metrics plus the following five additional fields: sample alias, het to homvar ratio, percent GQ0 variants, total GQ0 variants, and total het depth. Metrics are explained at ."
674 | ]
675 | }
676 | ],
677 | "metadata": {
678 | "kernelspec": {
679 | "display_name": "Python 3",
680 | "language": "python",
681 | "name": "python3"
682 | },
683 | "language_info": {
684 | "codemirror_mode": {
685 | "name": "ipython",
686 | "version": 3
687 | },
688 | "file_extension": ".py",
689 | "mimetype": "text/x-python",
690 | "name": "python",
691 | "nbconvert_exporter": "python",
692 | "pygments_lexer": "ipython3",
693 | "version": "3.6.8"
694 | },
695 | "toc": {
696 | "base_numbering": 1,
697 | "nav_menu": {},
698 | "number_sections": true,
699 | "sideBar": true,
700 | "skip_h1_title": false,
701 | "title_cell": "Table of Contents",
702 | "title_sidebar": "Contents",
703 | "toc_cell": false,
704 | "toc_position": {},
705 | "toc_section_display": true,
706 | "toc_window_display": false
707 | }
708 | },
709 | "nbformat": 4,
710 | "nbformat_minor": 2
711 | }
712 |
--------------------------------------------------------------------------------
/notebooks/Day2-Germline/2-gatk-hard-filtering-tutorial-python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# GATK Tutorial | Hard Filtering | March 2019"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "\n",
15 | "This GATK tutorial corresponds to a section of the GATK Workshop _2b. Germline Hard Filtering Tutorial_ worksheet. The goal is to become familiar with germline variant annotations. The notebook illustrates the following steps. \n",
16 | "\n",
17 | "- Use GATK to stratify a variant callset against a truthset\n",
18 | "- Use R's ggplot2 package to plot the distribution of various annotation values\n",
19 | "- Hard-filter based on annotation thresholds and calculate concordance metrics \n",
20 | "\n",
21 | "### First, make sure the notebook is using a Python 3 kernel in the top right corner.\n",
22 | "A kernel is a _computational engine_ that executes the code in the notebook. We use Python 3 in this notebook to execute GATK commands using _Python Magic_ (`!`). Later we will switch to another notebook to do some plotting in R.\n",
23 | "\n",
24 | "### How to run this notebook:\n",
25 | "- **Click to select a gray cell and then pressing SHIFT+ENTER to run the cell.**\n",
26 | "\n",
27 | "- **Write results to `/home/jupyter-user/`. To access the directory, click on the upper-left jupyter icon.**\n",
28 | "\n",
29 | "### Enable reading Google bucket data "
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "# Check if data is accessible. The command should list several gs:// URLs.\n",
39 | "! gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/\n",
40 | "! gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed\n",
41 | "! gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. \n",
51 | "# Afterwards, restart the kernel with Kernel > Restart.\n",
52 | "#! pip install google-cloud-storage"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "---\n",
60 | "## 1. Subset variants to SNPs of a single sample with SelectVariants"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "Subset the trio callset to just the SNPs of the mother (sample NA12878). Make sure to remove sites for which the sample genotype is homozygous-reference and remove unused alleles, including spanning deletions. \n",
68 | "\n",
69 | "> The tool recalculates depth of coverage (DP) per site as well as the allele count in genotypes for each ALT allele (AC), allele frequency for each ALT allele (AF), and total number of alleles in called genotypes (AN), to reflect only the subset sample(s)."
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 1,
75 | "metadata": {
76 | "scrolled": true
77 | },
78 | "outputs": [
79 | {
80 | "name": "stdout",
81 | "output_type": "stream",
82 | "text": [
83 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
84 | "Running:\n",
85 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants -V gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/trio.vcf.gz -sn NA12878 -select-type SNP --exclude-non-variants --remove-unused-alternates -O /home/jupyter-user/motherSNP.vcf.gz\n",
86 | "05:34:16.167 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
87 | "05:34:16.395 INFO SelectVariants - ------------------------------------------------------------\n",
88 | "05:34:16.396 INFO SelectVariants - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
89 | "05:34:16.396 INFO SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
90 | "05:34:16.396 INFO SelectVariants - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
91 | "05:34:16.396 INFO SelectVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
92 | "05:34:16.397 INFO SelectVariants - Start Date/Time: March 21, 2019 5:34:16 AM UTC\n",
93 | "05:34:16.397 INFO SelectVariants - ------------------------------------------------------------\n",
94 | "05:34:16.397 INFO SelectVariants - ------------------------------------------------------------\n",
95 | "05:34:16.398 INFO SelectVariants - HTSJDK Version: 2.18.2\n",
96 | "05:34:16.398 INFO SelectVariants - Picard Version: 2.18.25\n",
97 | "05:34:16.399 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
98 | "05:34:16.399 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
99 | "05:34:16.399 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
100 | "05:34:16.399 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
101 | "05:34:16.399 INFO SelectVariants - Deflater: IntelDeflater\n",
102 | "05:34:16.400 INFO SelectVariants - Inflater: IntelInflater\n",
103 | "05:34:16.400 INFO SelectVariants - GCS max retries/reopens: 20\n",
104 | "05:34:16.400 INFO SelectVariants - Requester pays: disabled\n",
105 | "05:34:16.400 INFO SelectVariants - Initializing engine\n",
106 | "05:34:21.549 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/trio.vcf.gz\n",
107 | "05:34:24.616 INFO SelectVariants - Done initializing engine\n",
108 | "05:34:24.683 INFO SelectVariants - Including sample 'NA12878'\n",
109 | "05:34:24.709 INFO ProgressMeter - Starting traversal\n",
110 | "05:34:24.710 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute\n",
111 | "05:34:34.337 INFO ProgressMeter - 20:62910954 0.2 122091 760928.6\n",
112 | "05:34:34.338 INFO ProgressMeter - Traversal complete. Processed 122091 total variants in 0.2 minutes.\n",
113 | "05:34:34.354 INFO SelectVariants - Shutting down engine\n",
114 | "[March 21, 2019 5:34:34 AM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.30 minutes.\n",
115 | "Runtime.totalMemory()=935854080\n"
116 | ]
117 | }
118 | ],
119 | "source": [
120 | "! gatk SelectVariants \\\n",
121 | "-V gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/trio.vcf.gz \\\n",
122 | "-sn NA12878 \\\n",
123 | "-select-type SNP \\\n",
124 | "--exclude-non-variants \\\n",
125 | "--remove-unused-alternates \\\n",
126 | "-O /home/jupyter-user/motherSNP.vcf.gz"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 2,
132 | "metadata": {
133 | "collapsed": true
134 | },
135 | "outputs": [
136 | {
137 | "name": "stdout",
138 | "output_type": "stream",
139 | "text": [
140 | "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\r\n",
141 | "20\t61098\t.\tC\tT\t465.13\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=0.516;ClippingRankSum=0.00;DP=44;ExcessHet=3.0103;FS=0.000;MQ=59.48;MQRankSum=0.803;QD=10.57;ReadPosRankSum=1.54;SOR=0.603\tGT:AD:DP:GQ:PL\t0/1:28,16:44:99:496,0,938\r\n",
142 | "20\t61795\t.\tG\tT\t2034.16\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=-6.330e-01;ClippingRankSum=0.00;DP=60;ExcessHet=3.9794;FS=0.000;MQ=59.81;MQRankSum=0.00;QD=17.09;ReadPosRankSum=1.23;SOR=0.723\tGT:AD:DP:GQ:PL\t0/1:30,30:60:99:1003,0,1027\r\n",
143 | "20\t63244\t.\tA\tC\t923.13\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=0.637;ClippingRankSum=0.00;DP=57;ExcessHet=3.0103;FS=5.470;MQ=59.60;MQRankSum=-1.019e+00;QD=16.20;ReadPosRankSum=0.404;SOR=1.528\tGT:AD:DP:GQ:PL\t0/1:30,27:57:99:954,0,1064\r\n",
144 | "20\t63799\t.\tC\tT\t1766.16\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=-6.530e-01;ClippingRankSum=0.00;DP=45;ExcessHet=3.9794;FS=0.000;MQ=59.78;MQRankSum=1.22;QD=16.98;ReadPosRankSum=-1.075e+00;SOR=0.709\tGT:AD:DP:GQ:PL\t0/1:19,26:45:99:953,0,670\r\n",
145 | "20\t65900\t.\tG\tA\t5817.13\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=0.503;ClippingRankSum=0.00;DP=64;ExcessHet=3.0103;FS=4.289;MQ=59.65;MQRankSum=0.00;QD=31.61;ReadPosRankSum=0.732;SOR=1.032\tGT:AD:DP:GQ:PL\t0/1:41,23:64:99:809,0,1596\r\n",
146 | "20\t66370\t.\tG\tA\t5611.13\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=1.26;ClippingRankSum=0.00;DP=52;ExcessHet=3.0103;FS=6.196;MQ=60.00;MQRankSum=0.00;QD=33.01;ReadPosRankSum=-9.100e-02;SOR=0.647\tGT:AD:DP:GQ:PL\t0/1:31,21:52:99:716,0,1103\r\n",
147 | "20\t66720\t.\tC\tA\t2204.16\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=0.663;ClippingRankSum=0.00;DP=59;ExcessHet=3.9794;FS=12.193;MQ=60.00;MQRankSum=0.00;QD=15.86;ReadPosRankSum=-1.250e-01;SOR=1.219\tGT:AD:DP:GQ:PL\t0/1:31,28:59:99:948,0,1123\r\n",
148 | "20\t68749\t.\tT\tC\t4285.16\t.\tAC=2;AF=1.00;AN=2;BaseQRankSum=0.613;ClippingRankSum=0.00;DP=52;ExcessHet=3.9794;FS=2.137;MQ=59.86;MQRankSum=0.00;QD=26.29;ReadPosRankSum=1.24;SOR=0.924\tGT:AD:DP:GQ:PL\t1/1:0,52:52:99:2214,157,0\r\n",
149 | "20\t70980\t.\tG\tA\t672.13\t.\tAC=1;AF=0.500;AN=2;BaseQRankSum=-1.800e-01;ClippingRankSum=0.00;DP=53;ExcessHet=3.0103;FS=1.042;MQ=59.65;MQRankSum=-1.196e+00;QD=12.68;ReadPosRankSum=0.444;SOR=0.848\tGT:AD:DP:GQ:PL\t0/1:32,21:53:99:703,0,1198\r\n",
150 | "grep: write error: Broken pipe\r\n",
151 | "\r\n",
152 | "gzip: stdout: Broken pipe\r\n"
153 | ]
154 | }
155 | ],
156 | "source": [
157 | "# Peruse the resulting file \n",
158 | "! zcat /home/jupyter-user/motherSNP.vcf.gz | grep -v '##' | head"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "---\n",
166 | "## 2. Annotate intersecting true positives with VariantAnnotator"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "We use VariantAnnotator to annotate which variants in our callset are also present in the truthset (GIAB), which are considered true positives. Variants not present in the truthset are considered false positives. Here we produce a callset where variants that are present in the truthset are annotated with the giab.callsets annotation plus a value indicating how many of the callsets used to develop the truthset agreed with that call."
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {},
180 | "outputs": [],
181 | "source": [
182 | "! gatk VariantAnnotator \\\n",
183 | "-V /home/jupyter-user/motherSNP.vcf.gz \\\n",
184 | "--resource:giab gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/motherGIABsnps.vcf.gz \\\n",
185 | "-E giab.callsets \\\n",
186 | "-O /home/jupyter-user/motherSNP.giab.vcf.gz"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "# Peruse the resulting file \n",
196 | "! zcat /home/jupyter-user/motherSNP.giab.vcf.gz | grep -v '##' | head"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "---\n",
204 | "## 3. Tabulate annotations of interest with VariantsToTable"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | "Convert the information from the callset into a tab delimited table using VariantsToTable, so that we can parse it easily in R. The tool parameters differentiate INFO/site-level fields fields (`-F`) and FORMAT/sample-level fields genotype fields (`-GF`). This step produces a table where each line represents a variant record from the VCF, and each column represents an annotation we have specified. Wherever the requested annotations are not present, e.g. RankSum annotations at homozygous sites, the value will be replaced by NA. "
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "! gatk VariantsToTable \\\n",
221 | "-V /home/jupyter-user/motherSNP.giab.vcf.gz \\\n",
222 | "-F CHROM -F POS -F QUAL \\\n",
223 | "-F BaseQRankSum -F MQRankSum -F ReadPosRankSum \\\n",
224 | "-F DP -F FS -F MQ -F QD -F SOR \\\n",
225 | "-F giab.callsets \\\n",
226 | "-GF GQ \\\n",
227 | "-O /home/jupyter-user/motherSNP.giab.txt"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "# Peruse the resulting file\n",
237 | "! cat /home/jupyter-user/motherSNP.giab.txt | head -n300"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": null,
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "# Focus in on a few columns\n",
247 | "! cat /home/jupyter-user/motherSNP.giab.txt | cut -f1,2,7,12 | head -n300"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "\n",
255 | "---\n",
256 | "## 4. Make density and scatter plots in R and determine filtering thresholds"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "Load the R notebook now to run the plots for this next section. Continue below only after you've finished with the other notebook.\n"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "---\n",
271 | "## 5. Apply filters with VariantFiltration and evaluate results"
272 | ]
273 | },
274 | {
275 | "cell_type": "markdown",
276 | "metadata": {},
277 | "source": [
278 | "### A. Filter on QUAL and tabulate baseline concordance\n",
279 | "\n",
280 | "Based on the plots we generated, we're going to apply some filters to weed out false positives. To illustrate how VariantFiltration works, and to establish baseline performance, we first filter on QUAL < 30. By default, GATK GenotypeGVCFs filters out variants with QUAL < 10. This step produces a VCF with all the original variants; those that failed the filter are annotated with the filter name in the FILTER column.\n"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {},
287 | "outputs": [],
288 | "source": [
289 | "# Filter callset on one annotation, QUAL < 30\n",
290 | "! gatk VariantFiltration \\\n",
291 | "-R gs://gatk-tutorials/workshop_1702/variant_discovery/data/ref/ref.fasta \\\n",
292 | "-V /home/jupyter-user/motherSNP.vcf.gz \\\n",
293 | "--filter-expression \"QUAL < 30\" \\\n",
294 | "--filter-name \"qual30\" \\\n",
295 | "-O /home/jupyter-user/motherSNPqual30.vcf.gz"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "# Peruse the results; try adding 'grep \"qual30\"'\n",
305 | "! zcat /home/jupyter-user/motherSNPqual30.vcf.gz | grep -v '##' | head -n10"
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "execution_count": null,
311 | "metadata": {},
312 | "outputs": [],
313 | "source": [
314 | "# Calculate concordance metrics using GATK4 BETA tool Concordance\n",
315 | "! gatk Concordance \\\n",
316 | "-eval /home/jupyter-user/motherSNPqual30.vcf.gz \\\n",
317 | "-truth gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/motherGIABsnps.vcf.gz \\\n",
318 | "-L gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed \\\n",
319 | "-S /home/jupyter-user/motherSNPqual30.txt"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {},
326 | "outputs": [],
327 | "source": [
328 | "# View the results\n",
329 | "! echo \"\"\n",
330 | "! cat /home/jupyter-user/motherSNPqual30.txt"
331 | ]
332 | },
333 | {
334 | "cell_type": "markdown",
335 | "metadata": {},
336 | "source": [
337 | "### B. Filter on multiple annotations simultaneously using VariantFiltration"
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "To filter on multiple expressions, provide each in separate expression. For INFO level annotations, the parameter is `-filter`, which should be immediately followed by the corresponding `–-filter-name` label. Here we show basic hard-filtering thresholds.\n",
345 | "\n",
346 | "- If an annotation is missing, VariantFiltration skips any judgement on that annotation. To conservatively fail such missing annotation sites, set the `--missing-values-evaluate-as-failing` flag. \n",
347 | "- To filter based on FORMAT level annotations, use `--genotype-filter-expression` and `--genotype-filter-name`. "
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": null,
353 | "metadata": {},
354 | "outputs": [],
355 | "source": [
356 | "# Filter callset on multiple annotations.\n",
357 | "# Iterate on thresholds to improve precision while maintaining high sensitivity.\n",
358 | "! gatk VariantFiltration \\\n",
359 | "-V /home/jupyter-user/motherSNP.vcf.gz \\\n",
360 | "-filter \"QD < 2.0\" --filter-name \"QD2\" \\\n",
361 | "-filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \\\n",
362 | "-filter \"SOR > 3.0\" --filter-name \"SOR3\" \\\n",
363 | "-filter \"FS > 60.0\" --filter-name \"FS60\" \\\n",
364 | "-filter \"MQ < 40.0\" --filter-name \"MQ40\" \\\n",
365 | "-filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \\\n",
366 | "-filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \\\n",
367 | "-O /home/jupyter-user/motherSNPfilters.vcf.gz"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": null,
373 | "metadata": {},
374 | "outputs": [],
375 | "source": [
376 | "# Sanity-check that filtering is as expected by examining filtered records and PASS records.\n",
377 | "! zcat /home/jupyter-user/motherSNPfilters.vcf.gz | grep -v '##' | grep -v 'PASS' | head -n20 | cut -f6-10\n",
378 | "! zcat /home/jupyter-user/motherSNPfilters.vcf.gz | grep -v '#' | grep 'PASS' | head | cut -f6-10"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": null,
384 | "metadata": {},
385 | "outputs": [],
386 | "source": [
387 | "# Calculate concordance metrics using GATK4 BETA tool Concordance\n",
388 | "! gatk Concordance \\\n",
389 | "-eval /home/jupyter-user/motherSNPfilters.vcf.gz \\\n",
390 | "-truth gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/motherGIABsnps.vcf.gz \\\n",
391 | "-L gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed \\\n",
392 | "-S /home/jupyter-user/motherSNPfilters.txt\n",
393 | " "
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {},
400 | "outputs": [],
401 | "source": [
402 | "#Now lets re-run concordance from just using QUAL filtering first\n",
403 | "!cat /home/jupyter-user/motherSNPqual30.txt"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "metadata": {},
410 | "outputs": [],
411 | "source": [
412 | "# View the results from filtering on multiple annotations\n",
413 | "! echo \"\"\n",
414 | "! cat /home/jupyter-user/motherSNPfilters.txt"
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "---\n",
422 | "\n",
423 | "We performed hard-filtering to learn about germline variant annotations. Remember that GATK recommends _Variant Quality Score Recalibration_ (VQSR) for germline variant callset filtering. For more complex variant filtering and annotation, see the Broad [Hail.is](https://hail.is/index.html) framework. "
424 | ]
425 | }
426 | ],
427 | "metadata": {
428 | "kernelspec": {
429 | "display_name": "Python 3",
430 | "language": "python",
431 | "name": "python3"
432 | },
433 | "language_info": {
434 | "codemirror_mode": {
435 | "name": "ipython",
436 | "version": 3
437 | },
438 | "file_extension": ".py",
439 | "mimetype": "text/x-python",
440 | "name": "python",
441 | "nbconvert_exporter": "python",
442 | "pygments_lexer": "ipython3",
443 | "version": "3.6.8"
444 | },
445 | "toc": {
446 | "base_numbering": 1,
447 | "nav_menu": {},
448 | "number_sections": true,
449 | "sideBar": true,
450 | "skip_h1_title": false,
451 | "title_cell": "Table of Contents",
452 | "title_sidebar": "Contents",
453 | "toc_cell": false,
454 | "toc_position": {
455 | "height": "calc(100% - 180px)",
456 | "left": "10px",
457 | "top": "150px",
458 | "width": "305px"
459 | },
460 | "toc_section_display": true,
461 | "toc_window_display": true
462 | }
463 | },
464 | "nbformat": 4,
465 | "nbformat_minor": 2
466 | }
467 |
--------------------------------------------------------------------------------
/notebooks/Day2-Germline/3-gatk-hard-filtering-tutorial-r-plotting.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
8 | {
9 | "cell_type": "markdown",
10 | "metadata": {},
11 | "source": [
12 | "# GATK Tutorial | Hard Filtering | March 2019\n",
13 | "\n",
14 | "This GATK tutorial corresponds to a section of the GATK Workshop _2b. Germline Hard Filtering Tutorial_ worksheet available. The goal is to become familiar with germline variant annotations. The notebook and its paired Python notebook illustrate the following steps. \n",
15 | "\n",
16 | "- Use GATK to stratify a variant callset against a truthset\n",
17 | "- Use R's ggplot2 package to plot the distribution of various annotation values\n",
18 | "- Hard-filter based on annotation thresholds and calculate concordance metrics \n",
19 | "\n",
20 | "### First, make sure the notebook is using an R kernel in the top right corner.\n",
21 | "A kernel is a _computational engine_ that executes the code in the notebook. We use this notebook to make plots using R. \n",
22 | "\n",
23 | "### How to run this notebook:\n",
24 | "- **Click to select a gray cell and then pressing SHIFT+ENTER to run the cell.**\n",
25 | "\n",
26 | "- **Write results to `/home/jupyter-user/`. To access the directory, click on the upper-left jupyter icon.**\n",
27 | "\n",
28 | "### Enable reading Google bucket data "
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "# Check if data is accessible. The command should list several gs:// URLs.\n",
38 | "system(\"gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/resources/\", intern=TRUE)\n",
39 | "system(\"gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/intervals/motherHighconf.bed\", intern=TRUE)\n",
40 | "system(\"gsutil ls gs://gatk-tutorials/workshop_1702/variant_discovery/data/inputVcfs/\", intern=TRUE)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": null,
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. \n",
50 | "# Afterwards, restart the kernel with Kernel > Restart.\n",
51 | "#system(\"pip install google-cloud-storage\", intern=TRUE)"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "---\n",
59 | "## 4. Make density and scatter plots in R and determine filtering thresholds"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "Make sure your kernal is set to R. Go to the menubar and select _Kernel > Change Kernel > R_.\n",
67 | "\n",
68 | "Plotting the density of values for an annotation shows us to see the overall range and distribution of values observed in a callset. In combination with some basic knowledge of what the annotation represents and how it is calculated, this allows us to make a first estimation of value thresholds that segregate FPs from TPs. Plotting the scatter of values for two annotations, one against the other, additionally shows us what tradeoffs we make when setting a threshold on annotation values individually. "
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "\n",
76 | "\n",
77 | "### A. Load R libraries, plotting functions and data"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "Don't worry if you don't know how to read the R script below. Also, you can ignore the red boxes that appear, e.g. stating `as ‘lib’ is unspecified`. "
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "# plotting.R script loads ggplot and gridExtra libraries and defines functions to plot variant annotations \n",
94 | "\n",
95 | "library(ggplot2)\n",
96 | "install.packages(\"gridExtra\")\n",
97 | "library(gridExtra)\n",
98 | "\n",
99 | "get_legend<-function(myggplot){\n",
100 | " tmp <- ggplot_gtable(ggplot_build(myggplot))\n",
101 | " leg <- which(sapply(tmp$grobs, function(x) x$name) == \"guide-box\")\n",
102 | " legend <- tmp$grobs[[leg]]\n",
103 | " return(legend)\n",
104 | "}\n",
105 | "\n",
106 | "\n",
107 | "# Function for making density plots of a single annotation\n",
108 | "makeDensityPlot <- function(dataframe, xvar, split, xmin=min(dataframe[xvar], na.rm=TRUE), xmax=max(dataframe[xvar], na.rm=TRUE), alpha=0.5) {\n",
109 | " \n",
110 | " if(missing(split)) {\n",
111 | " return(ggplot(data=dataframe, aes_string(x=xvar)) + xlim(xmin,xmax) + geom_density() )\n",
112 | " }\n",
113 | " else {\n",
114 | " return(ggplot(data=dataframe, aes_string(x=xvar, fill=split)) + xlim(xmin,xmax) + geom_density(alpha=alpha) )\n",
115 | " }\n",
116 | "}\n",
117 | "\n",
118 | "# Function for making scatter plots of two annotations\n",
119 | "makeScatterPlot <- function(dataframe, xvar, yvar, split, xmin=min(dataframe[xvar], na.rm=TRUE), xmax=max(dataframe[xvar], na.rm=TRUE), ymin=min(dataframe[yvar], na.rm=TRUE), ymax=max(dataframe[yvar], na.rm=TRUE), ptSize=1, alpha=0.6) {\n",
120 | " if(missing(split)) {\n",
121 | " return(ggplot(data=dataframe) + aes_string(x=xvar, y=yvar) + xlim(xmin,xmax) + ylim(ymin,ymax) + geom_point(size=ptSize, alpha=alpha) )\n",
122 | " }\n",
123 | " else {\n",
124 | " return(ggplot(data=dataframe) + aes_string(x=xvar, y=yvar) + aes_string(color=split) + xlim(xmin,xmax) + ylim(ymin,ymax) + geom_point(size=ptSize, alpha=alpha) )\n",
125 | " }\n",
126 | "}\n",
127 | "\n",
128 | "# Function for making scatter plots of two annotations with marginal density plots of each\n",
129 | "makeScatterPlotWithMarginalDensity <- function(dataframe, xvar, yvar, split, xmin=min(dataframe[xvar], na.rm=TRUE), xmax=max(dataframe[xvar], na.rm=TRUE), ymin=min(dataframe[yvar], na.rm=TRUE), ymax=max(dataframe[yvar], na.rm=TRUE), ptSize=1, ptAlpha=0.6, fillAlpha=0.5) {\n",
130 | " empty <- ggplot()+geom_point(aes(1,1), colour=\"white\") +\n",
131 | " theme(\n",
132 | " plot.background = element_blank(), \n",
133 | " panel.grid.major = element_blank(), \n",
134 | " panel.grid.minor = element_blank(), \n",
135 | " panel.border = element_blank(), \n",
136 | " panel.background = element_blank(),\n",
137 | " axis.title.x = element_blank(),\n",
138 | " axis.title.y = element_blank(),\n",
139 | " axis.text.x = element_blank(),\n",
140 | " axis.text.y = element_blank(),\n",
141 | " axis.ticks = element_blank()\n",
142 | " )\n",
143 | " \n",
144 | " if(missing(split)){\n",
145 | " scatter <- ggplot(data=dataframe) + aes_string(x=xvar, y=yvar) + geom_point(size=ptSize, alpha=ptAlpha) + xlim(xmin,xmax) + ylim(ymin,ymax) \n",
146 | " plot_top <- ggplot(data=dataframe, aes_string(x=xvar)) + geom_density(alpha=fillAlpha) + theme(legend.position=\"none\") + xlim(xmin,xmax) \n",
147 | " plot_right <- ggplot(data=dataframe, aes_string(x=yvar)) + geom_density(alpha=fillAlpha) + coord_flip() + theme(legend.position=\"none\") + xlim(ymin,ymax) \n",
148 | " } \n",
149 | " else{\n",
150 | " scatter <- ggplot(data=dataframe) + aes_string(x=xvar, y=yvar) + geom_point(size=ptSize, alpha=ptAlpha, aes_string(color=split)) + xlim(xmin,xmax) + ylim(ymin,ymax) \n",
151 | " plot_top <- ggplot(data=dataframe, aes_string(x=xvar, fill=split)) + geom_density(alpha=fillAlpha) + theme(legend.position=\"none\") + xlim(xmin,xmax) \n",
152 | " plot_right <- ggplot(data=dataframe, aes_string(x=yvar, fill=split)) + geom_density(alpha=fillAlpha) + coord_flip() + theme(legend.position=\"none\") + xlim(ymin,ymax) \n",
153 | " }\n",
154 | " legend <- get_legend(scatter)\n",
155 | " scatter <- scatter + theme(legend.position=\"none\")\n",
156 | " temp <- grid.arrange(plot_top, legend, scatter, plot_right, ncol=2, nrow=2, widths=c(4,1), heights=c(1,4))\n",
157 | " return(temp)\n",
158 | "}"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "# Call the readr library and use its read_delim function to load motherSNP.giab.txt into the motherSNP.giab object.\n",
168 | "library(readr)\n",
169 | "motherSNP.giab <- read_delim(\"/home/jupyter-user/motherSNP.giab.txt\",\"\\t\", \n",
170 | " escape_double = FALSE, col_types = cols(giab.callsets = col_character()), trim_ws = TRUE)"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 | "# Rename the 'giab.callsets' column to 'set'.\n",
180 | "names(motherSNP.giab)[names(motherSNP.giab) == 'giab.callsets'] <- 'set'"
181 | ]
182 | },
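183 |     {
184 |      "cell_type": "markdown",
185 |      "metadata": {},
186 |      "source": [
187 |       "Optional sanity check (a sketch, not part of the original worksheet): tally how many variants fall into each truthset category before plotting."
188 |      ]
189 |     },
190 |     {
191 |      "cell_type": "code",
192 |      "execution_count": null,
193 |      "metadata": {},
194 |      "outputs": [],
195 |      "source": [
196 |       "# Count variants per value of the renamed 'set' column\n",
197 |       "table(motherSNP.giab$set)"
198 |      ]
199 |     },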
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "---\n",
188 | "\n",
189 | "For reference, here are some basic filtering thresholds to improve upon.\n",
190 | "\n",
191 | "- -filter \"QD < 2.0\"\n",
192 | "- -filter \"QUAL < 30.0\"\n",
193 | "- -filter \"SOR > 3.0\"\n",
194 | "- -filter \"FS > 60.0\"\n",
195 | "- -filter \"MQ < 40.0\"\n",
196 | "- -filter \"MQRankSum < -12.5 \n",
197 | "- -filter \"ReadPosRankSum < -8.0\"\n",
198 | "\n",
199 | "---"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "### B. Make a density plot for QUAL with the `makeDensityPlot` function\n",
207 | "\n",
208 | "Iteratively improve the plot by modifying `qual`. Here are some suggestions to start.\n",
209 | "- B = makeDensityPlot(motherSNP.giab, \"QUAL\")\n",
210 | "- B = makeDensityPlot(motherSNP.giab, \"QUAL\", xmax=10000)\n",
211 | "- B = makeDensityPlot(motherSNP.giab, \"QUAL\", xmax=10000, split=\"set\")\n",
212 | "\n",
213 | "> _How does the density distribution relate to what the annotation represents? Can we find some clues of what might distinguish good vs. bad variants?_\n",
214 | "> _When we plot the split version, can we see a clear difference between the set distributions? What does that tell us?_"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": null,
220 | "metadata": {},
221 | "outputs": [],
222 | "source": [
223 | "# B = makeDensityPlot(motherSNP.giab, \"QUAL\")\n",
224 | "B = makeDensityPlot(motherSNP.giab, \"QUAL\")\n"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "metadata": {
231 | "scrolled": false
232 | },
233 | "outputs": [],
234 | "source": [
235 | "# Plot 'B'\n",
236 | "B"
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "### C. Make a QD (QualByDepth) density plot"
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "QD puts the variant confidence QUAL score into perspective by normalizing for the amount of coverage available. Because each read contributes a little to the QUAL score, variants in regions with deep coverage can have artificially inflated QUAL scores, giving the impression that the call is supported by more evidence than it really is. To compensate for this, we normalize the variant confidence by depth, which gives us a more objective picture of how well supported the call is.\n",
251 | "\n",
252 | "> _What do the peaks represent?_"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "# C = makeDensityPlot(motherSNP.giab, \"QD\")\n",
262 | "# Change up the parameters, e.g. add 'split=\"set\"', examine RankSums, FS and SOR\n",
263 | "C = makeDensityPlot(motherSNP.giab, \n",
264 | " \"QD\")"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "C"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "### D. Make a scatterplot of QD vs. DP using the `makeScatterPlot` function"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "The DP (depth) here refers to the unfiltered count of reads at the site level (INFO). An identically named annotation exists at the sample level (FORMAT) that refers to the count of reads that passed the caller's internal quality control metrics for the sample. \n",
288 | "\n",
289 | "> What is the relationship between DP and QUAL? How does high-depth correlate with true positives?"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {},
296 | "outputs": [],
297 | "source": [
298 | "# D = makeScatterPlot(motherSNP.giab, \"QD\", \"DP\", split=\"set\")\n",
299 | "# Play with the axis limits to zoom in on subsets of the data, e.g. by adding ymax=1000.\n",
300 | "D = makeScatterPlot(motherSNP.giab, \n",
301 | " \"QD\", \"DP\")"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {
308 | "scrolled": false
309 | },
310 | "outputs": [],
311 | "source": [
312 | "D"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "### E. Make a scatterplot winged by marginal density plots\n",
320 | "\n",
321 | "The `makeScatterPlotWithMarginalDensity` function defines and plots. The `ptAlpha` parameter changes the transparency of the points. \n",
322 | "\n",
323 | "> _When plotting two annotations, does the combination of the two tell us anything more than either did separately?_\n",
324 | "\n",
325 | "- Try adjusting the parameters.\n",
326 | "- Substitute in other annotations. For example, the following recreates the plot on the front page of the tutorial worksheet.\n",
327 | "\n",
328 | "```\n",
329 | "F = makeScatterPlotWithMarginalDensity(motherSNP.giab, \"QUAL\", \"DP\", split=\"set\", xmax=10000, ymax=100, ptSize=0.5, ptAlpha=0.05)\n",
330 | "```\n",
331 | "\n",
332 | "\n"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": null,
338 | "metadata": {},
339 | "outputs": [],
340 | "source": [
341 | "# E = makeScatterPlotWithMarginalDensity(motherSNP.giab, \"QD\", \"DP\", split=\"set\", ymax=250, ptSize=0.5, ptAlpha=0.2)\n",
342 | "E = makeScatterPlotWithMarginalDensity(motherSNP.giab, \n",
343 | " \"QD\", \"DP\", \n",
344 | " split=\"set\", \n",
345 | " ymax=250, \n",
346 | " ptSize=0.5, ptAlpha=0.2)"
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": null,
352 | "metadata": {},
353 | "outputs": [],
354 | "source": [
355 | "# Blank cell for free use. Add additional cells with Menu > Insert.\n",
356 | "# Change the cell type with Cell > Cell Type.\n",
357 | "# Delete a cell with Edit > Delete Cells."
358 | ]
359 | },
360 | {
361 | "cell_type": "markdown",
362 | "metadata": {},
363 | "source": [
364 | "---\n",
365 | "## 5. Apply filters with VariantFiltration and evaluate results"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "metadata": {},
371 | "source": [
372 | "Go back to the Python notebook now to continue your work."
373 | ]
374 | }
375 | ],
376 | "metadata": {
377 | "kernelspec": {
378 | "display_name": "R",
379 | "language": "R",
380 | "name": "ir"
381 | },
382 | "language_info": {
383 | "codemirror_mode": "r",
384 | "file_extension": ".r",
385 | "mimetype": "text/x-r-source",
386 | "name": "R",
387 | "pygments_lexer": "r",
388 | "version": "3.5.2"
389 | },
390 | "toc": {
391 | "base_numbering": 1,
392 | "nav_menu": {},
393 | "number_sections": true,
394 | "sideBar": true,
395 | "skip_h1_title": false,
396 | "title_cell": "Table of Contents",
397 | "title_sidebar": "Contents",
398 | "toc_cell": false,
399 | "toc_position": {
400 | "height": "669.333px",
401 | "left": "70px",
402 | "top": "281.667px",
403 | "width": "176px"
404 | },
405 | "toc_section_display": true,
406 | "toc_window_display": true
407 | }
408 | },
409 | "nbformat": 4,
410 | "nbformat_minor": 2
411 | }
412 |
--------------------------------------------------------------------------------
/notebooks/Day2-Germline/4-gatk-cnn-tutorial-python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# GATK Tutorial | Convolutional Neural Network (CNN) Filtering | March 2019"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "\n",
15 | "This GATK tutorial corresponds to a section of the GATK Workshop _2b. Germline Convolutional Neural Network (CNN) Filtering Tutorial_ worksheet. The goal is to become familiar with using Convolutional Neural Net to filter annotated variants. The notebook illustrates the following steps. \n",
16 | "\n",
17 | "- Use GATK to annotate a VCF with scores from a Convolutional Neural Network (CNN)\n",
18 | "- Generate 1D and 2D CNN models\n",
19 | "- Apply tranche filtering to VCF based on scores from an annotation in the INFO field \n",
20 | "- Calculate concordance metrics\n",
21 | "\n",
22 | "### First, make sure the notebook is using a Python 3 kernel in the top right corner.\n",
23 | "A kernel is a _computational engine_ that executes the code in the notebook. We use Python 3 in this notebook to execute GATK commands using _Python Magic_ (`!`). Later we will switch to another notebook to do some plotting in R.\n",
24 | "\n",
25 | "### How to run this notebook:\n",
26 | "- **Click to select a gray cell and then pressing SHIFT+ENTER to run the cell.**\n",
27 | "\n",
28 | "- **Write results to `/home/jupyter-user/CNN/Output/`. To access the directory, click on the upper-left jupyter icon.**\n",
29 | "\n",
30 | "### Enable reading Google bucket data "
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 1,
36 | "metadata": {},
37 | "outputs": [
38 | {
39 | "name": "stdout",
40 | "output_type": "stream",
41 | "text": [
42 | "gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/\n",
43 | "gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/bams/\n",
44 | "gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/references/\n",
45 | "gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/\n",
46 | "gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz\n",
47 | "gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz.tbi\n",
48 | "gs://gcp-public-data--broad-references/hg19/v0/1000G_phase1.snps.high_confidence.b37.vcf.gz\n",
49 | "gs://gcp-public-data--broad-references/hg19/v0/1000G_phase1.snps.high_confidence.b37.vcf.gz.tbi\n",
50 | "gs://gcp-public-data--broad-references/hg19/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz\n",
51 | "gs://gcp-public-data--broad-references/hg19/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz.tbi\n",
52 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.cdna.all.fa\n",
53 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.cds.all.fa\n",
54 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.cloud_references.json\n",
55 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.contam.UD\n",
56 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.contam.V\n",
57 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.contam.bed\n",
58 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.contam.mu\n",
59 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp135.vcf\n",
60 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp135.vcf.idx\n",
61 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp138.vcf\n",
62 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dbsnp138.vcf.idx\n",
63 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard0.tsv\n",
64 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard1.tsv\n",
65 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard10.tsv\n",
66 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard11.tsv\n",
67 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard2.tsv\n",
68 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard3.tsv\n",
69 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard4.tsv\n",
70 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard5.tsv\n",
71 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard6.tsv\n",
72 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard7.tsv\n",
73 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard8.tsv\n",
74 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.delly_exclusionRegions.shard9.tsv\n",
75 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.dict\n",
76 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta\n",
77 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.amb\n",
78 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.ann\n",
79 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.bwt\n",
80 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.pac\n",
81 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.sa\n",
82 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.alt\n",
83 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.amb\n",
84 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.ann\n",
85 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.bwt\n",
86 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai\n",
87 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.pac\n",
88 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.sa\n",
89 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.known_indels_20120518.vcf\n",
90 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.known_indels_20120518.vcf.idx\n",
91 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.tile_db_header.vcf\n",
92 | "gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.vid\n",
93 | "gs://gcp-public-data--broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf\n",
94 | "gs://gcp-public-data--broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf.idx\n",
95 | "gs://gcp-public-data--broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.vcf.gz\n",
96 | "gs://gcp-public-data--broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.vcf.gz.tbi\n",
97 | "gs://gcp-public-data--broad-references/hg19/v0/README\n",
98 | "gs://gcp-public-data--broad-references/hg19/v0/WholeGenomeShotgunContam.vcf\n",
99 | "gs://gcp-public-data--broad-references/hg19/v0/WholeGenomeShotgunContam.vcf.idx\n",
100 | "gs://gcp-public-data--broad-references/hg19/v0/broad_vid.json\n",
101 | "gs://gcp-public-data--broad-references/hg19/v0/dbsnp_135.b37.vcf.gz\n",
102 | "gs://gcp-public-data--broad-references/hg19/v0/dbsnp_135.b37.vcf.gz.tbi\n",
103 | "gs://gcp-public-data--broad-references/hg19/v0/dbsnp_138.b37.vcf.gz\n",
104 | "gs://gcp-public-data--broad-references/hg19/v0/dbsnp_138.b37.vcf.gz.tbi\n",
105 | "gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz\n",
106 | "gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz.tbi\n",
107 | "gs://gcp-public-data--broad-references/hg19/v0/wgs_calling_regions.v1.chr20.interval_list\n",
108 | "gs://gcp-public-data--broad-references/hg19/v0/wgs_calling_regions.v1.interval_list\n",
109 | "gs://gcp-public-data--broad-references/hg19/v0/wgs_evaluation_regions.v1.interval_list\n",
110 | "gs://gcp-public-data--broad-references/hg19/v0/TileDB/\n",
111 | "gs://gcp-public-data--broad-references/hg19/v0/genomestrip/\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "!gsutil ls gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/\n",
117 | "!gsutil ls gs://gcp-public-data--broad-references/hg19/v0/"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 2,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. \n",
127 | "# Afterwards, restart the kernel with Kernel > Restart.\n",
128 | "#! pip install google-cloud-storage"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 5,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "# Write results to /home/jupyter-user/3-somatic-cna/sandbox/. \n",
138 | "# To access the directory, click on the upper-left jupyter icon.\n",
139 | "!mkdir -p /home/jupyter-user/CNN/Output/"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "---\n",
147 | "## Run the default 1D model on the VCF with CNNScoreVariants\n",
148 | "\n",
149 | "CNNScoreVariant is a pre-trained Convolutional Neural Network tool to score variants. This tool uses machine learning to differentiate between good variants and artifacts of the sequencing process, a fairly new approach that is especially effective at correctly calling indels. \n",
150 | "\n",
151 | "> **VQSR and Hard-filtering only takes into account variant annotations. However, CNNScoreVariants 1D Model evaluates a) annotations AND b) reference files, +-64bases from the variant. Example: it accounts for regions in the ref file that RE difficult to sequence.**\n",
152 | "\n",
153 | "To enable the models to accurately filter and score variants from VCF files, we trained on validated VCFs (from truth models including **SynDip, Genomes in a bottle, and Platinum Genomes**) with unvalidated VCFs aligned to different reference builds (**HG19, HG38**), sequenced on different machines, using different protocols. "
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {
160 | "scrolled": true
161 | },
162 | "outputs": [
163 | {
164 | "name": "stdout",
165 | "output_type": "stream",
166 | "text": [
167 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
168 | "Running:\n",
169 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CNNScoreVariants -V gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz -O /home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf -R gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta\n",
170 | "05:28:55.239 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
171 | "05:28:56.595 INFO CNNScoreVariants - ------------------------------------------------------------\n",
172 | "05:28:56.595 INFO CNNScoreVariants - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
173 | "05:28:56.595 INFO CNNScoreVariants - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
174 | "05:28:56.596 INFO CNNScoreVariants - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
175 | "05:28:56.596 INFO CNNScoreVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
176 | "05:28:56.596 INFO CNNScoreVariants - Start Date/Time: March 21, 2019 5:28:54 AM UTC\n",
177 | "05:28:56.596 INFO CNNScoreVariants - ------------------------------------------------------------\n",
178 | "05:28:56.596 INFO CNNScoreVariants - ------------------------------------------------------------\n",
179 | "05:28:56.597 INFO CNNScoreVariants - HTSJDK Version: 2.18.2\n",
180 | "05:28:56.597 INFO CNNScoreVariants - Picard Version: 2.18.25\n",
181 | "05:28:56.597 INFO CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
182 | "05:28:56.597 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
183 | "05:28:56.597 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
184 | "05:28:56.597 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
185 | "05:28:56.598 INFO CNNScoreVariants - Deflater: IntelDeflater\n",
186 | "05:28:56.598 INFO CNNScoreVariants - Inflater: IntelInflater\n",
187 | "05:28:56.598 INFO CNNScoreVariants - GCS max retries/reopens: 20\n",
188 | "05:28:56.598 INFO CNNScoreVariants - Requester pays: disabled\n",
189 | "05:28:56.598 INFO CNNScoreVariants - Initializing engine\n",
190 | "05:29:02.651 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz\n",
191 | "05:29:05.153 INFO CNNScoreVariants - Done initializing engine\n",
192 | "05:29:05.154 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so\n",
193 | "05:29:30.133 INFO CNNScoreVariants - Using key:CNN_1D for CNN architecture:/tmp/1d_cnn_mix_train_full_bn.635470853536329430.json and weights:/tmp/1d_cnn_mix_train_full_bn.6224138842643266956.hd5\n",
194 | "05:29:33.179 INFO ProgressMeter - Starting traversal\n",
195 | "05:29:33.179 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute\n",
196 | "05:29:33.180 INFO CNNScoreVariants - Starting first pass through the variants\n",
197 | "05:29:43.315 INFO ProgressMeter - 20:2380323 0.2 3000 17816.5\n",
198 | "05:29:56.948 INFO ProgressMeter - 20:4076408 0.4 6000 15145.8\n",
199 | "05:30:08.814 INFO ProgressMeter - 20:5432203 0.6 9000 15153.6\n",
200 | "05:30:22.116 INFO ProgressMeter - 20:7003252 0.8 12000 14712.8\n",
201 | "05:30:33.458 INFO ProgressMeter - 20:8300381 1.0 14000 13935.2\n",
202 | "05:30:42.337 INFO CNNScoreVariants - Finished first pass through the variants\n",
203 | "05:30:47.740 INFO CNNScoreVariants - Starting second pass through the variants\n",
204 | "05:30:48.179 INFO ProgressMeter - 20:1080814 1.3 16000 12800.0\n",
205 | "05:30:48.982 INFO CNNScoreVariants - No variants filtered by: AllowAllVariantsVariantFilter\n",
206 | "05:30:48.982 INFO CNNScoreVariants - No reads filtered by: AllowAllReadsReadFilter\n",
207 | "05:30:48.982 INFO ProgressMeter - 20:8840293 1.3 31742 25124.6\n",
208 | "05:30:48.982 INFO ProgressMeter - Traversal complete. Processed 31742 total variants in 1.3 minutes.\n",
209 | "05:30:48.982 INFO CNNScoreVariants - Done scoring variants with CNN.\n",
210 | "05:30:48.996 INFO CNNScoreVariants - Shutting down engine\n",
211 | "[March 21, 2019 5:30:48 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.CNNScoreVariants done. Elapsed time: 1.90 minutes.\n",
212 | "Runtime.totalMemory()=562561024\n"
213 | ]
214 | }
215 | ],
216 | "source": [
217 | "!gatk CNNScoreVariants \\\n",
218 | "-V gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz \\\n",
219 | "-O /home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf \\\n",
220 | "-R gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "The output VCF my_1d_cnn_scored.vcf will now have an INFO field CNN_1D which corresponds to the score assigned by 1D model.\n"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "!cat /home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf | grep -v '##' | head -5"
237 | ]
238 | },
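239 |   {
240 |    "cell_type": "markdown",
241 |    "metadata": {},
242 |    "source": [
243 |     "As an optional sanity check (a sketch using standard shell tools, not part of the original worksheet), pull just the CNN_1D annotations out of the INFO column:"
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "code",
248 |    "execution_count": null,
249 |    "metadata": {},
250 |    "outputs": [],
251 |    "source": [
252 |     "# Show the CNN_1D score assigned to each of the first five records\n",
253 |     "!grep -v '^#' /home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf | head -5 | grep -o 'CNN_1D=[^;]*'"
254 |    ]
255 |   },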
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "## Apply filters to the VCF based on the CNN_1D score with the FilterVariantTranches tool\n",
244 | "\n",
245 | "After scoring, you can filter your VCF by applying a sensitivity threshold with the tool FilterVariantTranches. "
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 8,
251 | "metadata": {},
252 | "outputs": [
253 | {
254 | "name": "stdout",
255 | "output_type": "stream",
256 | "text": [
257 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
258 | "Running:\n",
259 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar FilterVariantTranches -V /home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf --resource gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz --resource gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz --info-key CNN_1D --snp-tranche 95.9 --indel-tranche 95.0 -O /home/jupyter-user/CNN/Output/my_1d_filtered.vcf --invalidate-previous-filters\n",
260 | "05:30:53.802 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
261 | "05:30:54.005 INFO FilterVariantTranches - ------------------------------------------------------------\n",
262 | "05:30:54.005 INFO FilterVariantTranches - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
263 | "05:30:54.006 INFO FilterVariantTranches - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
264 | "05:30:54.006 INFO FilterVariantTranches - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
265 | "05:30:54.006 INFO FilterVariantTranches - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
266 | "05:30:54.006 INFO FilterVariantTranches - Start Date/Time: March 21, 2019 5:30:53 AM UTC\n",
267 | "05:30:54.006 INFO FilterVariantTranches - ------------------------------------------------------------\n",
268 | "05:30:54.006 INFO FilterVariantTranches - ------------------------------------------------------------\n",
269 | "05:30:54.007 INFO FilterVariantTranches - HTSJDK Version: 2.18.2\n",
270 | "05:30:54.007 INFO FilterVariantTranches - Picard Version: 2.18.25\n",
271 | "05:30:54.007 INFO FilterVariantTranches - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
272 | "05:30:54.007 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
273 | "05:30:54.007 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
274 | "05:30:54.007 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
275 | "05:30:54.008 INFO FilterVariantTranches - Deflater: IntelDeflater\n",
276 | "05:30:54.008 INFO FilterVariantTranches - Inflater: IntelInflater\n",
277 | "05:30:54.008 INFO FilterVariantTranches - GCS max retries/reopens: 20\n",
278 | "05:30:54.008 INFO FilterVariantTranches - Requester pays: disabled\n",
279 | "05:30:54.008 WARN FilterVariantTranches - \n",
280 | "\n",
281 | "\u001b[1m\u001b[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n",
282 | "\n",
283 | " Warning: FilterVariantTranches is an EXPERIMENTAL tool and should not be used for production\n",
284 | "\n",
285 | " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\u001b[0m\n",
286 | "\n",
287 | "\n",
288 | "05:30:54.008 INFO FilterVariantTranches - Initializing engine\n",
289 | "05:30:56.248 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz\n",
290 | "05:30:59.912 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz\n",
291 | "05:31:02.639 INFO FeatureManager - Using codec VCFCodec to read file file:///home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf\n",
292 | "05:31:02.697 INFO FilterVariantTranches - Done initializing engine\n",
293 | "05:31:02.746 INFO ProgressMeter - Starting traversal\n",
294 | "05:31:02.746 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute\n",
295 | "05:31:02.747 INFO FilterVariantTranches - Starting first pass through the variants\n",
296 | "05:31:04.538 INFO FilterVariantTranches - Finished first pass through the variants\n",
297 | "05:31:04.538 INFO FilterVariantTranches - Found 12929 SNPs and 2932 indels with INFO score key:CNN_1D.\n",
298 | "05:31:04.539 INFO FilterVariantTranches - Found 6669 SNPs and 37 indels in the resources.\n",
299 | "05:31:04.552 INFO FilterVariantTranches - Starting second pass through the variants\n",
300 | "05:31:05.236 INFO FilterVariantTranches - No variants filtered by: AllowAllVariantsVariantFilter\n",
301 | "05:31:05.236 INFO FilterVariantTranches - No reads filtered by: AllowAllReadsReadFilter\n",
302 | "05:31:05.237 INFO ProgressMeter - 20:8840293 0.0 31742 764867.5\n",
303 | "05:31:05.237 INFO ProgressMeter - Traversal complete. Processed 31742 total variants in 0.0 minutes.\n",
304 | "05:31:05.237 INFO FilterVariantTranches - Filtered 1221 SNPs out of 12929 and filtered 669 indels out of 2932 with INFO score: CNN_1D.\n",
305 | "05:31:05.249 INFO FilterVariantTranches - Shutting down engine\n",
306 | "[March 21, 2019 5:31:05 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.FilterVariantTranches done. Elapsed time: 0.19 minutes.\n",
307 | "Runtime.totalMemory()=1170735104\n"
308 | ]
309 | }
310 | ],
311 | "source": [
312 | "!gatk FilterVariantTranches \\\n",
313 | "-V /home/jupyter-user/CNN/Output/my_1d_cnn_scored.vcf \\\n",
314 | "--resource gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz \\\n",
315 | "--resource gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz \\\n",
316 | "--info-key CNN_1D \\\n",
317 | "--snp-tranche 95.9 \\\n",
318 | "--indel-tranche 95.0 \\\n",
319 | "-O /home/jupyter-user/CNN/Output/my_1d_filtered.vcf \\\n",
320 | "--invalidate-previous-filters \n"
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "> Now you have neural network filtered VCF!"
328 | ]
329 | },
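330 |   {
331 |    "cell_type": "markdown",
332 |    "metadata": {},
333 |    "source": [
334 |     "To see the effect of the tranche filter (a sketch with standard shell tools; the exact FILTER labels depend on the tranches you chose), tally the FILTER column of the output VCF:"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "code",
339 |    "execution_count": null,
340 |    "metadata": {},
341 |    "outputs": [],
342 |    "source": [
343 |     "# Count records per FILTER status (column 7 of the VCF body)\n",
344 |     "!grep -v '^#' /home/jupyter-user/CNN/Output/my_1d_filtered.vcf | cut -f7 | sort | uniq -c"
345 |    ]
346 |   },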
330 | {
331 | "cell_type": "markdown",
332 | "metadata": {},
333 | "source": [
334 | "## Run the default 2D model on the VCF with CNNScoreVariants"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "The process is quite similar for the 2D model except we will also need to supply a BAM file with DNA read data to CNNScoreVariants. We tell the tool to use the 2D read processing model with the tensor-type argument.\n",
342 | "\n",
343 | "> **CNNScoreVariants 2D Model evaluates a) annotations, b) reference files and c) all variant information from the bam file.**"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": 9,
349 | "metadata": {},
350 | "outputs": [
351 | {
352 | "name": "stdout",
353 | "output_type": "stream",
354 | "text": [
355 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
356 | "Running:\n",
357 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar CNNScoreVariants -I gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/bams/g94982_chr20_1m_10m_bamout.bam -V gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf -R gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta -O /home/jupyter-user/CNN/Output/my_2d_cnn_scored.vcf --tensor-type read_tensor --transfer-batch-size 8 --inference-batch-size 8\n",
358 | "05:31:08.815 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
359 | "05:31:09.048 INFO CNNScoreVariants - ------------------------------------------------------------\n",
360 | "05:31:09.048 INFO CNNScoreVariants - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
361 | "05:31:09.048 INFO CNNScoreVariants - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
362 | "05:31:09.048 INFO CNNScoreVariants - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
363 | "05:31:09.048 INFO CNNScoreVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
364 | "05:31:09.049 INFO CNNScoreVariants - Start Date/Time: March 21, 2019 5:31:08 AM UTC\n",
365 | "05:31:09.049 INFO CNNScoreVariants - ------------------------------------------------------------\n",
366 | "05:31:09.049 INFO CNNScoreVariants - ------------------------------------------------------------\n",
367 | "05:31:09.049 INFO CNNScoreVariants - HTSJDK Version: 2.18.2\n",
368 | "05:31:09.049 INFO CNNScoreVariants - Picard Version: 2.18.25\n",
369 | "05:31:09.049 INFO CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
370 | "05:31:09.050 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
371 | "05:31:09.050 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
372 | "05:31:09.050 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
373 | "05:31:09.050 INFO CNNScoreVariants - Deflater: IntelDeflater\n",
374 | "05:31:09.050 INFO CNNScoreVariants - Inflater: IntelInflater\n",
375 | "05:31:09.050 INFO CNNScoreVariants - GCS max retries/reopens: 20\n",
376 | "05:31:09.050 INFO CNNScoreVariants - Requester pays: disabled\n",
377 | "05:31:09.050 INFO CNNScoreVariants - Initializing engine\n",
378 | "05:31:17.406 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf\n",
379 | "05:31:18.981 INFO CNNScoreVariants - Done initializing engine\n",
380 | "05:31:18.982 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so\n",
381 | "05:31:21.555 INFO CNNScoreVariants - Using key:CNN_2D for CNN architecture:/tmp/small_2d.6187135354216927075.json and weights:/tmp/small_2d.266325595129410085.hd5\n",
382 | "05:31:22.312 INFO ProgressMeter - Starting traversal\n",
383 | "05:31:22.313 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute\n",
384 | "05:31:22.318 INFO CNNScoreVariants - Starting first pass through the variants\n",
385 | "05:32:58.437 INFO CNNScoreVariants - Finished first pass through the variants\n",
386 | "05:33:14.102 INFO CNNScoreVariants - Starting second pass through the variants\n",
387 | "05:33:14.440 INFO ProgressMeter - 20:1068462 1.9 1000 535.1\n",
388 | "05:33:14.528 INFO CNNScoreVariants - No variants filtered by: AllowAllVariantsVariantFilter\n",
389 | "05:33:14.547 INFO CNNScoreVariants - 7679 read(s) filtered by: (WellformedReadFilter AND ReadGroupBlackListReadFilter)\n",
390 | " 7679 read(s) filtered by: ReadGroupBlackListReadFilter \n",
391 | "\n",
392 | "05:33:14.548 INFO ProgressMeter - 20:1068462 1.9 1790 956.9\n",
393 | "05:33:14.548 INFO ProgressMeter - Traversal complete. Processed 1790 total variants in 1.9 minutes.\n",
394 | "05:33:14.548 INFO CNNScoreVariants - Done scoring variants with CNN.\n",
395 | "05:33:14.554 INFO CNNScoreVariants - Shutting down engine\n",
396 | "[March 21, 2019 5:33:14 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.CNNScoreVariants done. Elapsed time: 2.10 minutes.\n",
397 | "Runtime.totalMemory()=1118306304\n"
398 | ]
399 | }
400 | ],
401 | "source": [
402 | "!gatk CNNScoreVariants \\\n",
403 | "-I gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/bams/g94982_chr20_1m_10m_bamout.bam \\\n",
404 | "-V gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf \\\n",
405 | "-R gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta \\\n",
406 | "-O /home/jupyter-user/CNN/Output/my_2d_cnn_scored.vcf \\\n",
407 | "--tensor-type read_tensor \\\n",
408 | "--transfer-batch-size 8 \\\n",
409 | "--inference-batch-size 8"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "## Now apply filters to the VCF based on the CNN_2D score with the FilterVariantTranches tool"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": 10,
422 | "metadata": {},
423 | "outputs": [
424 | {
425 | "name": "stdout",
426 | "output_type": "stream",
427 | "text": [
428 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
429 | "Running:\n",
430 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar FilterVariantTranches -V /home/jupyter-user/CNN/Output/my_2d_cnn_scored.vcf --resource gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz --resource gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz --info-key CNN_2D --snp-tranche 95.9 --indel-tranche 95.0 -O /home/jupyter-user/CNN/Output/my_2d_filtered.vcf --invalidate-previous-filters\n",
431 | "05:33:17.980 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
432 | "05:33:18.214 INFO FilterVariantTranches - ------------------------------------------------------------\n",
433 | "05:33:18.215 INFO FilterVariantTranches - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
434 | "05:33:18.215 INFO FilterVariantTranches - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
435 | "05:33:18.216 INFO FilterVariantTranches - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
436 | "05:33:18.216 INFO FilterVariantTranches - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
437 | "05:33:18.216 INFO FilterVariantTranches - Start Date/Time: March 21, 2019 5:33:17 AM UTC\n",
438 | "05:33:18.216 INFO FilterVariantTranches - ------------------------------------------------------------\n",
439 | "05:33:18.216 INFO FilterVariantTranches - ------------------------------------------------------------\n",
440 | "05:33:18.217 INFO FilterVariantTranches - HTSJDK Version: 2.18.2\n",
441 | "05:33:18.217 INFO FilterVariantTranches - Picard Version: 2.18.25\n",
442 | "05:33:18.217 INFO FilterVariantTranches - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
443 | "05:33:18.217 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
444 | "05:33:18.217 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
445 | "05:33:18.217 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
446 | "05:33:18.217 INFO FilterVariantTranches - Deflater: IntelDeflater\n",
447 | "05:33:18.217 INFO FilterVariantTranches - Inflater: IntelInflater\n",
448 | "05:33:18.217 INFO FilterVariantTranches - GCS max retries/reopens: 20\n",
449 | "05:33:18.217 INFO FilterVariantTranches - Requester pays: disabled\n",
450 | "05:33:18.218 WARN FilterVariantTranches - \n",
451 | "\n",
452 | "\u001b[1m\u001b[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n",
453 | "\n",
454 | " Warning: FilterVariantTranches is an EXPERIMENTAL tool and should not be used for production\n",
455 | "\n",
456 | " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\u001b[0m\n",
457 | "\n",
458 | "\n",
459 | "05:33:18.218 INFO FilterVariantTranches - Initializing engine\n",
460 | "05:33:20.615 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz\n",
461 | "05:33:24.555 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz\n",
462 | "05:33:27.181 INFO FeatureManager - Using codec VCFCodec to read file file:///home/jupyter-user/CNN/Output/my_2d_cnn_scored.vcf\n",
463 | "05:33:27.210 INFO FilterVariantTranches - Done initializing engine\n",
464 | "05:33:27.256 INFO ProgressMeter - Starting traversal\n",
465 | "05:33:27.256 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute\n",
466 | "05:33:27.257 INFO FilterVariantTranches - Starting first pass through the variants\n",
467 | "05:33:28.360 INFO FilterVariantTranches - Finished first pass through the variants\n",
468 | "05:33:28.360 INFO FilterVariantTranches - Found 756 SNPs and 138 indels with INFO score key:CNN_2D.\n",
469 | "05:33:28.361 INFO FilterVariantTranches - Found 422 SNPs and 3 indels in the resources.\n",
470 | "05:33:28.362 INFO FilterVariantTranches - Starting second pass through the variants\n",
471 | "05:33:28.493 INFO FilterVariantTranches - No variants filtered by: AllowAllVariantsVariantFilter\n",
472 | "05:33:28.494 INFO FilterVariantTranches - No reads filtered by: AllowAllReadsReadFilter\n",
473 | "05:33:28.494 INFO ProgressMeter - 20:1068462 0.0 1790 86752.8\n",
474 | "05:33:28.494 INFO ProgressMeter - Traversal complete. Processed 1790 total variants in 0.0 minutes.\n",
475 | "05:33:28.494 INFO FilterVariantTranches - Filtered 56 SNPs out of 756 and filtered 125 indels out of 138 with INFO score: CNN_2D.\n",
476 | "05:33:28.498 INFO FilterVariantTranches - Shutting down engine\n",
477 | "[March 21, 2019 5:33:28 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.FilterVariantTranches done. Elapsed time: 0.18 minutes.\n",
478 | "Runtime.totalMemory()=1067974656\n"
479 | ]
480 | }
481 | ],
482 | "source": [
483 | "!gatk FilterVariantTranches \\\n",
484 | "-V /home/jupyter-user/CNN/Output/my_2d_cnn_scored.vcf \\\n",
485 | "--resource gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz \\\n",
486 | "--resource gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz \\\n",
487 | "--info-key CNN_2D \\\n",
488 | "--snp-tranche 95.9 \\\n",
489 | "--indel-tranche 95.0 \\\n",
490 | "-O /home/jupyter-user/CNN/Output/my_2d_filtered.vcf \\\n",
491 | "--invalidate-previous-filters"
492 | ]
493 | },
494 | {
495 | "cell_type": "markdown",
496 | "metadata": {},
497 | "source": [
498 | "## Evaluate the 2D Model\n",
499 | "Now let’s evaluate how the filter did by running the concordance tool. "
500 | ]
501 | },
502 | {
503 | "cell_type": "code",
504 | "execution_count": 11,
505 | "metadata": {},
506 | "outputs": [
507 | {
508 | "name": "stdout",
509 | "output_type": "stream",
510 | "text": [
511 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
512 | "Running:\n",
513 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar Concordance -truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz -eval /home/jupyter-user/CNN/Output/my_2d_filtered.vcf -L 20:1000000-1432828 -S /home/jupyter-user/CNN/Output/2d_filtered_concordance.txt\n",
514 | "05:33:32.262 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
515 | "05:33:32.509 INFO Concordance - ------------------------------------------------------------\n",
516 | "05:33:32.510 INFO Concordance - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
517 | "05:33:32.510 INFO Concordance - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
518 | "05:33:32.510 INFO Concordance - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
519 | "05:33:32.510 INFO Concordance - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
520 | "05:33:32.511 INFO Concordance - Start Date/Time: March 21, 2019 5:33:32 AM UTC\n",
521 | "05:33:32.511 INFO Concordance - ------------------------------------------------------------\n",
522 | "05:33:32.511 INFO Concordance - ------------------------------------------------------------\n",
523 | "05:33:32.511 INFO Concordance - HTSJDK Version: 2.18.2\n",
524 | "05:33:32.511 INFO Concordance - Picard Version: 2.18.25\n",
525 | "05:33:32.512 INFO Concordance - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
526 | "05:33:32.512 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
527 | "05:33:32.512 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
528 | "05:33:32.512 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
529 | "05:33:32.512 INFO Concordance - Deflater: IntelDeflater\n",
530 | "05:33:32.512 INFO Concordance - Inflater: IntelInflater\n",
531 | "05:33:32.512 INFO Concordance - GCS max retries/reopens: 20\n",
532 | "05:33:32.512 INFO Concordance - Requester pays: disabled\n",
533 | "05:33:32.512 WARN Concordance - \n",
534 | "\n",
535 | "\u001b[1m\u001b[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n",
536 | "\n",
537 | " Warning: Concordance is a BETA tool and is not yet ready for use in production\n",
538 | "\n",
539 | " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\u001b[0m\n",
540 | "\n",
541 | "\n",
542 | "05:33:32.512 INFO Concordance - Initializing engine\n",
543 | "05:33:36.850 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz\n",
544 | "05:33:40.864 INFO IntervalArgumentCollection - Processing 432829 bp from intervals\n",
545 | "05:33:41.240 INFO FeatureManager - Using codec VCFCodec to read file file:///home/jupyter-user/CNN/Output/my_2d_filtered.vcf\n",
546 | "05:33:41.534 INFO Concordance - Done initializing engine\n",
547 | "05:33:41.562 INFO ProgressMeter - Starting traversal\n",
548 | "05:33:41.562 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute\n",
549 | "05:33:43.492 INFO ProgressMeter - unmapped 0.0 895 27838.3\n",
550 | "05:33:43.493 INFO ProgressMeter - Traversal complete. Processed 895 total records in 0.0 minutes.\n",
551 | "05:33:43.578 INFO Concordance - Shutting down engine\n",
552 | "[March 21, 2019 5:33:43 AM UTC] org.broadinstitute.hellbender.tools.walkers.validation.Concordance done. Elapsed time: 0.19 minutes.\n",
553 | "Runtime.totalMemory()=952631296\n",
554 | "Tool returned:\n",
555 | "SUCCESS\n"
556 | ]
557 | }
558 | ],
559 | "source": [
560 | "!gatk Concordance \\\n",
561 | "-truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \\\n",
562 | "-eval /home/jupyter-user/CNN/Output/my_2d_filtered.vcf \\\n",
563 | "-L 20:1000000-1432828 \\\n",
564 | "-S /home/jupyter-user/CNN/Output/2d_filtered_concordance.txt\n"
565 | ]
566 | },
567 | {
568 | "cell_type": "markdown",
569 | "metadata": {},
570 | "source": [
571 | "## Evaluate the unfiltered VCF"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": 12,
577 | "metadata": {},
578 | "outputs": [
579 | {
580 | "name": "stdout",
581 | "output_type": "stream",
582 | "text": [
583 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
584 | "Running:\n",
585 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar Concordance -truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz -eval gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf -L 20:1000000-1432828 -S /home/jupyter-user/CNN/Output/unfiltered_2d_concordance.txt\n",
586 | "05:33:47.361 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
587 | "05:33:47.580 INFO Concordance - ------------------------------------------------------------\n",
588 | "05:33:47.580 INFO Concordance - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
589 | "05:33:47.581 INFO Concordance - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
590 | "05:33:47.581 INFO Concordance - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
591 | "05:33:47.581 INFO Concordance - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
592 | "05:33:47.581 INFO Concordance - Start Date/Time: March 21, 2019 5:33:47 AM UTC\n",
593 | "05:33:47.581 INFO Concordance - ------------------------------------------------------------\n",
594 | "05:33:47.581 INFO Concordance - ------------------------------------------------------------\n",
595 | "05:33:47.582 INFO Concordance - HTSJDK Version: 2.18.2\n",
596 | "05:33:47.582 INFO Concordance - Picard Version: 2.18.25\n",
597 | "05:33:47.582 INFO Concordance - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
598 | "05:33:47.582 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
599 | "05:33:47.582 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
600 | "05:33:47.582 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
601 | "05:33:47.583 INFO Concordance - Deflater: IntelDeflater\n",
602 | "05:33:47.583 INFO Concordance - Inflater: IntelInflater\n",
603 | "05:33:47.583 INFO Concordance - GCS max retries/reopens: 20\n",
604 | "05:33:47.583 INFO Concordance - Requester pays: disabled\n",
605 | "05:33:47.583 WARN Concordance - \n",
606 | "\n",
607 | "\u001b[1m\u001b[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n",
608 | "\n",
609 | " Warning: Concordance is a BETA tool and is not yet ready for use in production\n",
610 | "\n",
611 | " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\u001b[0m\n",
612 | "\n",
613 | "\n",
614 | "05:33:47.583 INFO Concordance - Initializing engine\n",
615 | "05:33:51.851 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz\n",
616 | "05:33:55.608 INFO IntervalArgumentCollection - Processing 432829 bp from intervals\n",
617 | "05:33:58.083 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf\n",
618 | "05:33:59.255 INFO Concordance - Done initializing engine\n",
619 | "05:33:59.264 INFO ProgressMeter - Starting traversal\n",
620 | "05:33:59.264 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute\n",
621 | "05:34:01.259 INFO ProgressMeter - unmapped 0.0 895 26930.8\n",
622 | "05:34:01.259 INFO ProgressMeter - Traversal complete. Processed 895 total records in 0.0 minutes.\n",
623 | "05:34:01.274 INFO Concordance - Shutting down engine\n",
624 | "[March 21, 2019 5:34:01 AM UTC] org.broadinstitute.hellbender.tools.walkers.validation.Concordance done. Elapsed time: 0.23 minutes.\n",
625 | "Runtime.totalMemory()=958922752\n",
626 | "Tool returned:\n",
627 | "SUCCESS\n"
628 | ]
629 | }
630 | ],
631 | "source": [
632 | "!gatk Concordance \\\n",
633 | "-truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \\\n",
634 | "-eval gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf \\\n",
635 | "-L 20:1000000-1432828 \\\n",
636 | "-S /home/jupyter-user/CNN/Output/unfiltered_2d_concordance.txt"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "> Now look at how precision goes up (and sensitivity goes down) as we filter.\n",
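"\n",
"As a reminder, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP). From the tables below: the unfiltered SNPs give precision = 699 / (699 + 57) ≈ 0.925 at sensitivity = 699 / (699 + 0) = 1.0, while the 2D-filtered SNPs give precision = 668 / (668 + 32) ≈ 0.954 at sensitivity = 668 / (668 + 31) ≈ 0.956."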
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": 14,
649 | "metadata": {
650 | "scrolled": false
651 | },
652 | "outputs": [
653 | {
654 | "name": "stdout",
655 | "output_type": "stream",
656 | "text": [
657 | "type\ttrue-positive\tfalse-positive\tfalse-negative\tsensitivity\tprecision\r\n",
658 | "SNP\t699\t57\t0\t1.0\t0.9246031746031746\r\n",
659 | "INDEL\t86\t53\t0\t1.0\t0.6187050359712231\r\n"
660 | ]
661 | }
662 | ],
663 | "source": [
664 | "!cat /home/jupyter-user/CNN/Output/unfiltered_2d_concordance.txt"
665 | ]
666 | },
667 | {
668 | "cell_type": "code",
669 | "execution_count": 13,
670 | "metadata": {},
671 | "outputs": [
672 | {
673 | "name": "stdout",
674 | "output_type": "stream",
675 | "text": [
676 | "type\ttrue-positive\tfalse-positive\tfalse-negative\tsensitivity\tprecision\r\n",
677 | "SNP\t668\t32\t31\t0.9556509298998569\t0.9542857142857143\r\n",
678 | "INDEL\t12\t2\t74\t0.13953488372093023\t0.8571428571428571\r\n"
679 | ]
680 | }
681 | ],
682 | "source": [
683 | "!cat /home/jupyter-user/CNN/Output/2d_filtered_concordance.txt"
684 | ]
685 | },
686 | {
687 | "cell_type": "markdown",
688 | "metadata": {},
689 | "source": [
690 | "## Evaluate the 1D Model "
691 | ]
692 | },
693 | {
694 | "cell_type": "code",
695 | "execution_count": 15,
696 | "metadata": {},
697 | "outputs": [
698 | {
699 | "name": "stdout",
700 | "output_type": "stream",
701 | "text": [
702 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
703 | "Running:\n",
704 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar Concordance -truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz -eval /home/jupyter-user/CNN/Output/my_1d_filtered.vcf -L 20:1000000-9467292 -S /home/jupyter-user/CNN/Output/my_1d_filtered_concordance.txt\n",
705 | "05:34:06.061 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
706 | "05:34:06.289 INFO Concordance - ------------------------------------------------------------\n",
707 | "05:34:06.289 INFO Concordance - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
708 | "05:34:06.289 INFO Concordance - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
709 | "05:34:06.290 INFO Concordance - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
710 | "05:34:06.290 INFO Concordance - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
711 | "05:34:06.290 INFO Concordance - Start Date/Time: March 21, 2019 5:34:06 AM UTC\n",
712 | "05:34:06.290 INFO Concordance - ------------------------------------------------------------\n",
713 | "05:34:06.290 INFO Concordance - ------------------------------------------------------------\n",
714 | "05:34:06.291 INFO Concordance - HTSJDK Version: 2.18.2\n",
715 | "05:34:06.291 INFO Concordance - Picard Version: 2.18.25\n",
716 | "05:34:06.291 INFO Concordance - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
717 | "05:34:06.291 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
718 | "05:34:06.291 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
719 | "05:34:06.291 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
720 | "05:34:06.291 INFO Concordance - Deflater: IntelDeflater\n",
721 | "05:34:06.291 INFO Concordance - Inflater: IntelInflater\n",
722 | "05:34:06.292 INFO Concordance - GCS max retries/reopens: 20\n",
723 | "05:34:06.292 INFO Concordance - Requester pays: disabled\n",
724 | "05:34:06.292 WARN Concordance - \n",
725 | "\n",
726 | "\u001b[1m\u001b[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n",
727 | "\n",
728 | " Warning: Concordance is a BETA tool and is not yet ready for use in production\n",
729 | "\n",
730 | " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\u001b[0m\n",
731 | "\n",
732 | "\n",
733 | "05:34:06.292 INFO Concordance - Initializing engine\n",
734 | "05:34:10.557 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz\n",
735 | "05:34:14.312 INFO IntervalArgumentCollection - Processing 8467293 bp from intervals\n",
736 | "05:34:14.697 INFO FeatureManager - Using codec VCFCodec to read file file:///home/jupyter-user/CNN/Output/my_1d_filtered.vcf\n",
737 | "05:34:15.081 INFO Concordance - Done initializing engine\n",
738 | "05:34:15.093 INFO ProgressMeter - Starting traversal\n",
739 | "05:34:15.095 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute\n",
740 | "05:34:18.361 INFO ProgressMeter - 20:8799299 0.1 15906 292300.2\n",
741 | "05:34:18.362 INFO ProgressMeter - Traversal complete. Processed 15906 total records in 0.1 minutes.\n",
742 | "05:34:18.382 INFO Concordance - Shutting down engine\n",
743 | "[March 21, 2019 5:34:18 AM UTC] org.broadinstitute.hellbender.tools.walkers.validation.Concordance done. Elapsed time: 0.21 minutes.\n",
744 | "Runtime.totalMemory()=1084751872\n",
745 | "Tool returned:\n",
746 | "SUCCESS\n"
747 | ]
748 | }
749 | ],
750 | "source": [
751 | "!gatk Concordance \\\n",
752 | "-truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \\\n",
753 | "-eval /home/jupyter-user/CNN/Output/my_1d_filtered.vcf \\\n",
754 | "-L 20:1000000-9467292 \\\n",
755 | "-S /home/jupyter-user/CNN/Output/my_1d_filtered_concordance.txt\n"
756 | ]
757 | },
758 | {
759 | "cell_type": "markdown",
760 | "metadata": {},
761 | "source": [
762 | "## Evaluate the unfiltered VCF"
763 | ]
764 | },
765 | {
766 | "cell_type": "code",
767 | "execution_count": 16,
768 | "metadata": {},
769 | "outputs": [
770 | {
771 | "name": "stdout",
772 | "output_type": "stream",
773 | "text": [
774 | "Using GATK jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar\n",
775 | "Running:\n",
776 | " java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar Concordance -truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz -eval gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz -L 20:1000000-9467292 -S /home/jupyter-user/CNN/Output/unfiltered_concordance.txt\n",
777 | "05:34:22.592 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/etc/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so\n",
778 | "05:34:22.944 INFO Concordance - ------------------------------------------------------------\n",
779 | "05:34:22.945 INFO Concordance - The Genome Analysis Toolkit (GATK) v4.1.0.0\n",
780 | "05:34:22.945 INFO Concordance - For support and documentation go to https://software.broadinstitute.org/gatk/\n",
781 | "05:34:22.945 INFO Concordance - Executing as jupyter-user@saturn-f5de5fcd-6bbf-4424-8ae2-0ef73d2e5490-m on Linux v4.9.0-8-amd64 amd64\n",
782 | "05:34:22.945 INFO Concordance - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_201-b09\n",
783 | "05:34:22.945 INFO Concordance - Start Date/Time: March 21, 2019 5:34:22 AM UTC\n",
784 | "05:34:22.946 INFO Concordance - ------------------------------------------------------------\n",
785 | "05:34:22.946 INFO Concordance - ------------------------------------------------------------\n",
786 | "05:34:22.946 INFO Concordance - HTSJDK Version: 2.18.2\n",
787 | "05:34:22.946 INFO Concordance - Picard Version: 2.18.25\n",
788 | "05:34:22.946 INFO Concordance - HTSJDK Defaults.COMPRESSION_LEVEL : 2\n",
789 | "05:34:22.947 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false\n",
790 | "05:34:22.947 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true\n",
791 | "05:34:22.947 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false\n",
792 | "05:34:22.947 INFO Concordance - Deflater: IntelDeflater\n",
793 | "05:34:22.947 INFO Concordance - Inflater: IntelInflater\n",
794 | "05:34:22.947 INFO Concordance - GCS max retries/reopens: 20\n",
795 | "05:34:22.947 INFO Concordance - Requester pays: disabled\n",
796 | "05:34:22.947 WARN Concordance - \n",
797 | "\n",
798 | "\u001b[1m\u001b[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n",
799 | "\n",
800 | " Warning: Concordance is a BETA tool and is not yet ready for use in production\n",
801 | "\n",
802 | " !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\u001b[0m\n",
803 | "\n",
804 | "\n",
805 | "05:34:22.947 INFO Concordance - Initializing engine\n",
806 | "05:34:27.443 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz\n",
807 | "05:34:33.586 INFO IntervalArgumentCollection - Processing 8467293 bp from intervals\n",
808 | "05:34:36.730 INFO FeatureManager - Using codec VCFCodec to read file gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz\n",
809 | "05:34:38.054 INFO Concordance - Done initializing engine\n",
810 | "05:34:38.063 INFO ProgressMeter - Starting traversal\n",
811 | "05:34:38.063 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute\n",
812 | "05:34:40.590 INFO ProgressMeter - 20:8787413 0.0 15914 378004.8\n",
813 | "05:34:40.591 INFO ProgressMeter - Traversal complete. Processed 15914 total records in 0.0 minutes.\n",
814 | "05:34:40.611 INFO Concordance - Shutting down engine\n",
815 | "[March 21, 2019 5:34:40 AM UTC] org.broadinstitute.hellbender.tools.walkers.validation.Concordance done. Elapsed time: 0.30 minutes.\n",
816 | "Runtime.totalMemory()=1098383360\n",
817 | "Tool returned:\n",
818 | "SUCCESS\n"
819 | ]
820 | }
821 | ],
822 | "source": [
823 | "!gatk Concordance \\\n",
824 | "-truth gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \\\n",
825 | "-eval gs://gatk-tutorials/workshop_1903/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz \\\n",
826 | "-L 20:1000000-9467292 \\\n",
827 | "-S /home/jupyter-user/CNN/Output/unfiltered_concordance.txt\n"
828 | ]
829 | },
830 | {
831 | "cell_type": "markdown",
832 | "metadata": {},
833 | "source": [
834 | "> **Now look at how precision goes up (and sensitivity goes down) as we filter.**"
835 | ]
836 | },
837 | {
838 | "cell_type": "code",
839 | "execution_count": 18,
840 | "metadata": {},
841 | "outputs": [
842 | {
843 | "name": "stdout",
844 | "output_type": "stream",
845 | "text": [
846 | "type\ttrue-positive\tfalse-positive\tfalse-negative\tsensitivity\tprecision\r\n",
847 | "SNP\t10715\t1218\t558\t0.9505011975516722\t0.8979301097796027\r\n",
848 | "INDEL\t1617\t1022\t109\t0.9368482039397451\t0.6127320954907162\r\n"
849 | ]
850 | }
851 | ],
852 | "source": [
853 | "!cat /home/jupyter-user/CNN/Output/unfiltered_concordance.txt"
854 | ]
855 | },
856 | {
857 | "cell_type": "code",
858 | "execution_count": 17,
859 | "metadata": {},
860 | "outputs": [
861 | {
862 | "name": "stdout",
863 | "output_type": "stream",
864 | "text": [
865 | "type\ttrue-positive\tfalse-positive\tfalse-negative\tsensitivity\tprecision\r\n",
866 | "SNP\t10576\t1132\t697\t0.9381708507052249\t0.9033139733515545\r\n",
867 | "INDEL\t1582\t691\t144\t0.9165701042873696\t0.6959964804223493\r\n"
868 | ]
869 | }
870 | ],
871 | "source": [
872 | "!cat /home/jupyter-user/CNN/Output/my_1d_filtered_concordance.txt"
873 | ]
874 | },
875 | {
876 | "cell_type": "markdown",
877 | "metadata": {},
878 | "source": [
879 | "> Finally, you can train your own models with the tools CNNVariantWriteTensors and CNNVariantTrain, as long as you have validated VCFs to use as training data.\n",
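"\n",
"A minimal sketch of that training workflow, based on the GATK tool docs (the reference, VCFs, confident-region BED, tensor directory, and model name here are all placeholders; check each tool's --help for the exact argument list):\n",
"```\n",
"gatk CNNVariantWriteTensors \\\n",
"    -R ref.fasta \\\n",
"    -V my_validated_calls.vcf \\\n",
"    -truth-vcf truth.vcf \\\n",
"    -truth-bed confident_region.bed \\\n",
"    -tensor-type reference \\\n",
"    -output-tensor-dir my_tensors/\n",
"\n",
"gatk CNNVariantTrain \\\n",
"    -input-tensor-dir my_tensors/ \\\n",
"    -tensor-type reference \\\n",
"    -model-name my_1d_model\n",
"```"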
880 | ]
881 | },
882 | {
883 | "cell_type": "code",
884 | "execution_count": null,
885 | "metadata": {},
886 | "outputs": [],
887 | "source": []
888 | }
889 | ],
890 | "metadata": {
891 | "kernelspec": {
892 | "display_name": "Python 3",
893 | "language": "python",
894 | "name": "python3"
895 | },
896 | "language_info": {
897 | "codemirror_mode": {
898 | "name": "ipython",
899 | "version": 3
900 | },
901 | "file_extension": ".py",
902 | "mimetype": "text/x-python",
903 | "name": "python",
904 | "nbconvert_exporter": "python",
905 | "pygments_lexer": "ipython3",
906 | "version": "3.6.8"
907 | },
908 | "toc": {
909 | "base_numbering": 1,
910 | "nav_menu": {},
911 | "number_sections": true,
912 | "sideBar": true,
913 | "skip_h1_title": false,
914 | "title_cell": "Table of Contents",
915 | "title_sidebar": "Contents",
916 | "toc_cell": false,
917 | "toc_position": {
918 | "height": "calc(100% - 180px)",
919 | "left": "10px",
920 | "top": "150px",
921 | "width": "165px"
922 | },
923 | "toc_section_display": true,
924 | "toc_window_display": false
925 | }
926 | },
927 | "nbformat": 4,
928 | "nbformat_minor": 2
929 | }
930 |
--------------------------------------------------------------------------------
/notebooks/Day3-Somatic/2-somatic-cna-tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "GATK TUTORIAL :: Somatic CNA :: Worksheet\n",
8 | "===================="
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "**March 2019** \n",
16 | "\n",
17 | "This GATK tutorial corresponds to a section of the GATK workshop worksheet named GATK TUTORIAL Somatic CNA Worksheet, available at [https://drive.google.com/drive/folders/1CZnuBm0z0sbLL6UA8cKHxV7uiFZxCl3M](https://drive.google.com/drive/folders/1CZnuBm0z0sbLL6UA8cKHxV7uiFZxCl3M). \n",
18 | "\n",
19 | "This hands-on tutorial outlines steps to sensitively detect alterations in total and allelic copy ratios using GATK4's ModelSegments CNA workflow. The workflow is suited to detecting somatic copy ratio alterations, more familiarly known as copy number alterations (CNAs) or copy number variants (CNVs), in whole genomes and targeted exomes.\n",
20 | "\n",
21 | "\n",
22 | "\n",
23 | "This tutorial was last tested with the GATK4.1.0.0 Docker image. If the system's Docker engine limits memory, increase the memory available to Docker to at least 8 GB; otherwise, the commands will error. \n",
24 | "\n",
25 | "---\n",
26 | "**Table of Contents** \n",
27 | "1. NOTES ON THE WORKFLOW BETA STATUS AND TUTORIAL DATA\t \n",
28 | " 1.1 Differences between GATK4 and GATK4.beta CNA workflows \n",
29 | " 1.2 Tutorial switches between data subset to chr17 and full data\n",
30 | "\n",
31 | "2. PERFORM COVERAGE ANALYSIS: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA CNA \n",
32 | " 2.1 Prepare intervals for coverage collection \n",
33 | " 2.2 Collect read counts for samples across target intervals \n",
34 | " 2.3 Create CNA panel of normals (PoN) \n",
35 | " 2.4 Remove noise from sample coverage using the PoN \n",
36 | " 2.5 Perform segmentation based on coverage alone \n",
37 | "\n",
38 | "3. INCORPORATE ALLELIC DATA: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA ACNA \n",
39 | " 3.1 Perform segmentation jointly with coverage and allelic data \n",
40 | " 3.2 Perform segmentation with allelic data alone \n",
41 | "---"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "## Differences between GATK4 and GATK4.beta CNA workflows\n",
49 | "The workflow in the official GATK4 release differs from that of the GATK4.BETA release. On the surface, two differences stand out. First, the official workflow is capable of efficiently handling WGS data. Second, it incorporates functionality that considers allelic data, which previously was a separate workflow. Note that the official GATK4 release CNA workflow itself is still in beta. This means the workflow is still undergoing adjustments. \n",
50 | "\n",
51 | "Tool-wise, differences are as follows. Note that you cannot substitute a PoN created with one workflow in the other workflow. \n",
52 | "\n",
53 | "\n",
54 | "| GATK4.0.3+ | GATK4.beta | Description |\n",
55 | "| :---- | :---- | :---- |\n",
56 | "| PreprocessIntervals | PadTargets | Pad or bin intervals |\n",
57 | "| CollectReadCounts* | CalculateTargetCoverage | Collect read counts |\n",
58 | "| CreateReadCountPanelOfNormals | CreatePanelOfNormals| Create the PoN |\n",
59 | "| DenoiseReadCounts | NormalizeSomaticReadCounts | Denoise case sample counts against the PoN |\n",
60 | "| CollectAllelicCounts | CollectAllelicCounts | Count alleles |\n",
61 | "| ModelSegments | PerformSegmentation, ACNV workflow tools | Group and model contiguous copy-ratios and allele fractions |\n",
62 | "| CallCopyRatioSegments | CallSegments | Call copy-neutral (0), loss (-), and gain (+) segments |\n",
63 | "| PlotDenoisedCopyRatios & PlotModeledSegments | PlotSegmentedCopyRatio, PlotACNVResults | Plot copy ratios and allele fractions to visualize denoising and segmentation |\n",
64 | "\n",
65 | "---\n"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "## Tutorial switches between data subset to chr17 and full data\n",
73 | "We use 1000 Genomes Project (1KGP) data and the HCC1143 matched normal and tumor samples that we also use in the Mutect2 tutorial. Note that the tutorial coverage data originates from a previous iteration of the workflow (prior to v4.0.3) that used CollectFragmentCounts instead of CollectReadCounts*.\n",
74 | "\n",
75 | "- Panel of normals samples are Phase 3 1KGP samples aligned to GRCh38.\n",
76 | "- Case sample data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are appropriately consented and known as HCC1143 and HCC1143_BL, respectively. \n",
77 | "- Target intervals are an intersection of the HCC capture kit targets and 1KGP WES targets. Targets were converted from GRCh37 to GRCh38 coordinates using UCSC liftOver.\n",
78 | "\n",
79 | "Note the tutorial switches between data subset to chr17 and full data. At any point, you can use the input files provided in the cna_precomputed folder instead of the sandbox files generated during the tutorial. [gs://gatk-tutorials/workshop_1903/3-somatic/cna_precomputed](https://console.cloud.google.com/storage/browser/gatk-tutorials/workshop_1903/3-somatic/cna_precomputed/?project=broad-dsde-outreach&organizationId=548622027621)"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "### First, make sure the notebook is using a Python 3 kernel in the top right corner.\n",
87 | "A kernel is a _computational engine_ that executes the code in the notebook. We can execute GATK commands using _Python Magic_ (`!`)."
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### How to run this notebook:\n",
95 | "- **Click to select a gray cell, then press SHIFT+ENTER to run the cell.**\n",
96 | "- **Write results to `/home/jupyter-user/3-somatic-cna/sandbox/`. To access the directory, click on the upper-left jupyter icon.**"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "# Create your sandbox directory\n",
106 | "! mkdir -p /home/jupyter-user/3-somatic-cna/sandbox/\n",
107 | "# Create directory to store plots\n",
108 | "! mkdir /home/jupyter-user/3-somatic-cna/sandbox/cna_plots/\n",
109 | "# Remove any old sandbox symbolic link and add a new one\n",
110 | "! rm -f sandbox\n",
111 | "! ln -s /home/jupyter-user/3-somatic-cna/sandbox sandbox\n",
112 | "! ls sandbox/"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "### Enable reading Google bucket data "
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "# Check if data is accessible. The command should list several gs:// URLs.\n",
129 | "! gsutil ls gs://gatk-tutorials/workshop_1903/3-somatic/"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {
136 | "scrolled": true
137 | },
138 | "outputs": [],
139 | "source": [
140 | "# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. \n",
141 | "# Afterwards, restart the kernel with Kernel > Restart.\n",
142 | "#! pip install google-cloud-storage"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "### Install R packages"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "# Install R packages for plotting\n",
159 | "! echo \"install.packages(c(\\\"optparse\\\",\\\"data.table\\\"))\" | R --no-save"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "### Download Data Locally\n",
167 | "Some tools are not able to read directly from a Google bucket, so here we download the files locally."
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "! mkdir /home/jupyter-user/3-somatic-cna/cna_inputs\n",
177 | "! gsutil cp gs://gatk-tutorials/workshop_1903/3-somatic/cna_inputs/* /home/jupyter-user/3-somatic-cna/cna_inputs/\n",
178 | "! mkdir /home/jupyter-user/3-somatic-cna/ref/\n",
179 | "! gsutil cp gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.dict /home/jupyter-user/3-somatic-cna/ref/"
180 | ]
181 | },
182 | {
183 | "cell_type": "markdown",
184 | "metadata": {},
185 | "source": [
186 | "---\n",
187 | "\n",
188 | "# PERFORM COVERAGE ANALYSIS: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA CNA"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "## Prepare intervals for coverage collection\n",
196 | "We define the genomic regions in which we expect read coverage. Since we are using exome data, we will pad the target regions. Padding target regions 250 bases on each side has been shown to increase sensitivity for the CNA workflow. In the case of whole genome data, we would divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.\n",
197 | " \n",
198 | "The --bin-length value must be set for different data types, e.g. the default 1000 for whole genomes or 0 for exomes. For the tutorial exome data, we provide a snippet of the capture kit target regions and set --bin-length to zero.\n",
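"\n",
"For reference, a sketch of the whole genome case (no -L targets, default 1000 bp bins, no padding; the file names here are placeholders):\n",
"```\n",
"gatk PreprocessIntervals \\\n",
"    -R ref.fasta \\\n",
"    --bin-length 1000 \\\n",
"    --padding 0 \\\n",
"    --interval-merging-rule OVERLAPPING_ONLY \\\n",
"    -O wgs.preprocessed.interval_list\n",
"```"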
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {
205 | "scrolled": true
206 | },
207 | "outputs": [],
208 | "source": [
209 | " ! gatk PreprocessIntervals \\\n",
210 | " -L gs://gatk-tutorials/workshop_1903/3-somatic/resources/targets_chr17.interval_list \\\n",
211 | " -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \\\n",
212 | " --padding 250 \\\n",
213 | " --bin-length 0 \\\n",
214 | " --interval-merging-rule OVERLAPPING_ONLY \\\n",
215 | " -O /home/jupyter-user/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {
221 | "scrolled": true
222 | },
223 | "source": [
224 | "This produces a Picard-style intervals list targets_chr17.preprocessed.interval_list with 11,307 targets for use in the coverage collection step."
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "➤ Peruse both the before and after intervals. Do we have the same number of intervals as before? How does the tool pad intervals that are less than 500bp apart? \n",
232 | "➤ Take a look at the tool documentation's description of -imr OVERLAPPING_ONLY. What does this option ensure?"
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
239 | "---\n",
240 | "## Collect read counts for samples across target intervals"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "The basis for detecting amplification and deletion events from sequencing data is read coverage. In this step, we count the number of read starts that overlap each interval using CollectReadCounts. We perform this step for the tumor sample and for the normal sample.\n",
248 | "\n",
249 | "By default, the tool writes HDF5 format data, which is handled more efficiently by downstream tools (decreases runtime by reducing time spent on IO). Here we change the output format to TSV for teaching purposes. "
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "metadata": {
256 | "scrolled": true
257 | },
258 | "outputs": [],
259 | "source": [
260 | "! gatk CollectReadCounts \\\n",
261 | " -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/tumor.bam \\\n",
262 | " -L /home/jupyter-user/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list \\\n",
263 | " -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \\\n",
264 | " --format TSV \\\n",
265 | " -imr OVERLAPPING_ONLY \\\n",
266 | " -O /home/jupyter-user/3-somatic-cna/sandbox/tumor.counts.tsv"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "metadata": {
273 | "scrolled": true
274 | },
275 | "outputs": [],
276 | "source": [
277 | "! gatk CollectReadCounts \\\n",
278 | " -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/normal.bam \\\n",
279 | " -L /home/jupyter-user/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list \\\n",
280 | " -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \\\n",
281 | " --format TSV \\\n",
282 | " -imr OVERLAPPING_ONLY \\\n",
283 | " -O /home/jupyter-user/3-somatic-cna/sandbox/normal.counts.tsv"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "Here we show the raw counts per target (y-axis) for the normal and the tumor across 23 chromosomes (x-axis), produced by a previous iteration of the workflow that used a now deprecated tool, CollectFragmentCounts. Each target is represented by a point.\n",
291 | "*(figure omitted: raw counts per target for the normal and tumor samples)*\n",
292 | "➤ Can you tell if either sample has copy number variants?\n",
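"\n",
"To peek at the collected counts yourself, skip past the SAM-style header lines (which begin with @) -- a quick sketch with standard shell tools:\n",
"```\n",
"! grep -v \"^@\" /home/jupyter-user/3-somatic-cna/sandbox/tumor.counts.tsv | head -5\n",
"```"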
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "---\n",
300 | "## Create CNA panel of normals (PoN)"
301 | ]
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "metadata": {},
306 | "source": [
307 | "Now we generate the CNA PoN with CreateReadCountPanelOfNormals. The tool creates a panel of normals that forms the baseline, i.e. the norm against which the workflow compares case samples. The tool uses Singular Value Decomposition (SVD), a type of Principal Component Analysis, to capture systematic noise as distinct from random statistical noise.\n",
308 | "\n",
309 | "Normally, you will want to create a PoN with some number of normal samples that were ideally subject to the same batch effects as your case sample under scrutiny. This tutorial will use a PoN made of forty 1KGP normal samples and generated with the following command: \n",
310 | "\n",
311 | "```\n",
312 | "gatk --java-options \"-Xmx6500m\" CreateReadCountPanelOfNormals \\\n",
313 | " -I file1_clean.counts.hdf5 \\\n",
314 | " … \n",
315 | " -I file40_clean.counts.hdf5 \\\n",
316 | " --minimum-interval-median-percentile 5.0 \\\n",
317 | " -O cnaponC.pon.hdf5\n",
318 | "```\n",
319 | "\n",
320 | "Changing the `--minimum-interval-median-percentile` argument from the default of 10.0 to a smaller value of 5.0 allows retention of more data, which is appropriate for this carefully selected normals cohort. With this parameter, the tool filters out targets or bins with a median fractional coverage below this percentile. The median is across the samples. The fractional coverage is the target coverage divided by the sum of the coverage of all targets for a sample.\n",
321 | "\n",
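"A toy illustration of this filter in Python (a sketch with a made-up counts matrix, not the tool's actual implementation):\n",
"```\n",
"import numpy as np\n",
"\n",
"# rows = samples, columns = targets; values = raw read counts\n",
"counts = np.array([[100., 200., 4., 300.],\n",
"                   [120., 180., 3., 310.]])\n",
"# fractional coverage: each target's count over the sample's total\n",
"frac = counts / counts.sum(axis=1, keepdims=True)\n",
"# median fractional coverage per target, taken across samples\n",
"medians = np.median(frac, axis=0)\n",
"# keep targets whose median is at or above the 5th percentile\n",
"keep = medians >= np.percentile(medians, 5.0)\n",
"print(keep)  # the low-coverage third target is dropped\n",
"```\n",
"\n",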
322 | "CreateReadCountPanelOfNormals performs several other filtering steps across samples and across targets, and outlines these in its stdout.\n",
323 | "\n",
324 | "At the very least, the PoN should consist of ten normal samples, ideally subject to the same batch effects as the tumor sample. Our recommendation is forty or more normal samples. To illustrate tool features, we create a PoN from our single matched normal sample with the following command."
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": null,
330 | "metadata": {
331 | "scrolled": true
332 | },
333 | "outputs": [],
334 | "source": [
335 | "! gatk CreateReadCountPanelOfNormals \\\n",
336 | " -I /home/jupyter-user/3-somatic-cna/sandbox/normal.counts.tsv \\\n",
337 | " -O /home/jupyter-user/3-somatic-cna/sandbox/normal.pon.hdf5"
338 | ]
339 | },
340 | {
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
344 | "➤ Study the stdout. Are we losing any data during the filtering steps? Given the reasons one might want to use a matched normal, would you change this command? Remember also that PoN medians are used to standardize case counts (by division).\n",
345 | "\n",
346 | "So far we have been using subset data. Run the CreateReadCountPanelOfNormals command using the full data file cna_inputs/hcc1143_N_clean.counts.hdf5, adjusting the command to include the --minimum-interval-median-percentile parameter."
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": null,
352 | "metadata": {},
353 | "outputs": [],
354 | "source": [
355 | "! gatk CreateReadCountPanelOfNormals \\\n",
356 | " -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.counts.hdf5 \\\n",
357 | " --minimum-interval-median-percentile 5.0 \\\n",
358 | " -O /home/jupyter-user/3-somatic-cna/sandbox/normal.pon.hdf5"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "➤ Which do you think will perform better in revealing copy number events in the tumor, the 40-sample PoN or the matched-normal? Why?\n",
366 | " \n",
367 | "If you are curious to see for yourself how the matched-normal PoN pans out, it is possible to substitute it into the remaining steps. Instructions continue with the 40-sample PoN."
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "---\n",
375 | "## Remove noise from sample coverage using the PoN"
376 | ]
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
382 | "We use DenoiseReadCounts and the PoN to standardize and then denoise sample read counts. The two resulting files capture these steps, respectively. In the single-sample-PoN case, the two results will be identical to each other, as the tool only performs standardization. \n",
383 | "\n",
384 | "**A. Denoise Read Counts**"
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": null,
390 | "metadata": {},
391 | "outputs": [],
392 | "source": [
393 | "! gatk --java-options \"-Xmx7g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true\" DenoiseReadCounts \\\n",
394 | " -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.counts.hdf5 \\\n",
395 | " --count-panel-of-normals /home/jupyter-user/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \\\n",
396 | " --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.standardizedCR.tsv \\\n",
397 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "! gatk --java-options \"-Xmx7g\" DenoiseReadCounts \\\n",
407 | " -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.counts.hdf5 \\\n",
408 | " --count-panel-of-normals /home/jupyter-user/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \\\n",
409 | " --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.standardizedCR.tsv \\\n",
410 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv"
411 | ]
412 | },
413 | {
414 | "cell_type": "markdown",
415 | "metadata": {},
416 | "source": [
417 | "➤ Skim the stdout to get a sense of the data transformations during standardization vs. denoising. \n",
418 | "\n",
419 | "The tool uses the maximum number of eigensamples available in the PoN. Changing the `--number-of-eigensamples` in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. Using a larger number of principal components will result in a higher level of denoising and a larger difference in the MADs. The level of denoising should be chosen with some care, as it will ultimately affect the sensitivity of the analysis.\n",
420 | "\n",
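"For example, a hedged re-run with fewer eigensamples (assuming the PoN holds at least five; the output names here are placeholders):\n",
"```\n",
"! gatk --java-options \"-Xmx7g\" DenoiseReadCounts \\\n",
"    -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.counts.hdf5 \\\n",
"    --count-panel-of-normals /home/jupyter-user/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \\\n",
"    --number-of-eigensamples 5 \\\n",
"    --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_5eigen.standardizedCR.tsv \\\n",
"    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_5eigen.denoisedCR.tsv\n",
"```\n",
"\n",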
421 | "**B. Plot Denoised Copy Ratios** \n",
422 | "Let's take a look at the data in its current state."
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {},
429 | "outputs": [],
430 | "source": [
431 | "! gatk PlotDenoisedCopyRatios \\\n",
432 | " --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.standardizedCR.tsv \\\n",
433 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \\\n",
434 | " --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \\\n",
435 | " --minimum-contig-length 46709983 \\\n",
436 | " --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \\\n",
437 | " --output-prefix hcc1143_T_clean"
438 | ]
439 | },
440 | {
441 | "cell_type": "markdown",
442 | "metadata": {},
443 | "source": [
444 | "View the plot generated from the previous command below. Remove the quotes around `!` to view the plot in this notebook. \n",
445 | "\"!\"[hcc1143_T_clean.denoised.png](sandbox/cna_plots/hcc1143_T_clean.denoised.png)"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": null,
451 | "metadata": {},
452 | "outputs": [],
453 | "source": [
454 | "! gatk PlotDenoisedCopyRatios \\\n",
455 | " --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.standardizedCR.tsv \\\n",
456 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \\\n",
457 | " --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \\\n",
458 | " --minimum-contig-length 46709983 \\\n",
459 | " --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \\\n",
460 | " --output-prefix hcc1143_N_clean"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "View the plot generated from the previous command below. Remove the quotes around `!` to view the plot in this notebook. \n",
468 | "\"!\"[hcc1143_N_clean.denoised.png](sandbox/cna_plots/hcc1143_N_clean.denoised.png)"
469 | ]
470 | },
471 | {
472 | "cell_type": "markdown",
473 | "metadata": {},
474 | "source": [
475 | "➤ Skim the stdout to get a sense of the data transformations during standardization and denoising. \n",
476 | "\n",
477 | "Each command produces two sets of data: plots and QC values. \n",
478 | "- In the plots, standardized copy ratios are shown in blue. Standardization involves median-centering and log-transformation. Denoised copy ratios are in green. Denoising is performed using the principal components of the PoN. \n",
479 | "- The QC values pertain to the median-absolute-deviation (MAD) in different contexts, including the change between standardized and denoised (.deltaMAD.txt) and the change between the two scaled by the standardized MAD (.deltaScaledMAD.txt).\n",
480 | "\n",
481 | "| . | . |\n",
482 | "| --- | --- |\n",
483 | "| *(hcc1143_T denoised plot omitted)* | *(hcc1143_N denoised plot omitted)* |\n",
484 | "\n",
485 | "---\n"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "## Perform segmentation based on coverage alone"
493 | ]
494 | },
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
499 | "At the heart of the GATK4 CNA workflow is ModelSegments, a tool that groups contiguous copy ratios into segments. Either or both copy ratios and allelic copy ratios inform segmentation. So far, the tutorial has focused only on coverage data. So let's see what segmentation with coverage alone looks like.\n",
500 | "\n",
501 | "**A. Model segments on coverage alone** "
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": null,
507 | "metadata": {},
508 | "outputs": [],
509 | "source": [
510 | "! gatk --java-options \"-Xmx7g\" ModelSegments \\\n",
511 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \\\n",
512 | " --output /home/jupyter-user/3-somatic-cna/sandbox \\\n",
513 | " --output-prefix hcc1143_T_clean"
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "metadata": {},
520 | "outputs": [],
521 | "source": [
522 | "! gatk --java-options \"-Xmx7g\" ModelSegments \\\n",
523 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \\\n",
524 | " --output /home/jupyter-user/3-somatic-cna/sandbox \\\n",
525 | " --output-prefix hcc1143_N_clean"
526 | ]
527 | },
528 | {
529 | "cell_type": "markdown",
530 | "metadata": {},
531 | "source": [
532 | "Each command produces nine files.\n",
533 | "\n",
534 | "Under the hood, a Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool it replaces, PerformSegmentation, which used a CBS (circular binary-segmentation) algorithm. ModelSegments' kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. The tool (i) performs multidimensional kernel segmentation and (ii) iteratively performs Markov-Chain Monte Carlo (MCMC) sampling and segment smoothing. \n",
535 | "\n",
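"To build intuition, here is a toy squared-error binary segmentation in Python (a sketch only; the actual tool uses a Gaussian-kernel cost, works jointly across dimensions, and smooths segments with MCMC):\n",
"```\n",
"import numpy as np\n",
"\n",
"def binseg(x, min_gain=5.0):\n",
"    # Recursively split wherever the reduction in total squared\n",
"    # error from a two-segment fit exceeds min_gain.\n",
"    def split(lo, hi, cuts):\n",
"        seg = x[lo:hi]\n",
"        total = ((seg - seg.mean()) ** 2).sum()\n",
"        best_i, best_gain = None, min_gain\n",
"        for i in range(lo + 2, hi - 1):\n",
"            l, r = x[lo:i], x[i:hi]\n",
"            gain = total - ((l - l.mean()) ** 2).sum() - ((r - r.mean()) ** 2).sum()\n",
"            if gain > best_gain:\n",
"                best_i, best_gain = i, gain\n",
"        if best_i is not None:\n",
"            split(lo, best_i, cuts)\n",
"            cuts.append(best_i)\n",
"            split(best_i, hi, cuts)\n",
"        return cuts\n",
"    return split(0, len(x), [])\n",
"\n",
"# simulated copy-ratio track: baseline, a gain, then baseline again\n",
"x = np.r_[np.random.normal(1.0, 0.1, 100),\n",
"          np.random.normal(1.5, 0.1, 50),\n",
"          np.random.normal(1.0, 0.1, 100)]\n",
"print(binseg(x))  # changepoints near 100 and 150\n",
"```\n",
"\n",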
536 | "**B. Plot Modeled Segments**\n",
537 | "Let's see what modeling segments on coverage alone looks like. Here we provide a second plotting tool, PlotModeledSegments, with the denoised copy ratios (from DenoiseReadCounts), the segments (from ModelSegments), and the reference sequence dictionary. "
538 | ]
539 | },
540 | {
541 | "cell_type": "code",
542 | "execution_count": null,
543 | "metadata": {},
544 | "outputs": [],
545 | "source": [
546 | "! gatk PlotModeledSegments \\\n",
547 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \\\n",
548 | " --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.modelFinal.seg \\\n",
549 | " --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \\\n",
550 | " --minimum-contig-length 46709983 \\\n",
551 | " --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \\\n",
552 | " --output-prefix hcc1143_T_clean"
553 | ]
554 | },
555 | {
556 | "cell_type": "markdown",
557 | "metadata": {},
558 | "source": [
559 | "View the plot generated from the previous command below. Remove the quotes around `!` to view the plot in this notebook. \n",
560 | "\"!\"[hcc1143_T_clean.modeled.png](sandbox/cna_plots/hcc1143_T_clean.modeled.png)"
561 | ]
562 | },
563 | {
564 | "cell_type": "code",
565 | "execution_count": null,
566 | "metadata": {},
567 | "outputs": [],
568 | "source": [
569 | "! gatk PlotModeledSegments \\\n",
570 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \\\n",
571 | " --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.modelFinal.seg \\\n",
572 | " --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \\\n",
573 | " --minimum-contig-length 46709983 \\\n",
574 | " --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \\\n",
575 | " --output-prefix hcc1143_N_clean\n"
576 | ]
577 | },
578 | {
579 | "cell_type": "markdown",
580 | "metadata": {},
581 | "source": [
582 | "View the plot generated from the previous command below. Remove the quotes around `!` to view the plot in this notebook. \n",
583 | "\"!\"[hcc1143_N_clean.modeled.png](sandbox/cna_plots/hcc1143_N_clean.modeled.png)"
584 | ]
585 | },
586 | {
587 | "cell_type": "markdown",
588 | "metadata": {},
589 | "source": [
590 | "The command produces a plot with extension .modeled.png, where denoised copy ratios in alternating segments are colored blue and orange and segment medians are drawn in black. For noisy data, box plots of the available posteriors for each segment become visible. \n",
591 | "\n",
592 | "*(tumor modeled-segments plot omitted)*\n",
593 | "*(normal modeled-segments plot omitted)*\n",
594 | "\n",
595 | "➤ The tumor sample shows a lot of activity. Specifically, it has 235 segments. Is this surprising?\n",
596 | "➤ Focus on chr2 of the normal sample. How do you interpret its copy ratio of ~1.3? How about the ~0.9 copy ratio of chr6? \n",
597 | " \n",
598 | "At a glance, segments appear to separate into roughly evenly spaced ratios, which represent absolute copy numbers, e.g. 1, 2, 3 and so on. Segments that fall between these likely represent subclonal populations. \n",
599 | "\n",
600 | "**C. (Optional) Call Copy Ratio Segments**\n",
601 | "If you need callsets with amplifications (+), deletions (-), and neutral segments (0) clearly marked, then CallCopyRatioSegments can do this for you. These designations are appended as a new column to the segmented copy-ratio .cr.seg file from ModelSegments. As of July 2018, this part of the workflow is still under active development.\n",
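"\n",
"After running the cell below, a quick way to see the appended CALL column (skipping the file's @-prefixed SAM-style header):\n",
"```\n",
"! grep -v \"^@\" /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.called.seg | head -5\n",
"```"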
602 | ]
603 | },
604 | {
605 | "cell_type": "code",
606 | "execution_count": null,
607 | "metadata": {},
608 | "outputs": [],
609 | "source": [
610 | "! gatk CallCopyRatioSegments \\\n",
611 | " -I /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.cr.seg \\\n",
612 | " -O /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.called.seg"
613 | ]
614 | },
615 | {
616 | "cell_type": "markdown",
617 | "metadata": {},
618 | "source": [
619 | "---\n",
620 | "# INCORPORATE ALLELIC DATA: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA ACNA"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "## Perform segmentation jointly with coverage and allelic data\n",
628 | "We just saw what segmentation with coverage data alone looks like. But we can squeeze more juice out of the lemon! In this section, we will model segments using both allelic counts and coverage data for a matched-control case. \n",
629 | "\n",
630 | "➤ How are allelic counts useful in detecting copy alterations?\n",
631 | "\n",
632 | "Consider how, in normal germline sequencing, we decide whether a site's genotype is heterozygous or homozygous. For a heterozygous site, which presents two alleles in a diploid sample, our confidence that the sample carries at least two copies of that chromosome is high. A hundred adjacent heterozygous sites become strong evidence for the multi-copy state of the genomic interval. \n",
633 | "\n",
634 | "We can extend this concept further towards detection of a type of zygosity that has implications for cancer. We can take allele counts for sites that are commonly variant in the population. For sites where the normal control is heterozygous, if the tumor sample is homozygous, then we can deduce the tumor underwent loss of heterozygosity (LOH) for the allele. With a string of adjacent LOH sites, we can be confident of an LOH segment. Here, either the tumor simply lost the chromosome segment or underwent a slightly more complicated event called copy-neutral LOH (cnLOH). Coverage data can offer clues towards deducing which type of loss is likely. \n",
635 | "\n",
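"As a concrete, made-up illustration: at a site where the normal shows REF 25 / ALT 23 (clearly heterozygous), a tumor showing REF 50 / ALT 1 has an alternate-allele fraction near zero, i.e. LOH. If the denoised copy ratio over that segment is ~0.5, simple loss of the ALT-bearing homolog is likely; if the copy ratio stays at ~1.0, copy-neutral LOH is the better explanation.\n",
"\n",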
636 | "Note it is possible to use allelic counts alone with ModelSegments. Furthermore, the tool will model segments for either a matched case or for a case sample alone. The latter can be useful in revealing clonal subpopulations. \n",
637 | "\n",
638 | "**A. Collect allelic counts from pileups (chr17 data)** \n",
639 | "CollectAllelicCounts tabulates counts of the reference allele and counts of the dominant alternate allele for sites in a given genomic intervals list. The tool filters out reads with MAPQ below 30 and discounts bases with base quality less than 20.\n",
640 | " \n",
641 | "We perform this step on the chr17 subset data. In later steps, we will use precomputed results from the full data. Here, theta_snps_paddedC_chr17.vcf.gz contains lifted-over gnomAD SNPs-only sites subset to the padded target regions from section 1. "
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": null,
647 | "metadata": {},
648 | "outputs": [],
649 | "source": [
650 | "! gatk CollectAllelicCounts \\\n",
651 | " -L gs://gatk-tutorials/workshop_1903/3-somatic/resources/theta_snps_paddedC_chr17.vcf.gz \\\n",
652 | " -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/tumor.bam \\\n",
653 | " -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \\\n",
654 | " -O /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.allelicCounts.tsv"
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": null,
660 | "metadata": {
661 | "scrolled": false
662 | },
663 | "outputs": [],
664 | "source": [
665 | "! gatk CollectAllelicCounts \\\n",
666 | " -L gs://gatk-tutorials/workshop_1903/3-somatic/resources/theta_snps_paddedC_chr17.vcf.gz \\\n",
667 | " -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/normal.bam \\\n",
668 | " -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \\\n",
669 | " -O /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.allelicCounts.tsv"
670 | ]
671 | },
672 | {
673 | "cell_type": "markdown",
674 | "metadata": {},
675 | "source": [
676 | "The resulting tables list the REF and ALT read counts, as well as the REF and ALT alleles, for every site provided in the intervals list.\n",
677 | "\n",
678 | "➤ For sites lacking ALT allele counts, what is in the field for ALT_NUCLEOTIDE?\n",
679 | "\n",
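"To answer this yourself, skim a few data rows past the file's @-prefixed SAM-style header, e.g.:\n",
"```\n",
"! grep -v \"^@\" /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.allelicCounts.tsv | head -5\n",
"```\n",
"\n",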
680 | "**B. Model segments jointly on coverage and allelic data (full data)** \n",
681 | "In this step, the full spectrum of data converges. We provide precomputed allelic counts from the cna_inputs folder and the tumor's denoised read counts. Here we use default parameters. Adjusting tool parameters can change the resolution and smoothness of the segmentation results, and we recommend researchers tune the parameters for their data."
682 | ]
683 | },
684 | {
685 | "cell_type": "code",
686 | "execution_count": null,
687 | "metadata": {},
688 | "outputs": [],
689 | "source": [
690 | "! gatk --java-options \"-Xmx7g\" ModelSegments \\\n",
691 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \\\n",
692 | " --allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.allelicCounts.tsv \\\n",
693 | " --normal-allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.allelicCounts.tsv \\\n",
694 | " --output /home/jupyter-user/3-somatic-cna/sandbox \\\n",
695 | " --output-prefix hcc1143_TN_clean"
696 | ]
697 | },
698 | {
699 | "cell_type": "markdown",
700 | "metadata": {},
701 | "source": [
702 | "➤ Skim the stdout to get a sense of the preprocessing and analysis. The tool filters sites with total allelic counts less than how many? How many control heterozygous sites does the tool retain? Does the tool then use all of these towards the joint analysis? \n",
703 | "➤ How many segments does the MultidimensionalKernelSegmenter initially find? After smoothing, how many final segments are there? \n",
704 | "\n",
705 | "The step produces eleven files. See the ModelSegments tool documentation for details. Of note, we have two files with .hets. in the extension, .hets.normal.tsv and .hets.tsv. The former contains the normal control's heterozygous sites. The latter contains the tumor's allele counts for the normal's heterozygous sites. Finally, the .modelFinal.seg file contains the segmentation results.\n",
706 | "\n",
707 | "**C. Plot coverage copy ratios and alternate allele fractions** \n",
708 | "We provide PlotModeledSegments the case sample's denoised copy ratios, .hets allele counts, and final segmentation results. "
709 | ]
710 | },
711 | {
712 | "cell_type": "code",
713 | "execution_count": null,
714 | "metadata": {},
715 | "outputs": [],
716 | "source": [
717 | "! gatk PlotModeledSegments \\\n",
718 | " --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \\\n",
719 | " --allelic-counts /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_clean.hets.tsv \\\n",
720 | " --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_clean.modelFinal.seg \\\n",
721 | " --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \\\n",
722 | " --minimum-contig-length 46709983 \\\n",
723 | " --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \\\n",
724 | " --output-prefix hcc1143_TN_clean"
725 | ]
726 | },
727 | {
728 | "cell_type": "markdown",
729 | "metadata": {},
730 | "source": [
731 | "View the plot generated from the previous command below. Remove the quotes around `!` to view the plot in this notebook. \n",
732 | "\"!\"[hcc1143_TN_clean.modeled.png](sandbox/cna_plots/hcc1143_TN_clean.modeled.png)"
733 | ]
734 | },
735 | {
736 | "cell_type": "markdown",
737 | "metadata": {},
738 | "source": [
739 | "This produces a file with two plots, each with 398 segments. The top plot shows segmented copy ratios and the bottom plot shows segmented alternate-allele fractions. Box plots for the major and minor allele fractions mark the 10th, 50th and 90th percentile credible intervals. Vertical streaks appear for very short segments as fewer supporting data points make estimates more uncertain.\n",
740 | "*(joint modeled-segments plots omitted)*\n",
741 | "➤ What do the allelic segments at 0 and 1 indicate? For example, at chr4, chr5 and chr17?\n",
742 | "\n",
743 | "---"
744 | ]
745 | },
746 | {
747 | "cell_type": "markdown",
748 | "metadata": {},
749 | "source": [
750 | "## Perform segmentation with allelic data alone\n",
751 | "Perform one final comparison. Run ModelSegments and PlotModeledSegments for the matched-case using allelic data alone. \n"
752 | ]
753 | },
754 | {
755 | "cell_type": "code",
756 | "execution_count": null,
757 | "metadata": {},
758 | "outputs": [],
759 | "source": [
760 | "! gatk --java-options \"-Xmx7g\" ModelSegments \\\n",
761 | " --allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.allelicCounts.tsv \\\n",
762 | " --normal-allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.allelicCounts.tsv \\\n",
763 | " --output /home/jupyter-user/3-somatic-cna/sandbox \\\n",
764 | " --output-prefix hcc1143_TN_allelic"
765 | ]
766 | },
767 | {
768 | "cell_type": "code",
769 | "execution_count": null,
770 | "metadata": {},
771 | "outputs": [],
772 | "source": [
773 | "! gatk PlotModeledSegments \\\n",
774 | " --allelic-counts /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_allelic.hets.tsv \\\n",
775 | " --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_allelic.modelFinal.seg \\\n",
776 | " --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \\\n",
777 | " --minimum-contig-length 46709983 \\\n",
778 | " --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \\\n",
779 | " --output-prefix hcc1143_TN_allelic"
780 | ]
781 | },
782 | {
783 | "cell_type": "markdown",
784 | "metadata": {},
785 | "source": [
786 | "View the plot generated from the previous command below. Remove the quotes around `!` to view the plot in this notebook. \n",
787 | "\"!\"[hcc1143_TN_allelic.modeled.png](sandbox/cna_plots/hcc1143_TN_allelic.modeled.png)"
788 | ]
789 | },
790 | {
791 | "cell_type": "markdown",
792 | "metadata": {},
793 | "source": [
794 | "This produces an allelic ratios plot with 105 segments. \n",
795 | "*(allelic-only modeled-segments plot omitted)*\n",
796 | "➤ This is ~4x fewer segments than the 398 from the joint CR + allelic analysis, and ~2x fewer than the 235 segments from copy ratios alone. How do you explain such differences? \n",
797 | "\n",
798 | "Remember that joint calling groups contiguous segments with the same copy ratio and the same minor allele fraction, giving high-resolution results. Finally, remember that the CNA workflow produces copy ratios, not copy numbers. GATK is developing a tool to call absolute somatic copy numbers. For germline absolute copy number detection, see GATK4's GermlineCNVCaller."
799 | ]
800 | }
801 | ],
802 | "metadata": {
803 | "kernelspec": {
804 | "display_name": "Python 3",
805 | "language": "python",
806 | "name": "python3"
807 | },
808 | "language_info": {
809 | "codemirror_mode": {
810 | "name": "ipython",
811 | "version": 3
812 | },
813 | "file_extension": ".py",
814 | "mimetype": "text/x-python",
815 | "name": "python",
816 | "nbconvert_exporter": "python",
817 | "pygments_lexer": "ipython3",
818 | "version": "3.6.8"
819 | },
820 | "toc": {
821 | "base_numbering": 1,
822 | "nav_menu": {
823 | "height": "321px",
824 | "width": "622px"
825 | },
826 | "number_sections": true,
827 | "sideBar": true,
828 | "skip_h1_title": false,
829 | "title_cell": "Table of Contents",
830 | "title_sidebar": "Contents",
831 | "toc_cell": false,
832 | "toc_position": {},
833 | "toc_section_display": true,
834 | "toc_window_display": true
835 | }
836 | },
837 | "nbformat": 4,
838 | "nbformat_minor": 2
839 | }
840 |
--------------------------------------------------------------------------------