├── presentations
│   └── Alignment, Variant Calling, and Filtering (WGC NGS Bioinformatics).pdf
├── LICENSE
└── README.md

/presentations/Alignment, Variant Calling, and Filtering (WGC NGS Bioinformatics).pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ekg/alignment-and-variant-calling-tutorial/HEAD/presentations/Alignment, Variant Calling, and Filtering (WGC NGS Bioinformatics).pdf
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2015 Erik Garrison

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# NGS alignment and variant calling

This tutorial steps through some basic tasks in alignment and variant calling using a handful of Illumina sequencing data sets. For theoretical background, please refer to the included [presentation on alignment and variant calling](https://docs.google.com/presentation/d/1t921ccF66N0_oyn09gbM0w8nzADzWF20rfZkeMv3Sy8/edit?usp=sharing), or the [included PDF from a previous year](https://github.com/ekg/alignment-and-variant-calling-tutorial/raw/master/presentations/Alignment%2C%20Variant%20Calling%2C%20and%20Filtering%20(WGC%20NGS%20Bioinformatics).pdf).

## Part 0: Setup

We're going to use a bunch of fun tools for working with genomic data:

1. [bwa](https://github.com/lh3/bwa)
2. [samtools](https://github.com/samtools/samtools)
3. [htslib](https://github.com/samtools/htslib)
4. [vt](https://github.com/atks/vt)
5. [freebayes](https://github.com/ekg/freebayes)
6. [vcflib](https://github.com/ekg/vcflib/)
7. [sambamba](https://github.com/lomereiter/sambamba)
8. [seqtk](https://github.com/lh3/seqtk)
9. [mutatrix](https://github.com/ekg/mutatrix)
10. [sra-tools](https://github.com/ncbi/sra-tools/wiki/HowTo:-Binary-Installation)
11. [glia](https://github.com/ekg/glia.git)
12. [hhga](https://github.com/ekg/hhga)
13. [vg](https://github.com/vgteam/vg)
14. [vw](https://github.com/JohnLangford/vowpal_wabbit/wiki/Download)

In most cases, you can download and build these using this kind of pattern:

```bash
git clone https://github.com/lh3/bwa
cd bwa && make
```

In the case of several packages (vcflib, sambamba, freebayes, glia, hhga, and vg), submodules are used to manage the project's dependencies, so the whole source tree must be cloned by passing the `--recursive` flag to git. For example, here is how we'd clone and build freebayes:

```bash
git clone --recursive https://github.com/ekg/freebayes
cd freebayes && make
```

In some cases you can download precompiled binaries. For instance, you can head to the [sambamba releases page](https://github.com/lomereiter/sambamba/releases) to find binaries that should work on any modern Linux distribution or macOS. The same applies to the [sra-toolkit, which is probably easier to install from available binaries](https://github.com/ncbi/sra-tools/wiki/HowTo:-Binary-Installation).

Otherwise, let's assume you're in an environment where you've already got them available.
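Before moving on, it's worth confirming that the tools you plan to use actually resolve on your `PATH`. Here's a minimal sketch of such a check (adjust the list to the tools you intend to use; the names below are the binaries each project builds by default):

```bash
# report any tool that doesn't resolve to an executable on the PATH
for tool in bwa samtools bgzip tabix vt freebayes vcffilter sambamba seqtk \
            fastq-dump hhga vg vw; do
    command -v "$tool" >/dev/null || echo "missing: $tool"
done
```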
## Part 1: Aligning E. coli data with `bwa mem`

[E. coli K12](https://en.wikipedia.org/wiki/Escherichia_coli#Model_organism) is a common laboratory strain that has lost its ability to live in the human intestine, but is ideal for manipulation in a controlled setting.
The genome is relatively short, and so it's a good place to start learning about alignment and variant calling.

### E. coli K12 reference

We'll get some test data to play with. First, [the E. coli K12 reference](http://www.ncbi.nlm.nih.gov/nuccore/556503834), from NCBI. It's a bit of a pain to pull out of the web interface, so [you can also download it here](http://hypervolu.me/~erik/genomes/E.coli_K12_MG1655.fa).

```bash
# the start of the genome, which is circular but must be represented linearly in FASTA
curl -s http://hypervolu.me/%7Eerik/genomes/E.coli_K12_MG1655.fa | head
# >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
# AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
# TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
# ...
# now download and save the genome
curl -s http://hypervolu.me/%7Eerik/genomes/E.coli_K12_MG1655.fa > E.coli_K12_MG1655.fa
```

### E. coli K12 Illumina 2x300bp MiSeq sequencing results

For testing alignment, let's get some data from a [recently-submitted sequencing run on a K12 strain from the University of Exeter](http://www.ncbi.nlm.nih.gov/sra/?term=SRR1770413). We can use [sra-tools](https://github.com/ncbi/sra-tools) to pull the sequence data (in paired FASTQ format) directly from the archive:

```bash
fastq-dump --split-files SRR1770413
```

`fastq-dump` is part of sra-tools. It downloads data directly given a particular sequencing run ID. SRA stores data in its own compressed format (SRA!) that isn't directly compatible with any downstream tools, so it's necessary to convert things into [FASTQ](https://en.wikipedia.org/wiki/FASTQ_format) for further processing. The `--split-files` part of the command ensures we get two files, one for the first and one for the second mate in each pair. We'll use them in this format when aligning.
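Before going further, it can be reassuring to check that the download looks sane. FASTQ stores one record per four lines, so dividing the line count by four gives the read count, and the two mate files should agree:

```bash
# peek at the first record and count the reads in each mate file
head -4 SRR1770413_1.fastq
echo "$(( $(wc -l < SRR1770413_1.fastq) / 4 )) reads in mate 1"
echo "$(( $(wc -l < SRR1770413_2.fastq) / 4 )) reads in mate 2"
```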
```bash
# alternatively, you may want to first download the run, and then dump it
# but this seems to fail sometimes for me
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR177/SRR1770413/SRR1770413.sra
fastq-dump --split-files SRR1770413.sra
```

These appear to be paired 300bp reads from a modern MiSeq.

### E. coli O104:H4 HiSeq 2000 2x100bp

As a point of comparison, let's also pick up a [sequencing data set from a different E. coli strain](http://www.ncbi.nlm.nih.gov/sra/SRX095630[accn]). This one is [famous for its role in a foodborne illness outbreak](https://en.wikipedia.org/wiki/Escherichia_coli_O104%3AH4#Infection) and is of medical interest.

```bash
fastq-dump --split-files SRR341549
```

### Setting up our reference indexes

#### FASTA file index

First, we'll want to allow tools (such as our variant caller) to quickly access certain regions in the reference. This is done using the samtools `.fai` FASTA index format, which records the lengths of the various sequences in the reference and their offsets from the beginning of the file.

```bash
samtools faidx E.coli_K12_MG1655.fa
```

Now it's possible to quickly obtain any part of the E. coli K12 reference sequence. For instance, we can get the ~200bp region from position 1000000 to 1000200. We'll use a special format to describe the target region: `[chr]:[start]-[end]`.

```bash
samtools faidx E.coli_K12_MG1655.fa NC_000913.3:1000000-1000200
```

We get back a small FASTA-format file describing the region:

```text
>NC_000913.3:1000000-1000200
GTGTCAGCTTTCGTGGTGTGCAGCTGGCGTCAGATGACAACATGCTGCCAGACAGCCTGA
AAGGGTTTGCGCCTGTGGTGCGTGGTATCGCCAAAAGCAATGCCCAGATAACGATTAAGC
AAAATGGTTACACCATTTACCAAACTTATGTATCGCCTGGTGCTTTTGAAATTAGTGATC
TCTATTCCACGTCGTCGAGCG
```

#### BWA's FM-index

BWA uses the [FM-index](https://en.wikipedia.org/wiki/FM-index), which is a compressed full-text substring index based on the [Burrows-Wheeler transform](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform).
To use this index, we first need to build it:

```bash
bwa index E.coli_K12_MG1655.fa
```

You should see `bwa` generate some information about the build process:

```text
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 2.26 seconds elapse.
[bwa_index] Update BWT... 0.04 sec
[bwa_index] Pack forward-only FASTA... 0.03 sec
[bwa_index] Construct SA from BWT and Occ... 0.72 sec
[main] Version: 0.7.8-r455
[main] CMD: bwa index E.coli_K12_MG1655.fa
[main] Real time: 3.204 sec; CPU: 3.121 sec
```

And, you should notice several new index files which have been made using the FASTA file name as a prefix:

```bash
ls -rt1 E.coli_K12_MG1655.fa*
# -->
E.coli_K12_MG1655.fa
E.coli_K12_MG1655.fa.fai
E.coli_K12_MG1655.fa.bwt
E.coli_K12_MG1655.fa.pac
E.coli_K12_MG1655.fa.ann
E.coli_K12_MG1655.fa.amb
E.coli_K12_MG1655.fa.sa
```

### Aligning our data against the E. coli K12 reference

Here's an outline of the steps we'll follow to align our K12 strain against the K12 reference:

1. use bwa to generate SAM records for each read
2. convert the output to BAM
3. sort the output
4. mark PCR duplicates that result from exact duplication of a template during amplification
We could do the steps one-by-one, generating an intermediate file for each step.
However, this isn't really necessary unless we want to debug the process, and it will make a lot of excess files which will do nothing but confuse us when we come to work with the data later.
Thankfully, it's easy to use [unix pipes](https://en.wikipedia.org/wiki/Pipeline_%28Unix%29) to stream most of these tools together (see this [nice thread about piping bwa and samtools together on Biostars](https://www.biostars.org/p/43677/) for a discussion of the benefits and possible drawbacks of this).

You can now run the alignment using a piped approach. _Replace `$threads` with the number of CPUs you would like to use for alignment._ Not all steps in `bwa` run in parallel, but the alignment, which is the most time-consuming step, does. You'll need to set this according to the resources you have available.

```bash
bwa mem -t $threads -R '@RG\tID:K12\tSM:K12' \
    E.coli_K12_MG1655.fa SRR1770413_1.fastq SRR1770413_2.fastq \
    | samtools view -b - >SRR1770413.raw.bam
sambamba sort SRR1770413.raw.bam
sambamba markdup SRR1770413.raw.sorted.bam SRR1770413.bam
```

Breaking it down by line:

- *alignment with bwa*: `bwa mem -t $threads -R '@RG\tID:K12\tSM:K12'` --- this says "align using so many threads" and also "give the reads the read group K12 and the sample name K12"
- *reference and FASTQs*: `E.coli_K12_MG1655.fa SRR1770413_1.fastq SRR1770413_2.fastq` --- this just specifies the base reference file name (`bwa` finds the indexes using this) and the input read files. The first file should contain the first mate, the second file the second mate.
- *conversion to BAM*: `samtools view -b -` --- this reads SAM from stdin (the `-` specifier in place of the file name indicates this) and converts to BAM.
- *sorting the BAM file*: `sambamba sort SRR1770413.raw.bam` --- sort the BAM file, writing it to `.sorted.bam`.
- *marking PCR duplicates*: `sambamba markdup SRR1770413.raw.sorted.bam SRR1770413.bam` --- this marks reads which appear to be redundant PCR duplicates based on their read mapping position. It [uses the same criteria for marking duplicates as picard](http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html).

Now, run the same alignment process for the O104:H4 strain's data. Make sure to use a different read group and sample name via the `-R '@RG...'` incantation, which records the identity of the data both in the BAM file header and in the alignment records themselves:

```bash
bwa mem -t $threads -R '@RG\tID:O104_H4\tSM:O104_H4' \
    E.coli_K12_MG1655.fa SRR341549_1.fastq SRR341549_2.fastq \
    | samtools view -b - >SRR341549.raw.bam
sambamba sort SRR341549.raw.bam
sambamba markdup SRR341549.raw.sorted.bam SRR341549.bam
```

As a standard post-processing step, it's helpful to add a BAM index to the files. This lets us jump around in them quickly using BAM-compatible tools that can read the index. `sambamba` does this for us by default, but if it hadn't, or if we had used a different process to generate the BAM files, we could use samtools to generate exactly the same index:

```bash
samtools index SRR1770413.bam
samtools index SRR341549.bam
```
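With sorted, duplicate-marked, and indexed BAMs in hand, a quick sanity check is worthwhile before variant calling. For instance, `samtools flagstat` summarizes mapping rates, properly-paired fractions, and duplicate counts:

```bash
samtools flagstat SRR1770413.bam
samtools flagstat SRR341549.bam
```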
### Using minimap2

It's easy to use [minimap2](https://github.com/lh3/minimap2) instead of `bwa mem`. This may help in some contexts, as it can be several fold faster with minimal reduction in alignment quality. In the case of these short reads, we'd use it as follows. The only major change from bwa mem is that we tell it we're working with short-read data using `-ax sr`:

```bash
minimap2 -ax sr -t $threads -R '@RG\tID:O104_H4\tSM:O104_H4' \
    E.coli_K12_MG1655.fa SRR341549_1.fastq SRR341549_2.fastq \
    | samtools view -b - >SRR341549.raw.minimap2.bam
sambamba sort SRR341549.raw.minimap2.bam
sambamba markdup SRR341549.raw.sorted.minimap2.bam SRR341549.minimap2.bam
```

## Part 2: Calling variants

Now that we have our alignments sorted, we can quickly determine variation against the reference by scanning through them using a variant caller.
There are many options, including [samtools mpileup](http://samtools.sourceforge.net/samtools.shtml), [platypus](http://www.well.ox.ac.uk/platypus), and the [GATK](https://www.broadinstitute.org/gatk/).

For this tutorial, we'll keep things simple and use [freebayes](https://github.com/ekg/freebayes). It has a number of advantages in this context (bacterial genomes), such as long-term support for haploid (and polyploid) genomes. However, the best reason to use it is that it's very easy to set up and run, and it produces a very well-annotated VCF output that is suitable for immediate downstream filtering.

### Variant calls with `freebayes`

It's quite easy to use `freebayes` provided you have a sorted, indexed BAM file in hand. We use `--ploidy 1` to indicate that the sample should be genotyped as haploid.

```bash
freebayes -f E.coli_K12_MG1655.fa --ploidy 1 SRR1770413.bam >SRR1770413.vcf
```

### Joint calling

We can put the samples together if we want to find differences between them. Calling them jointly can help if we have a population of samples to use to help remove calls from paralogous regions. The Bayesian model in freebayes combines the data likelihoods from sequencing data with an estimate of the probability of observing a given set of genotypes under assumptions of neutral evolution and a [panmictic](https://en.wikipedia.org/wiki/Panmixia) population. For instance, [it would be very unusual to find a locus at which all the samples are heterozygous](https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle). Joint calling also helps improve statistics about observational biases (like strand bias, read placement bias, and allele balance in heterozygotes) by bringing more data into the algorithm.

However, in this context, we only have two samples, and the best reason to call them jointly is to make sure we have a genotype for each one at every locus where a non-reference allele passes the caller's thresholds in either sample.

We run a joint call by dropping both BAMs on the freebayes command line:

```bash
freebayes -f E.coli_K12_MG1655.fa --ploidy 1 SRR1770413.bam SRR341549.bam >e_colis.vcf
```

As long as we added the read group (@RG) flags when we aligned (or added them afterward with [bamaddrg](https://github.com/ekg/bamaddrg)), that's all we need to do to run the joint calling. (NB: due to the amount of data in SRR341549, this should take about 20 minutes.)
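As a quick check on the joint call, we can count the records in each output (lines beginning with `#` belong to the header):

```bash
# number of variant records in the single-sample and joint call sets
grep -vc '^#' SRR1770413.vcf
grep -vc '^#' e_colis.vcf
```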
### `bgzip` and `tabix`

We can speed up random access to VCF files by compressing them with `bgzip`, from the [htslib](https://github.com/samtools/htslib) package.
`bgzip` is a "block-based GZIP", which compresses files in chunks of lines. This chunking lets us quickly seek to a particular part of the file, and supports indexing. The default index tool to use is `tabix`, which writes an index next to the file with the suffix `.tbi`.

```bash
bgzip SRR1770413.vcf # makes SRR1770413.vcf.gz
tabix -p vcf SRR1770413.vcf.gz
```

Now you can pick out a single part of the file. For instance, we could count the variants in a particular region:

```bash
tabix SRR1770413.vcf.gz NC_000913.3:1000000-1500000 | wc -l
```

If we want to pipe the output into a tool that reads VCF, we'll need to add the `-h` flag, to output the header as well.

```bash
# tabix -h SRR1770413.vcf.gz NC_000913.3:1000000-1500000 | vcffilter ...
# example vcf filter
tabix -h SRR1770413.vcf.gz NC_000913.3:1000000-1500000 | vcffilter -f 'DP > 20' | wc -l
```

The `bgzip` format is very similar to that used in BAM, and the indexing scheme is also similar (blocks of compressed data on top of which we build a chromosome position index).

### Take a peek with `vt`

[vt](https://github.com/atks/vt) is a toolkit for variant annotation and manipulation. Among other methods, it provides a nice one, `vt peek`, to determine basic statistics about the variants in a VCF file.

We can get a summary like so:

```bash
vt peek SRR1770413.vcf.gz
```

### Filtering using the transition/transversion ratio (ts/tv) as a rough guide

`vt` produces a nice summary with the transition/transversion ratio. Transitions are mutations that switch between DNA bases with the same ring structure (either a [purine](https://en.wikipedia.org/wiki/Purine) or a [pyrimidine](https://en.wikipedia.org/wiki/Pyrimidine)).

In most biological systems, [transitions (A<->G, C<->T) are far more likely than transversions](https://upload.wikimedia.org/wikipedia/commons/3/35/Transitions-transversions-v3.png), so we expect the ts/tv ratio to be pretty far from 0.5, which is what it would be if all mutations between DNA bases were equally likely. In practice, we tend to see a ratio of at least 1 in most organisms, and around 2 in some, such as human. In some biological contexts, such as in mitochondria, we see an even higher ratio, perhaps as much as 20.

As we don't have validation information for our sample, we can use this as a simple guide for our first filtering attempts. An easy way is to try different filters using `vcffilter` and check the ratio of the resulting set with `vt peek`:

```bash
# a basic filter to remove low-quality sites
vcffilter -f 'QUAL > 10' SRR1770413.vcf.gz | vt peek -

# scaling quality by the number of alternate observations (AO) is like requiring
# that the additional log-unit contribution of each supporting read is at least N
vcffilter -f 'QUAL / AO > 10' SRR1770413.vcf.gz | vt peek -
```

Note that the second filter removes a large region near the beginning of the reference where there appears to be some paralogy. The read counts for reference and alternate are each around half of the total depth, which is unusual for a sequenced clone and may indicate structural differences between the sample and the original reference.
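Once a filter looks reasonable, we can write the result out as a compressed, indexed VCF for downstream use. A sketch using the second filter above:

```bash
# apply the chosen filter and produce an indexed result
vcffilter -f 'QUAL / AO > 10' SRR1770413.vcf.gz | bgzip >SRR1770413.filt.vcf.gz
tabix -p vcf SRR1770413.filt.vcf.gz
```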
## Part 3: When you know the truth

For serious applications, it's not sufficient to filter on the basis of bulk metrics like the ts/tv ratio alone. Some external validation information should be used to guide the development of pipelines for processing genomic data. In our case, we're just using free data from the web, and unless we find some validation data associated with the strains that were sequenced, we can only filter on intuition, bulk metrics like ts/tv, and with an eye for the particular question we're interested in. What we want is to know the truth for a particular context, so as to understand if our filtering criteria make sense.

### The NIST Genome in a Bottle truth set for NA12878

Luckily, a group at the [National Institute of Standards and Technology](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology) (NIST) has developed a kind of truth set based on the [HapMap CEU cell line NA12878](https://catalog.coriell.org/0/Sections/Search/Sample_Detail.aspx?Ref=GM12878). It's called the [Genome in a Bottle](https://sites.stanford.edu/abms/giab). In addition to characterizing the genome of this cell line using extensive sequencing and manual curation of inconsistencies found between sequencing protocols, the group actually distributes reference material from the cell line for use in validating sequencing pipelines.

To download the truth set, head over to the [Genome in a Bottle ftp site](ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/) and pick up the latest release. As of this writing, that's [GiAB v3.3.2](ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/). Download the high-confidence calls and the callable region targets:

```bash
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz{,.tbi}
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed
```

### The human reference

For variant calling work, we can use the [1000 Genomes Project's version of the GRCh37 reference](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz). We could also use the [version of the reference that doesn't include decoy sequences](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz), as we're just doing variant calling.

```bash
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
samtools faidx hs37d5.fa
```
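As a quick check that the reference unpacked correctly, we can query the `.fai` we just built (its columns are sequence name, length, byte offset, bases per line, and bytes per line); chromosome 20, which we use below, should be listed:

```bash
# confirm that chromosome 20 is present in the reference index
awk '$1 == "20"' hs37d5.fa.fai
```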
### Calling variants in [20p12.1](http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr20%3A12100001-17900000&hgsid=220600397_Vs2XvVv0rRPE9lPwepHAL4Iq3ndi)

To keep things quick enough for the tutorial, let's grab a little chunk of an NA12878 dataset: the cytoband 20p12.1. We'll use a downsampled alignment made from Illumina HiSeq sequencing of NA12878 (HG001) that was used as an input to the [NIST Genome in a Bottle](https://github.com/genome-in-a-bottle) truth set for this sample. (Other relevant data can be found in the [GiAB alignment indexes](https://github.com/genome-in-a-bottle/giab_data_indexes).)

We don't need to download the entire BAM file to do this. `samtools` can download the BAM index (`.bai`) provided it is hosted alongside the file on the HTTP/FTP server, and then use it to jump to a particular target in the remote file.

```bash
samtools view -b ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam 20:12100000-17900000 >NA12878.20p12.1.30x.bam
samtools index NA12878.20p12.1.30x.bam
```

We can call variants as before. Note that we drop the `--ploidy 1` flag; `freebayes` assumes its input is diploid by default. We can use `bgzip` in-line here to save the extra compression command:

```bash
freebayes -f hs37d5.fa NA12878.20p12.1.30x.bam | bgzip >NA12878.20p12.1.30x.vcf.gz
tabix -p vcf NA12878.20p12.1.30x.vcf.gz
```

### Comparing our results to the GiAB truth set

We'll compare against the [GiAB truth set](ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/) we downloaded above. Its core consists of a VCF file defining "true" variants and a BED file defining callable regions.

In order to compare, we need to exclude things in our output that are outside the callable region, and then intersect with the truth set. Anything we call inside the callable region that does not appear in the truth set should be considered a false positive.

First, we'll prepare a reduced representation of this dataset to match 20p12.1:

```bash
# subset the callable regions to chr20 (makes intersection much faster)
cat HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed | grep ^20 >giab_callable.chr20.bed

# subset the high confidence calls to 20p12.1 and rename the sample to match the BAM
tabix -h HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz 20:12100000-17900000 \
    | sed s/HG001/NA12878/ | bgzip >NIST_NA12878_20p12.1.vcf.gz
tabix -p vcf NIST_NA12878_20p12.1.vcf.gz
```

Now we can intersect our calls with the truth set to get a list of potentially failed sites:

```bash
vcfintersect -r hs37d5.fa -v -i NIST_NA12878_20p12.1.vcf.gz NA12878.20p12.1.30x.vcf.gz \
    | vcfintersect -b giab_callable.chr20.bed \
    | bgzip >NA12878.20p12.1.30x.giab_failed.vcf.gz
tabix -p vcf NA12878.20p12.1.30x.giab_failed.vcf.gz
```

We can now examine these using `vt peek` and `vcfstats`, or inspect them manually, either serially:

```bash
zcat NA12878.20p12.1.30x.giab_failed.vcf.gz | less -S
```

... or by looking at loci which fail in `samtools tview`.
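For the `tview` route, one approach is to pull a failing position out of the VCF and open the viewer there. A sketch (press `q` to quit the viewer):

```bash
# take the first failing site and inspect the alignments around it
pos=$(zcat NA12878.20p12.1.30x.giab_failed.vcf.gz | grep -v '^#' | head -1 | cut -f1,2 | tr '\t' ':')
samtools tview -p "$pos" NA12878.20p12.1.30x.bam hs37d5.fa
```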
### Variant normalization

Many of the failed variants are unusual in their normalization. For instance:

```text
20	9575773	.	GAGAG	TATAT	1172.52
```

To ensure that comparisons work correctly, we should "normalize" the variants so that they are represented solely as short indels and SNPs.

There are two main problems:

1. freebayes represents short haplotypes in the VCF output
2. indels may not be completely left-aligned: there could be additional bases on the call that should be removed so that it can be represented in its most-normalized form

Additionally, the variants in the GiAB set have been normalized using a similar process, so doing the same to ours ensures there are no representational discrepancies when we compare.

```bash
vcfallelicprimitives -kg NA12878.20p12.1.30x.vcf.gz \
    | vt normalize -r hs37d5.fa - \
    | bgzip >NA12878.20p12.1.30x.norm.vcf.gz
tabix -p vcf NA12878.20p12.1.30x.norm.vcf.gz
```

Here, `vcfallelicprimitives -kg` decomposes any haplotype calls from `freebayes`, keeping the genotype and site-level annotation. (This isn't done by default because in some contexts doing so is inappropriate.) Then `vt normalize` ensures the variants are left-aligned. This isn't strictly required for the comparison, as `vcfintersect` is haplotype-based and so isn't affected by small differences in the positioning or description of single alleles, but it is good practice.

We can now compare the results again:

```bash
vcfintersect -r hs37d5.fa -v -i NIST_NA12878_20p12.1.vcf.gz NA12878.20p12.1.30x.norm.vcf.gz \
    | vcfintersect -b giab_callable.chr20.bed \
    | bgzip >NA12878.20p12.1.30x.norm.giab_failed.vcf.gz
tabix -p vcf NA12878.20p12.1.30x.norm.giab_failed.vcf.gz
```

Comparing this result with the previous one shows why normalization matters when comparing VCF files. Fortunately, the best package available for comparing variant calls to truth sets, [rtgeval](https://github.com/lh3/rtgeval), addresses exactly this concern, and also breaks comparisons into three parts matching the three types of information provided by the VCF file--- positional, allele, and genotype. We'll get into that in the next section when we learn to genotype and filter.

### Hard filtering strategies

The failed list provides a means to examine ways to reduce our false positive rate using post-call filtering. We can look at the failed list to get some idea of what might be going on with the failures.

For example, we can test how many of the failed SNPs are removed by applying a simple quality filter and checking the output file's statistics:

```bash
vcffilter -f "QUAL > 10" NA12878.20p12.1.30x.norm.giab_failed.vcf.gz \
    | vt peek -
```
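To put a number on the effect, we can also simply count the records before and after filtering, a rough estimate of how many putative false positives the filter removes:

```bash
# records in the failed set, then records surviving the quality filter
zcat NA12878.20p12.1.30x.norm.giab_failed.vcf.gz | grep -vc '^#'
vcffilter -f "QUAL > 10" NA12878.20p12.1.30x.norm.giab_failed.vcf.gz | grep -vc '^#'
```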
We might also want to measure the sensitivity of different filtering strategies. To do this, remove the `-v` flag from the first `vcfintersect` call (the flag inverts the match), so that we keep the calls which *are* found in the truth set:

```bash
vcfintersect -r hs37d5.fa -i NIST_NA12878_20p12.1.vcf.gz NA12878.20p12.1.30x.norm.vcf.gz \
    | vcfintersect -b giab_callable.chr20.bed \
    | bgzip >NA12878.20p12.1.30x.norm.giab_passed.vcf.gz
tabix -p vcf NA12878.20p12.1.30x.norm.giab_passed.vcf.gz
```

Now we can test how many variants remain after applying the same filters to both sets:

```bash
vcffilter -f "QUAL / AO > 10 & SAF > 0 & SAR > 0" NA12878.20p12.1.30x.norm.giab_passed.vcf.gz | vt peek -
vcffilter -f "QUAL / AO > 10 & SAF > 0 & SAR > 0" NA12878.20p12.1.30x.norm.giab_failed.vcf.gz | vt peek -
```


## Part 4: Learning to filter and genotype

Bayesian variant callers like `freebayes` use models based on first principles to generate estimates of variant quality, or the probability that a given genotyping is correct.
However, there is no reason that such a model could not be learned directly from labeled data using supervised machine learning techniques.
In the previous section, we used hard filters on features provided in the VCF file to remove outlier and low-quality variants.
In this section we will use the Genome in a Bottle truth set to learn a model that will directly genotype and filter candidate calls in one step.

### HHGA (Have Haplotypes, Genotypes, and Alleles) and the Vowpal Wabbit

[hhga](https://github.com/ekg/hhga) is an "example decision synthesizer" that transforms alignments (in BAM) and variant calls (in VCF) into a line-based text format compatible with the [Vowpal Wabbit](http://hunch.net/~vw/) (vw).
The Vowpal Wabbit is a high-throughput machine learning tool that uses the hashing trick to map arbitrary text features into a bounded vector space.
It then uses online stochastic gradient descent (SGD) to learn a regressor (the model) mapping the hashed input space to a given output label.
The fact that it is online and uses SGD allows it to be applied to staggeringly large data sets.
The hashing trick also lets it work with extremely large feature sets--- trillions of unique features are not out of the question, although they may be overkill for practical use!
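To make this concrete, `vw`'s input is plain text: each line begins with a label, followed by one or more feature namespaces introduced with `|`. Below is a schematic sketch of what a labeled example might look like; the namespaces and features shown are illustrative, not hhga's literal output:

```text
# label |namespace feature[:value] ...
3 |ref ACTGACTG |hap0 ACTAACTG |software QUAL:187.5 DP:23
```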
HHGA is implemented as a core utility in C++, `hhga`, as well as a wrapper script that enables the labeling of a VCF file with a truth set.
The output file from this script can then be fed into `vw` to generate a model. This model can then be applied to other `hhga`-transformed data, and finally the labeled result may be transformed back into VCF.
Effectively, this allows us to use the model we train as the core of a generic variant caller and genotyper that is driven by candidates produced by a variant caller like freebayes, samtools, platypus, or the GATK.

Let's train a model on 20p12.2 (the neighboring band to 20p12.1). We'll then apply it to 20p12.1 to see if we can best the hard filters we tested in the previous section.

First, we download the region using samtools:

```bash
samtools view -b ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/RMNISTHS_30xdownsample.bam 20:9200000-12100000 >NA12878.20p12.2.30x.bam
samtools index NA12878.20p12.2.30x.bam
```

Now subset the truth set to 20p12.2:

```bash
cat HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed | grep ^20 >giab_callable.chr20.bed
tabix -h HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz 20:9200000-12100000 \
    | sed s/HG001/NA12878/ | bgzip >NIST_NA12878_20p12.2.vcf.gz
tabix -p vcf NIST_NA12878_20p12.2.vcf.gz
```

And call variants using `freebayes`:

```bash
freebayes -f hs37d5.fa NA12878.20p12.2.30x.bam | bgzip >NA12878.20p12.2.30x.vcf.gz
tabix -p vcf NA12878.20p12.2.30x.vcf.gz
```

We can now generate the appropriate input to `vw` for training by using the `hhga_region` script provided in the `hhga` distribution:

```bash
hhga_region \
    -r 20:9200000-12100000 \
    -w 32 \
    -W 128 \
    -x 20 \
    -f hs37d5.fa \
    -T NIST_NA12878_20p12.2.vcf.gz \
    -B giab_callable.chr20.bed \
    -b NA12878.20p12.2.30x.bam \
    -v NA12878.20p12.2.30x.vcf.gz \
    -o 20p12.2 \
    -C 3 \
    -E 0.1 \
    -S NA12878
```

Each line in the output file `20p12.2/20:9200000-12100000.hhga.gz` contains the "true" genotype as its label. We allow up to 7 alleles, giving 120 different diploid genotypes--- thus we should see numbers up to 120 at the start of each line. The rest of the line contains a number of feature spaces, each of which captures information from a particular source relevant to variant calling. The feature spaces are:

```
ref         # the reference sequence
hap*        # the alleles in the VCF record
geno*       # the genotype called in the VCF record
aln*        # the alignments sorted by affinity for each allele
col*        # the alignment matrix transposed
match*      # a match score between each alignment and allele
qual*       # quality-scaled match score
properties* # alignment properties from bam
kgraph      # variation graph alignment weights
software    # annotations in the VCF file from the variant caller
```

We can now build models using `vw` that learn the mapping between these different feature spaces and the genotype. Because we can have a large number of different classes, we use the "error correcting tournament", enabled via the `--ect` argument. Otherwise, we build a model by selecting the feature spaces to consider using `--keep`, followed by a list of namespaces identified by the first letter of their names. For instance:

```bash
vw --ect 120 \
    -d 20p12.2/20:9200000-12100000.hhga.gz \
    -ck \
    --passes 20 \
    --keep s \
    -f soft.model
```

This learns a model based only on the software features and saves it in `soft.model`. In effect, we are learning a kind of one-against-all regression for each possible class label (genotype) based only on the features that freebayes provides in the QUAL and INFO columns of the VCF file. Of note, the `--passes 20` argument tells vw to make up to 20 passes over the data using SGD to learn a regressor, while `-ck` tells vw to use caching to speed this iteration up and to kill any pre-made caches. As `vw` runs, it prints an estimate of its performance by holding out some of the data and testing the model against it at every iteration. If it stops improving on the holdout, it will stop iterating. This, like virtually every aspect of `vw`, is configurable.
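Once training finishes, we can sanity-check the saved model by replaying the data through it in test-only mode (`-t` disables learning, `-i` loads the model); note that, as discussed below, `--keep` is not stored in the model file and must be given again:

```bash
# evaluate the saved model without updating it
vw -t -i soft.model --keep s -d 20p12.2/20:9200000-12100000.hhga.gz
```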
The above model achieves about 3% error, but we can do better by adding feature spaces and interactions between them. We can also change the learning model in various ways, for example by adding nonlinearities such as a small neural network layer (`--nn 10`).

Interactions are particularly important, as they allow us to generate feature spaces for all combinations of other feature spaces. For instance, we might cross the software features with themselves, generating a new feature for every pair of software features. This could be important if we tend to see certain errors or genotypes when pairs of software features move in a correlated way. (We can specify this interaction as `-q ss`.)

Here is a slightly better model that uses more of the feature spaces provided by `hhga`. Including the alignments allows us to learn directly from the raw input to the variant caller. The `match` namespace provides a compressed description of the relationship between every alignment and every allele, while the `kgraph` namespace provides a high-level overview of the set of alignments versus the set of alleles, assuming we've realigned them to a graph that includes all the alleles so as to minimize local alignment bias. The large number of interaction terms is essential for good performance:

```bash
vw --ect 120 \
    -d 20p12.2/20:9200000-12100000.hhga.gz \
    -ck \
    --passes 20 \
    --keep kmsa \
    -q kk -q km -q mm -q ms -q ss \
    -f kmsa.model
```

This achieves 0.2% loss on the held-out portion of the data.

We can now test it on 20p12.1 by running `hhga` to generate unlabeled transformations of our results from `freebayes`, labeling these by running `vw` in prediction mode, and then piping the output back into `hhga`, which can transform the `vw` output into a VCF file. Note that we _must_ not forget to add `--keep kmsa`, as this is not recorded in the model file and its omission will result in poor performance.

```bash
hhga -b NA12878.20p12.1.30x.bam -v NA12878.20p12.1.30x.norm.vcf.gz -f hs37d5.fa \
    -C 3 -E 0.1 -w 32 -W 128 -x 20 \
    | vw --quiet -t -i kmsa.model --keep kmsa -p /dev/stdout \
    | hhga -G -S NA12878 \
    | bgzip >NA12878.20p12.1.30x.norm.hhga.vcf.gz
```
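As with the other compressed VCFs in this tutorial, it's worth indexing the result before we compare it to the truth set:

```bash
tabix -p vcf NA12878.20p12.1.30x.norm.hhga.vcf.gz
```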
### RTG-eval

The best variant calling comparison and evaluation framework in current use was developed by Real Time Genomics, and has since been open-sourced and repackaged into [rtgeval](https://github.com/lh3/rtgeval) by Heng Li. This package was subsequently used as the basis of comparison in the PrecisionFDA challenges in 2016, for which `hhga` was initially developed.

We can easily apply `rtgeval` to our results, but we will need to prepare the reference in RTG's "SDF" format first.

```bash
rtg format -o hs37d5.sdf hs37d5.fa
```

Now we can proceed and test the performance of our previous `hhga` run against the GiAB truth set:

```bash
run-eval -o eval1 -s hs37d5.sdf -b giab_callable.chr20.bed \
    NIST_NA12878_20p12.1.vcf.gz NA12878.20p12.1.30x.norm.hhga.vcf.gz
```

The output of `rtgeval` is a set of reports and files tallying true and false positives.

```
# for alleles
Running allele evaluation (rtg vcfeval --squash-ploidy)...
Threshold  True-pos  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------
     None      8120       139        324     0.9832       0.9616     0.9723

# for genotypes
Running allele evaluation (rtg vcfeval)...
Threshold  True-pos  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------
     None      8085       176        359     0.9787       0.9575     0.9680
```

In this case, we can get a quick overview by looking in the files and directories prefixed by `eval1`. It is also quick to clean up with `rm -rf eval1.*`. _Make sure you clean up before re-running on a new file, or use a different prefix!_

## Part 5: Genome variation graphs

Variation graphs are a data structure that enables a powerful set of techniques which dramatically reduce reference bias in resequencing analysis by embedding information about variation directly into the reference.
In this model, the reference is properly understood as a graph.
Nodes and edges describe sequences and allowed linkages between them, and paths through the graph represent the sequences of genomes that have been used to construct the system.

In this section, we will walk through a basic resequencing pipeline, replacing operations implemented on the linear reference with ones that are based around a graph data model.

First, we construct a variation graph using a [megabase-long fragment of the 1000 Genomes Phase 3 release that's included in the `vg` repository](https://github.com/vgteam/vg/tree/master/test/1mb1kgp).

```bash
vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 32 -p >z.vg
```

Having constructed the graph, we can take a look at it using the [GFA](https://github.com/GFA-spec/GFA-spec) output format of `vg view`.

```bash
vg view z.vg | head
```

The output shows nodes, edges, and path elements that thread the reference through the graph:

```text
H	VN:Z:1.0
S	1	TGGGAGAGA
P	1	z	1	+	9M
L	1	+	2	+	0M
L	1	+	3	+	0M
S	2	T
L	2	+	4	+	0M
S	3	A
P	3	z	2	+	1M
L	3	+	4	+	0M
```

To implement high-throughput operations on the graph, we use a variety of indexes. The two most important ones for read alignment are the [xg](https://github.com/vgteam/xg) and [gcsa2](https://github.com/jltsiren/gcsa2) indexes. `xg` provides a succinct but immutable index of the graph that allows high-performance queries of various attributes of the graph--- such as neighborhood searches and queries relating to the reference sequences that are embedded in the graph. Meanwhile, `GCSA2` provides a full generalization of the FM-index to sequence graphs. GCSA2 indexes a de Bruijn graph generated from our underlying variation graph.

```bash
vg index -x z.xg -g z.gcsa -k 16 -p z.vg
```

This will generate the xg index and a 128-mer GCSA2 index (GCSA2 doubles the initial 16-mer size three times during construction).
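Before simulating and mapping reads, we can sanity-check the graph itself; for example, `vg stats -z` reports the number of nodes and edges (this assumes a reasonably recent `vg` build):

```bash
# report node and edge counts for the constructed graph
vg stats -z z.vg
```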
Now, we can simulate reads from the graph and align them back:

```bash
# simulate 10000 100bp reads with 1% base error and 0.2% indel error,
# writing them as alignments (-a) against the graph
vg sim -n 10000 -l 100 -e 0.01 -i 0.002 -x z.xg -a >z.sim
```

Aligning them back is straightforward:

```bash
vg map -x z.xg -g z.gcsa -G z.sim >z.gam
```

We are then able to look at our alignments (in JSON format) using `vg view -a z.gam | less`.


## errata

If you're part of the 2018 Biology for Adaptation genomics course, [here is a shared document describing system-specific information about available data sets and binaries](https://docs.google.com/document/d/1CV3AUackPEaSw7GkY6f7Q5lnlTVeWkyh6IOrB4jQwMg/edit?usp=sharing).
The [day's lecture slides](https://docs.google.com/presentation/d/1t921ccF66N0_oyn09gbM0w8nzADzWF20rfZkeMv3Sy8/edit?usp=sharing) are also available.
--------------------------------------------------------------------------------