├── images └── ScreenShot.png ├── README.md ├── implementation_notes.md ├── attributions.md ├── references.md ├── unix.md └── edX_Notes.md /images/ScreenShot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/edX/HEAD/images/ScreenShot.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # edX 2 | 3 | Course materials for the [edX Data Analysis for Genomics](https://courses.edx.org/courses/HarvardX/PH525x/1T2014/info) whole genome sequencing module. Course materials include: 4 | 5 | * Introduction to sequencing 6 | * Introduction to re-sequencing studies 7 | * The command line & using VirtualBox 8 | * Resequencing workflow 9 | * Quality control 10 | * Read trimming and filtering 11 | * Alignment and marking duplicates 12 | * Calling variants with FreeBayes 13 | * VCF filtering and annotation 14 | * Exploring VCFs with GEMINI 15 | 16 | ## Authors 17 | 18 | * Oliver Hofmann (overall course design) 19 | * Shannan Ho Sui 20 | * Meeta Mistry 21 | * Radhika Khetani 22 | * Brad Chapman (building the VM) 23 | * Erik Garrison (all things FreeBayes and variant calling) 24 | * Aaron Quinlan (Gemini tutorial) 25 | * Jessica Chong (Gemini case studies) 26 | 27 | ## Attributions 28 | 29 | This course borrows heavily from the [FreeBayes tutorial](http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html) developed by Erik Garrison and course materials from [UC Davis](http://training.bioinformatics.ucdavis.edu/), and other online courses such as the [Software Carpentry materials](http://software-carpentry.org/). 30 | 31 | ### License 32 | 33 | Course materials are made available under an [MIT license](http://opensource.org/licenses/MIT). 34 | 35 | -------------------------------------------------------------------------------- /implementation_notes.md: -------------------------------------------------------------------------------- 1 | ## Implementation Notes 2 | 3 | ### VM installation 4 | 5 | * How to [add local folders to a VirtualBox VM](http://www.howtogeek.com/187703/how-to-access-folders-on-your-host-machine-from-an-ubuntu-virtual-machine-in-virtualbox/) 6 | * Also see [this blog post](http://www.binarytides.com/vbox-guest-additions-ubuntu-14-04/) on how to mount folders accessible to the host OS 7 | 8 | ### Data generation 9 | 10 | #### FASTQ data 11 | 12 | Method to grab the original *.bam data and convert to FASTQC: 13 | 14 | cd ~/ 15 | mkdir ./data 16 | cd ./data 17 | wget http://goo.gl/bq1QQQ 18 | samtools sort -n NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.0121211.bam temp 19 | bedtools bamtofastq -i temp.bam -fq reads.end1.fq -fq2 reads.end2.fq 2> error.log 20 | 21 | Adding noise using [Sherman](http://www.bioinformatics.babraham.ac.uk/projects/download.html#sherman): 22 | 23 | wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr20.fa.gz 24 | gunzip chr20.fa.gz 25 | Sherman -l 101 -n 400000 --genome_folder . -pe -cr 0 -q 40 --fixed_length_adapter 40 26 | 27 | Fixed what Sherman does to the reads: 28 | 29 | sed 's/_R1/\/1/' simulated_1.fastq > simulated_1.fixed 30 | sed 's/_R2/\/2/' simulated_2.fastq > simulated_2.fixed 31 | cat simulated_1.fixed >> reads.end1.fq 32 | cat simulated_2.fixed >> reads.end2.fq 33 | 34 | #### VCF annotation 35 | 36 | This turned into a comedy of errors. 
snpEff doesn't handle annotation with dbSNP anymore, that's done with SnpSift -- for which there is no wrapper. Falling back to bcftools which won't work as the input file is too large for the 32-bit VM to index. Same for vcflib. 37 | 38 | Instead, I installed the old vcftools which can handle uncompressed data, grabbed the hg19 dbSNP bundle from the Broad and subset as needed. Except vcftools throws away the ID column. 39 | 40 | Final solution: bcftools after streaming the data directly into bgzip which works: 41 | 42 | # Get raw data at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19 43 | wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/hg19/dbsnp_138.hg19.excluding_sites_after_129.vcf.gz 44 | gunzip dbsnp_138.hg19.excluding_sites_after_129.vcf.gz 45 | cat dbsnp_138.hg19.excluding_sites_after_129.vcf | bgzip -c > dbsnp.138.vcf.gz 46 | tabix dbsnp.138.vcf.gz 47 | bcftools filter -r chr20 dbsnp.138.vcf.gz > dbsnp.138.chr20.vcf 48 | bgzip dbsnp.138.chr20.vcf 49 | 50 | 51 | ### Tool notes 52 | 53 | I have run into several cases where vcffilter just crashes. Could replace relevant sections completely with [bcftools](http://samtools.github.io/bcftools/bcftools.html#expressions) if this continues to be an issue. 54 | -------------------------------------------------------------------------------- /attributions.md: -------------------------------------------------------------------------------- 1 | ## Attributions 2 | 3 | Attributions for images where no URL is provided on the slide. All photographs courtesy of Flickr Creative Commons (only licenses which allow modifications and derivative use); images are screenshots from HBC work unless stated otherwise. 4 | 5 | **Title Slide:** 6 | 7 | * Ethernet, Bob Mical, https://www.flickr.com/photos/small_realm/14186949118 8 | * DNA, John Goode, https://www.flickr.com/photos/johnnieb/17200471/ 9 | 10 | **Genomics medicine:** 11 | 12 | * Karyotype, Can H., https://www.flickr.com/photos/47988426@N08/8252270882 13 | 14 | **Exome sequencing:** 15 | 16 | * Exome view in IGV, test data screenshot 17 | 18 | **Scope:** 19 | 20 | * Studying…, Clay Shonkwiler, https://www.flickr.com/photos/shonk/418180402 21 | 22 | 23 | **Introduction to Sequencing Technologies:** 24 | 25 | * Riding Shotgun, Steve Jurvetson, https://www.flickr.com/photos/jurvetson/57080968/ 26 | 27 | **Illumina slides:** 28 | 29 | * Taken from the UC Davis High Throughput Sequencing Fundamentals lecture, http://training.bioinformatics.ucdavis.edu/docs/2014/09/september-2014-workshop/Monday_JF_HTS_lecture.html 30 | 31 | **Sequencing by synthesis:** 32 | 33 | * http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734321/figure/F1/ 34 | 35 | 36 | **IonTorrent:** 37 | 38 | * EBI Training Course materials, https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/what-next-generation-dna-sequencing/ion-torre 39 | 40 | **PacBio:** 41 | 42 | * PacBio Press Kit, http://www.pacificbiosciences.com/news_and_events/mediakit/ 43 | 44 | **Introduction to Resequencing:** 45 | 46 | * DNA / Protein function finder from @WellcomeTrust @SangerInstitute @emblebi @YourGenome, Duncan Hull, https://www.flickr.com/photos/dullhunk/4422952630/ 47 | 48 | 49 | **In an ideal world:** 50 | 51 | * Rainbow, Moyan Brenn, https://www.flickr.com/photos/aigle_dore/4516670863 52 | 53 | **Unix and the command line:** 54 | 55 | * Keyboard, Tristan Schmurr, https://www.flickr.com/photos/kewl/5212626838 56 | 57 | 58 | **Quality control and grooming** 59 | 60 | * Backyard Hair 
Cut, sean hobson, https://www.flickr.com/photos/seanhobson/4353671803 61 | 62 | **Illumina error sources:** 63 | 64 | * Ledergerber, Base-calling for next-generation sequencing platforms, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3178052/figure/F1/ 65 | 66 | **Alignment:** 67 | 68 | * Puzzled, Pratap Sankar, https://www.flickr.com/photos/gugugagaa/5378186359 69 | 70 | **What to align against:** 71 | 72 | * Human genome printed, Adam Nieman, https://www.flickr.com/photos/johnjobby/2252981353 73 | 74 | **Marking duplicates:** 75 | 76 | * 91 variations on a theme, Kevin Dooley, https://www.flickr.com/photos/pagedooley/3227848591 77 | 78 | **Identifying variation:** 79 | 80 | * répétition 2, olibac, https://www.flickr.com/photos/olibac/2354102486 81 | 82 | **Finding the right variants:** 83 | 84 | * A Needle in a Hay Stack, Michael Gil, https://www.flickr.com/photos/msvg/5143096005 85 | 86 | **Experimental design:** 87 | 88 | * Marcin Wichary, https://www.flickr.com/photos/mwichary/2635813162 89 | 90 | **First step:** 91 | 92 | * 20141025 Pencils 068, John, https://www.flickr.com/photos/cygnus921/15628288442 93 | * Question Mark Graffiti, Bilal Kamoon, https://www.flickr.com/photos/bilal-kamoon/6835060992 94 | 95 | **A quick orientation:** 96 | 97 | * Wikimedia, http://commons.wikimedia.org/wiki/File:UCSC_human_chromosome_colours.png 98 | -------------------------------------------------------------------------------- /references.md: -------------------------------------------------------------------------------- 1 | Dear all, 2 | 3 | 4 | Shannan suggested I started providing URLs of materials I’ll re-use for the edX course. Here goes: 5 | 6 | * I’ve used the general framework of our [Galaxy course](http://scriptogr.am/ohofmann/exome-seq) on Exome-Seq, ripping out all parts that are not relevant. 7 | * As we probably have to provide a bit more in-depth information on NGS itself I was planning to recycle materials from [UC Davis’ Intro to NGS](https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/00000dyb3riyzvb/Next%20Gen%20Fundamentals%20Boot%20Camp%20-%202013%20June%2018.pdf) presentation (PDF) and the more recent [2014 version](http://training.bioinformatics.ucdavis.edu/docs/2014/09/september-2014-workshop/_downloads/Monday_JF_HTS_lecture.pdf) 8 | * For QC I’ll stick to our approach, but use [the command line](http://training.bioinformatics.ucdavis.edu/docs/2014/09/september-2014-workshop/Monday_JF_QAI_exercises.html) instead; I might pull in some additional examples from the [Davis slides](http://training.bioinformatics.ucdavis.edu/docs/2014/09/september-2014-workshop/_downloads/Monday_JF_QAI_lecture.pdf) (PDF) 9 | * We have [read alignment](http://training.bioinformatics.ucdavis.edu/docs/2014/09/september-2014-workshop/_downloads/Tuesday_JF_Alignment_lecture.pdf) (PDF) covered nicely already but can expand if needed. 10 | * The core component is going to use [FreeBayes](http://arxiv.org/pdf/1207.3907v2.pdf) (PDF) instead of GATK. Luckily, Eric has created an awesome [‘Getting started’ tutorial](http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html) that I will be retracing pretty much completely. 
There is more information on the [FreeBayes GitHub page](https://github.com/ekg/freebayes#readme) and in a [2013 presentation](http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/CSHL%20advanced%20sequencing%20variant%20detection.pdf) (PDF) 11 | * Despite that, it might make sense to go through the [GATK Best Practices](https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq). I might also pull in some more details from their [in-depth description](http://www.scribd.com/doc/254312013/From-FastQ-Data-to-High-Confidence-Variant-Calls-The-Genome-Analysis-Toolkit-Best-Practices-Pipeline) if there’s sufficient time, but taking a few excerpts from [UC Davis’ Notes on GATK presentation](http://training.bioinformatics.ucdavis.edu/docs/2013/12/variant-discovery/_downloads/GATKvariantdiscovery_120913.pdf) (PDF) should be sufficient. 12 | * For the VCF interpretation, the GATK’s [summary](https://www.broadinstitute.org/gatk/guide/article?id=1268) is a good starting point; I will stick to regular VCFs rather than tackling gVCFs 13 | * Aaron has a [Gemini tutorial](http://quinlanlab.org/tutorials/cshl2013/gemini.html) that we can retrace. Major drawback is that this adds 1.5GB to the download 14 | * For the ‘next steps’ I wanted to walk them through a bcbio tutorial. Problem here is that this is quite unlikely to work on any of the participants laptops / VMs. Ideas welcome. 15 | * Ideally I’d like to use the [somatic variant calling pipeline for bcbio](https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#cancer-tumor-normal) as an example, but given the IT constraints I am thinking we might do the whole somatic variant calling bit just as slides / presentation. 16 | * Can use this as a reminder of reproducibility, scalability, etc., using slides from Brad and Rory (see [some recent talks](https://bcbio-nextgen.readthedocs.org/en/latest/contents/presentations.html). 17 | * Also part of what-next could be AWS; again, now supported by bcbio. We can point people at (again) the [UC Davis AWS signup](http://training.bioinformatics.ucdavis.edu/docs/2014/12/december-2014-workshop/Tuesday-AWS-Intro.html) tutorial or use our own. 18 | 19 | The Linux/Unix parts are completely independent right now. I was going to re-use materials from the [Unix Intro](http://www.ee.surrey.ac.uk/Teaching/Unix/) and the [Unix primer for biologists](http://korflab.ucdavis.edu/unix_and_Perl/), but if you have something ready to go from the other course I’d love to delegate this. They _will_ have to learn about basic redirects and pipes, but that’s as far as it goes. 20 | 21 | At this point I am tempted to drop structural variation calls as they require almost invariably whole genome data which I do not want to handle in this course. We can explore CNVs, but I am not sure we are going to find anything in the shallow sequence data we are using. If we absolutely want to include it the section will be based on [CNVKit](http://cnvkit.readthedocs.org/en/latest/). 22 | 23 | 24 | -------------------------------------------------------------------------------- /unix.md: -------------------------------------------------------------------------------- 1 | ### Introduction to Unix [SCREENCAST] 2 | 3 | ####Introduction and the terminal 4 | 5 | A short slide deck as part of the screencast to introduce UNIX/Linux and command line concepts 6 | 7 | ####Finding your way around the file system 8 | >Note – All UNIX commands are case sensitive 9 | 10 |
 11 | $ ls      # list all the files and subdirectories in the current directory
 12 | 
 13 | $ ls unix_exercise/       # list all the files and subdirectories in the directory unix_exercise
 14 | 
 15 | $ ls unix_exercise/readme.txt       # list the file readme.txt within unix_exercise
 16 | 
17 | 18 | The colors in the listing of the items in the current directory indicate what type of file/folder it is. 19 |
 20 | $ ls -F unix_exercise/         # “-F” is an argument for the command ls, we will talk more about arguments momentarily
 21 | 
 22 | 23 | In the listing - 24 | 25 | * All entries with a `/` at the end <-> directories 26 | * All entries with a `*` at the end <-> executables 27 | * All entries with a `@` at the end <-> a linked directory or file (a “symbolic link” or “sym link”). This is a directory or file that appears at this location but actually points to a branch elsewhere on the directory tree, which is useful for easy access. 28 | * The rest of the entries are files. 29 | 30 | Let's make a new file: 31 |
 32 | $ touch testfile          # touch lets you create a new, empty file
 33 | 
 34 | $ ls
 35 | 
36 | 37 | You can see that you have created a simple file; now let's remove (delete) the file we just created: 38 | 39 |
 40 | $ rm testfile        # rm = remove file. 
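Be aware that `rm` deletes files permanently; there is no recycle bin on the command line. If you would like a safety net, the `-i` argument makes `rm` ask for confirmation first (shown here purely as an illustration):

    $ rm -i testfile       # asks for confirmation before deleting the file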
 41 | 
42 | > 43 | Note – File naming convention in UNIX has certain features: 44 | > 45 | >* Use only letters (upper- and lower-case), numbers from 0 to 9, a dot (.), underscore (_), hyphen (-). Do not use “space” in file names. 46 | > 47 | >* Avoid other characters, as they may have special meaning to either Linux, or to the application you are trying to run. 48 | > 49 | >* Extensions are commonly used to denote the type of file, but are not necessary. On the command line you always explicitly specify a program that is supposed to open or do something specific to the file. 50 | > 51 | >* The dot (.) does not have any special meaning in Linux file names. (Remember that the dot does have a special meaning in the context of the command line though.) 52 | 53 | 54 | What is your location right now in the context of the directory structure? You are in your home directory… But, “where” in the tree of the directory structure is your home directory? 55 | 56 |
 57 | $ pwd          # pwd = print working directory
 58 | 
59 | Lets change our working directory to unix_exercise 60 |
 61 | $ cd unix_exercise        # cd = change directory
 62 | 
63 | 64 | A *Relative* path, is a path to the location of your file of interest relative to your working directory (or location) (e.g. `../file.txt`), whereas the *Full* path is the location of the file starting with the root directory (e.g. `/home/vagrant/text_files/file.txt`). 65 | 66 | In the `cd` command above you used a relative path. 67 |
 68 | $ cd      # no matter where you are, if you say just “cd”, the OS returns you back to your home directory
 69 | 
 70 | $ cd /home/vagrant/unix_exercise
 71 | 
 72 | $ pwd
 73 | 
 74 | $ cd
 75 | 
 76 | $ cd unix_exercise
 77 | 
 78 | $ pwd
 79 | 
80 | 81 | #### Manipulating files and directories 82 | Making new directories is very simple; let's make a new one called new_dir. 83 |
 84 | $ mkdir new_dir      # mkdir = make new directory
 85 | 
86 | How is the following command different from the previous one? 87 |
 88 | $ mkdir new dir      # this creates two directories, new and dir, because the space separates the two names
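As an aside, if you really did want a single directory with a space in its name you would have to quote it, although the naming conventions above recommend avoiding spaces altogether:

    $ mkdir "new dir"      # the quotes make this a single directory named "new dir"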
 89 | 
90 | 91 |
 92 | $ rm new             # “rm: cannot remove `new`: Is a directory”
 93 | 
94 | 95 | We need an argument to go along with rm to enable it to delete directories. Man pages (manual for each command) are very helpful in figuring out the arguments you can use with specific commands. (Other than man pages, the internet is a good resource for more information about commands, but be discerning.) Arguments help you give the command special instructions and do more specific tasks. 96 | 97 |
 98 | $ man rm             # man = manual for a specific UNIX command. typing the letter q (lower case) will get you back to the command line
 99 | 
100 | $ rm -r new
101 | 
102 | $ man ls
103 | 
104 | 105 | Let's backup the unix_exercise directory; we can copy the whole directory to a new directory: 106 |
 
107 | $ cp -r unix_exercise unix_exercise_backup         # cp = copy file or directory (-r). The first directory or file name you state after the command is what is being copied; the second is what you are copying to. If the directory/file you state second doesn’t already exist, it is created
108 | 
109 | 110 |
111 | $ cd unix_exercise/
112 | 
113 | 114 | Create a new file using touch, move it to the home directory and then rename it: 115 |
116 | $ touch new_file.txt
117 | 
118 | $ mv new_file.txt /home/vagrant/     # moved.
119 | 
120 | $ cd /home/vagrant/
121 | 
122 | $ mv new_file.txt home_new_file.txt      # renamed! mv can move and rename files.
123 | 
124 | 125 | > 126 | Note – As we start learning more about manipulating files and directories, one important thing to keep in mind is that unlike Windows and Mac OS, this OS will not check with you before replacing a file. E.g. if you already have a file named foo.txt, and you give the command `cp boo.txt foo.txt`, all your original information in foo.txt will be lost. 127 | 128 | #### Examining file contents 129 | 130 | So far we have learned to move files around, and do basic file and directory manipulations. Next we’ll learn about how to look at the content of a file. 131 | 132 | The commands, `cat`, `head` and `tail` will print the contents of the file onto the monitor. `cat` will print ALL the contents of a file onto your terminal window, so be aware of this for huge files. 133 | 134 | `head` and `tail` will show only the number of lines you want to see (default is 10 lines): 135 |
136 | $ cat readme.txt          # cat = concatenate, prints the whole file
137 | 
138 | $ cd sequence/
139 | 
140 | $ head chr4.fa       # prints on the screen the first 10 lines of the file
141 | 
142 | $ tail chr4.fa       # prints on the screen the last 10 lines of the file
143 | 
144 | 145 | The commands `less` and `more` allow you to quickly take a look inside a file. With `less` you can use the arrow keys to go up and down, pressing the “q” key will get you back to the command prompt. This is similar to what you encountered with the `man` command. `more` is not as good for large files, since it loads the whole file into memory, and also it doesn’t let you go backwards within the file. So we will stick to using `less`. 146 |
147 | $ less chr4.fa
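A few useful keys inside `less`: typing `/` followed by a pattern and Enter searches forward, `n` repeats the search, `G` jumps to the end of the file, and `q` quits. For example, while viewing chr4.fa you could type:

    /NNNN      # searches for the first stretch of Ns in the sequence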
148 | 
149 | 150 | #### Output redirection (>, >> and |) 151 | 152 | What if we wanted to collect the top and bottom 50 genes from genelist1.txt in the genelists directory, and make a new file with the 100 genes? 153 |
154 | $ cd
155 | 
156 | $ cd unix_exercise/genelists/
157 | 
158 | $ head -n 50 genelist1.txt > genelist1_test1.txt          # “>” redirects output to the specified file, but if the file already exists it overwrites the contents!
159 | 
160 | $ tail -n 50 genelist1.txt > genelist1_test2.txt
161 | 
162 | $ cat genelist1_test1.txt genelist1_test2.txt > genelist1_test_combined.txt           # the “cat” command will print the contents of both files in the order you list them, and in this case you have redirected the merged output from the 2 files to a new file instead of the terminal
163 | 
164 | 165 | We used 3 steps above to get the combined file; instead, we could have done it in 2 steps (see below). 166 |
167 | $ head -n 50 genelist1.txt > genelist1_test_combined.txt           # this will overwrite the file
168 | 
169 | $ tail -n 50 genelist1.txt >> genelist1_test_combined.txt          # “>>” redirects the output to the specified file, but appends the new content to the end of the file instead of overwriting it
170 | 
171 | 172 | Pipes or `|` is a very handy UNIX tool to string together several commands into 1 command line. Basically, it takes the output of one command and “pipes” it into the next command as input. 173 | 174 | What if we also needed to make sure that the new document had the data sorted alphabetically? 175 |
176 | $ cat genelist1_test1.txt genelist1_test2.txt | sort > genelist1_test_combined_sorted.txt       #sort = sorts data as you specify in the arguments, default is alphanumeric in ascending order
177 | 
178 | $ head genelist1_test_combined*			# the asterisk "*" is a wildcard and can be used in place of 1 or multiple characters
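Pipes work with almost any command; for example, `wc -l` counts lines, so you can quickly confirm that the combined file really contains 100 entries (assuming genelist1.txt has at least 50 lines):

    $ cat genelist1_test_combined.txt | wc -l      # should print 100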
179 | 
180 | 181 | #### Permissions 182 | UNIX is a multiuser system, and to maintain privacy and security, most users can only access a small subset of all the files. 183 | 184 | * You are the owner of every file and directory that is under your home directory. 185 | * The system administrator (sys admin) or other users can determine what else you have access to. 186 | * Each file and directory has associated “permissions” for different types of access; reading, writing and executing (scripts or programs). 187 | * You are allowed to change the permissions of any file or directory you “own” and in some cases a file or a directory that you have access to as part of a “group” (co-ownership). 188 | 189 |
190 | $ ls -l /home/vagrant/unix_exercise/
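As an illustration of what this listing looks like (the names, dates and sizes below are made up, your output will differ):

    drwxr-xr-x 2 vagrant vagrant 4096 Jan 12 09:30 genelists
    drwxr-xr-x 2 vagrant vagrant 4096 Jan 12 09:30 sequence
    -rw-r--r-- 1 vagrant vagrant  512 Jan 12 09:30 readme.txt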
191 | 
192 | 193 |
194 | 195 | 196 | 197 |
198 | 199 | `d`: directory (or `-` if file); 200 | 201 | `r`: read permission; 202 | 203 | `w`: write permission; 204 | 205 | `x`: execute permission (or permission to `cd` if it is a directory); 206 | 207 | `-`: no permission. 208 | 209 | `drwxr-xr--` can be divided into `d` `rwx` `r-x` `r--`, and it means the following: 210 | 211 | * owner (u) has `rwx` read, write and execute permissions for the directory 212 | 213 | * group (g) has `r-x` only read and execute permissions for the directory 214 | 215 | * others (o) has `r--` only read permission for the directory 216 | 217 | How do you set or change these permissions? 218 |
219 | $ cd ../
220 | 
221 | $ chmod -R o-rwx sequence/     # others have no read, write or execute permissions for this directory or any file within it
222 | 
223 | $ ls -lh 
224 | 
225 | $ chmod u+rwx hello_world.sh
226 | 
227 | $ ls -lh
228 | 
229 | $ chmod -R 764 sequence/       # numeric (octal) mode, equivalent to “chmod -R u=rwx,g=rw,o=r”
230 | 
231 | $ ls -lh
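As a quick reference for the numeric form: each octal digit is the sum of read (4), write (2) and execute (1), with one digit each for owner, group and others. A small sketch (the file name is a placeholder):

    # 7 = 4+2+1 = rwx, 6 = 4+2 = rw-, 4 = r--
    $ chmod 764 somefile       # owner rwx, group rw-, others r--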
232 | 
233 | A “sticky bit” is applied to shared directories to protect files such that only the owner has the ability to change permissions. 234 | 235 | `chown` and `chgrp` are commands that let you change owner and groups respectively, but you need to start out with correct permissions to be able to execute these on a file/directory. 236 | -------------------------------------------------------------------------------- /edX_Notes.md: -------------------------------------------------------------------------------- 1 | # Case study: Variant Discovery and Genotyping 2 | 3 | This session provides a basic introduction to conducting a re-sequencing analysis using command-line tools to identify [single nucleotide polymorphims](http://ghr.nlm.nih.gov/handbook/genomicresearch/snp) (SNPs) and small insertion or deletions (InDels). We will be retracing all of the steps required to get from an Illumina FASTQ sequence file received from a sequencing facility as part of a genome sequence analysis all the way to germline variant calls and variant prioritization. As part of this module we will also discuss approaches to test for changes in structural variation, explore differences between germline and somatic variant calling, and provide an overview on how to best scale the approach when handling much larger sample sets. 4 | 5 | To keep things manageable and allow algorithms to finish within a few minutes we picked a single sample from the 1000 Genomes Project: NA12878, sequenced as part of a CEU trio which has become the de-facto standard to benchmark variant calling approaches and is used by groups such as the [Genome in a Bottle consortium](http://www.nist.gov/mml/bbd/ppgenomeinabottle2.cfm) to better understand the accuracy of sequencing technologies and bioinformatic workflows. 6 | 7 | 8 | ## Introduction to sequencing technologies 9 | 10 | While current high-throughput sequencing is dominated by Illumina sequencers — which sequence by synthesis - a wide variety of other technologies exist, with ‘long read’ technology such as PacBio and NanoPore having a strong impact in the _de novo_ assembly of microbial genomes (although more recently they have also been used to re-assemble human genomes if at a significant cost). In brief, the sequencing technology used needs to be matched to the experimental design, and even within the Illumina platform there are distinct differences — what read length to aim for, whether a protocol should be stranded or not, whether to use ‘paired ends’, etc. Maybe most importantly, it also needs to be paired to local expertise: what is your sequencing group comfortable with, and what can be produced from the starting material (fresh frozen samples, FFPE-treated materials, old degraded DNA, etc.). 11 | 12 | 13 | ## Introduction to resequencing analysis 14 | 15 | Genomic DNA can be obtained from small samples of almost any biological material such as saliva, blood, skin cells. By isolating the DNA, framenting it into smaller pieces and generated short-read data of sufficient depth (or coverage) with any current sequencing technologies allows for the systematic comparison of the sample's DNA with the human reference genome, ideally identifying single nucleotide variation (SNVs), short insertion or deletions (InDels) and all the way to larger structural variants. 
This can be of interest when sequencing whole populations to get a better understanding of population structure and heritage, in case/control studies to identify variants potentially associated with common diseases, in trio- or family-based studies to identify the likely cause of rare (ideally monogenetic) diseases by comparing affected and unaffected family members or by sequencing individuals, mostly in the case of tumor/normal samples in cancer patients to identify suitable therapies and explore likely paths to resistance. 16 | 17 | Approaches range from targeted sequencing of regions of interest (GWAS regions, targeted gene panels) to exome-sequencing and whole genome sequencing. You will find [countless discussions](http://blog.genohub.com/whole-genome-sequencing-wgs-vs-whole-exome-sequencing-wes/) about the benefits of each, but it ultimately comes down to the availability of the material and the cost/benefit ratio. In many ways a whole genome sequencing run is a 'better' exome in that coverage tends to be more uniform and there are no issues with trying to enrich for target regions with different capture protocols, but at the same time we are still quite a ways away from being able to fully interpret non-coding regions. 18 | 19 | For this session we will pursue the analysis of a single sample from a healthy donor that was whole genome sequenced at shallow depth. Our workflow includes an initial quality control, masking duplicate reads, the alignment of reads to a reference genome followed by variant calls and subsequent annotation and filtering. As we want you to be able to scale up this workflow to multiple samples sequenced at much higher depth we will use command line tools only -- which means we need to start with a basic introduction to UNIX. 20 | 21 | 22 | ## Introduction to the command line 23 | 24 | Since we want you to be able to apply anything you learn in this module to your own data and at scale we will have to fall back to the command line. While commercial solutions with user interfaces exists those are not uniformly available, and while we use [Galaxy](http://hbc.github.io/ngs-workshops/) for workshops on RNA-Seq, ChIP-Seq and re-sequencing studies it is not (yet) designed for large scale data sets with hundreds of samples. We will talk about workflow systems towards the end of this course, but for now this means gaining a basic understanding of UNIX and how to use a command line. 25 | 26 | We will be using a [‘Virtual Machine’](http://en.wikipedia.org/wiki/Virtual_machine) (VM) for this module to ensure everyone has access to the variety of methods and tools required for a whole genome re-sequencing analysis. While you might have access to the same software on a local server or cluster it still might be easier to use the VM as we have tested that the used versions work well together. We will guide you through the installation process in our screencast, but here are the URLs for you to follow along: 27 | 28 | * Download the virtual machine manager, VirtualBox, from [VirtualBox.org](https://www.virtualbox.org/), making sure you pick the right operating system for your laptop or desktop; install VirtualBox on your machine. 29 | * Download the [VM image](https://s3.amazonaws.com/edx-public-downloads/HarvardX/PH525_Files/edX_PH525.6x_variant.ova) (~1 GB) to your local machine. 30 | * Start VirtualBox and use `File` - `Import Appliance` from the menu, selecting the VM image you just downloaded. This will trigger a menu where you can change the Appliance settings. 
We recommend giving the VM as much memory as you can given your local machine (you won't need more than 4GB though). Start the import process. 31 | * With the import finished right click on the newly imported `edx_ngs` image in the list view and pick `Settings`. In the settings menu find the `Shared Folders` entry to add a disk drive share (the folder symbol with a green plus symbol). 32 | * Pick a folder path on your local drive that you can access easily. This will be used to exchange data files and reports with your VM. Give it an easy to remember name (e.g., `ngs`) and tick the `auto-mount` box. 33 | * I would also recommend enabling the `Shared Clipboard` (`Bidirectional`) under the `General` section. Leave the settings dialog via the `OK` button. 34 | 35 | With these configurations complete you can start the VM by selecting it in the list and clicking on the `Start` button. The VM will start up and leave you at the login screen. For this course, please use the following user and password: 36 | 37 | User: vagrant 38 | Password: vagrant 39 | 40 | ### Introduction and the terminal 41 | 42 | You should end up with a ‘terminal’ window showing a command line prompt in your home directory. Time to take a look around by typing in a few commands followed by enter. Note — all UNIX commands are case sensitive. We prefix things to enter at the terminal with a command line prompt (`$`), and flag comments with a `#`. Let’s start by listing all the files and subdirectories in the current directory: 43 | 44 | $ ls 45 | 46 | List all the files and subdirectories in the directory `unix_exercise`: 47 | 48 | $ ls unix_exercise/ 49 | 50 | List the file `readme.txt` within `unix_exercise`: 51 | 52 | $ ls unix_exercise/readme.txt 53 | 54 | The colors in the listing of the items in the current directory indicate what type of file/folder it is: 55 | 56 | $ ls –F unix_exercise/ 57 | 58 | The `-F` is an argument for the command `ls`, we will talk more about arguments momentarily. In the listing: 59 | 60 | * All entries with a `/` at the end are directories 61 | * All entries with a `*` at the end are _executables_ (things you can start) 62 | * All entries with a `@` at the end represent a linked directory or file (“symbolic link” or a “sym link”). This is a directory or file linked to your home page, but is actually a branch elsewhere on the directory tree. This is useful for easy access. 63 | * The rest of the entries are files. 64 | 65 | Lets make a new file using the `touch` command which creates an empty file: 66 | 67 | $ touch testfile 68 | $ ls 69 | 70 | You can see that you have created a simple file, now lets remove or delete that file we just created: 71 | 72 | $ rm testfile 73 | 74 | Here, `rm` stands for `remove` (file). Note that file naming convention in UNIX has certain features: 75 | 76 | * Use only letters (upper- and lower-case), numbers from 0 to 9, a dot (`.`), underscore (`_`), hyphen (`-`). Do not use “space” in file names. 77 | * Avoid other characters, as they may have special meaning to either Linux, or to the application you are trying to run. 78 | * Extensions are commonly used to denote the type of file, but are not necessary. On the command line you always explicitly specify a program that is supposed to open or do something specific to the file. 79 | * The dot (.) does not have any special meaning in Linux file names. (Remember that the dot does have a special meaning in the context of the command line though.) 
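To illustrate the point about spaces in file names, compare the following two commands (the file names are only examples):

    touch my file.txt     # creates TWO files, "my" and "file.txt"
    touch my_file.txt     # creates a single file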
80 | 81 | What is your location right now in the context of the directory structure? You are in your home directory… but, _where_ in the tree of the directory structure is your home directory? 82 | 83 | # pwd = print working directory 84 | $ pwd 85 | 86 | Lets change our working directory to unix_exercise: 87 | 88 | # cd = change directory 89 | $ cd unix_exercise 90 | 91 | A *relative* path is a path to the location of your file of interest relative to your working directory (or location) (e.g. `../file.txt`), whereas the *full* path is the location of the file starting with the root directory (e.g. `/home/vagrant/text_files/file.txt`). In the `cd` command above you used a relative path. 92 | 93 | # no matter where you are, if you say just “cd”, the OS returns you back to your home directory 94 | $ cd 95 | $ cd /home/vagrant/unix_exercise 96 | $ pwd 97 | $ cd 98 | $ cd unix_exercise 99 | $ pwd 100 | 101 | 102 | ### Manipulating files and directories 103 | 104 | Making new directories is very simple, lets make a new one called new_dir. 105 | 106 | # mkdir = make new directory 107 | $ mkdir new_dir 108 | 109 | How is the following command different from the previous one? 110 | 111 | # two more directories new and dir will be created because of naming conventions 112 | $ mkdir new dir 113 | # “rm: cannot remove `new`: Is a directory” 114 | $ rm new 115 | 116 | We need an argument to go along with `rm` to enable it to delete directories. Man pages (manual for each command) are very helpful in figuring out the arguments you can use with specific commands (other than man pages, the internet is a good resource for more information about commands, but be discerning). Arguments help you give the command special instructions and do more specific tasks. 117 | 118 | # man = manual for a specific UNIX command. typing the letter q (lower case) will get you back to the command line 119 | $ man rm 120 | $ rm –r new 121 | $ man ls 122 | 123 | Let's backup the unix_exercise directory; we can copy the whole directory to a new directory: 124 | 125 | # cp = copy file or directory (-r). 126 | $ cp –r unix_exercise unix_exercise_backup 127 | 128 | The first directory or file name you state after the command is what is being copied; the second file name is what you are copying to. When you use copy, if the a directory / file that you state as the second argument doesn’t already exist it is created. 129 | 130 | $ cd unix_exercise/ 131 | 132 | Create a new file using touch, move it to the home directory and then rename it: 133 | 134 | $ touch new_file.txt 135 | $ mv new_file.txt /home/vagrant/ 136 | $ cd /home/vagrant/ 137 | # mv can move and rename files. 138 | $ mv new_file.txt home_new_file.txt 139 | 140 | Note – As we start learning more about manipulating files and directories, one important thing to keep in mind is that unlike Windows and Mac OS, this OS will not check with you before replacing a file. E.g., if you already have a file named foo.txt, and you give the command `cp boo.txt foo.txt`, all your original information in foo.txt will be lost. 141 | 142 | 143 | #### Examining file contents 144 | 145 | So far we have learned to move files around, and do basic file and directory manipulations. Next we’ll learn about how to look at the content of a file. The commands, `cat`, `head` and `tail` will print the contents of the file onto the monitor. `cat` will print ALL the contents of a file onto your terminal window, so be aware of this for huge files. 
`head` and `tail` will show only the number of lines you want to see (default is 10 lines): 146 | 147 | $ cd unix_exercise/ 148 | # cat = catenate, prints the whole file 149 | $ cat readme.txt 150 | $ cd sequence/ 151 | $ head chr4.fa # prints on the screen the first 10 lines of the file 152 | $ tail chr4.fa # prints on the screen the last 10 lines of the file 153 | 154 | The commands `less` and `more` allow you to quickly take a look inside a file. With `less` you can use the arrow keys to go up and down, pressing the “q” key will get you back to the command prompt. This is similar to what you encountered with the `man` command. `more` is not as good for large files, since it loads the whole file into memory, and also it doesn’t let you go backwards within the file. So we will stick to using `less`. 155 | 156 | $ less chr4.fa 157 | 158 | 159 | ### Output redirection (>, >> and |) 160 | 161 | What if we wanted to collect the top and bottom 50 genes from genelist1.txt in the genelists directory, and make a new file with the 100 genes? 162 | 163 | $ cd 164 | $ cd unix_exercise/genelists/ 165 | $ head –n 50 genelist1.txt > genelist_test1.txt # “>” redirects output to specified file, but if the file already existed it overwrites the contents! 166 | $ tail –n 50 genelist1.txt > genelist_test2.txt 167 | $ cat genelist1_test1.txt genelist_test2.txt > genelist1_test_combined.txt 168 | 169 | The “cat” command will print the contents of both files in the order you list them, and in this case you have redirected the merged output from the 2 files to a new file instead of the terminal. We used 3 steps above to get the combined file, instead we could have done it in 2 steps: 170 | 171 | $ head –n 50 genelist1.txt > genelist1_test_combined.txt # this will overwrite the file 172 | $ tail –n 50 genelist1.txt >> genelist1_test_combined.txt 173 | 174 | The `>>` symbols redirect the output to specified file, however it appends the new content to the end of the file and does _not_ overwrite it. 175 | 176 | Another way to handle output redirection are ‘pipes’. A pipe, represented by the `|` symbol, is a very handy UNIX tool to string together several commands into one command line. Basically, it takes the output of one command and “pipes” it into the next command as input. What if we also needed to make sure that the new document had the data sorted alphabetically? 177 | 178 | $ cat genelist1_test1.txt genelist1_test2.txt | sort > genelist1_test_combined_sorted.txt #sort = sorts data as you specify in the arguments, default is alphanumeric in ascending order 179 | $ head genelist1_test_combined* # the asterisk "*" is a wildcard and can be used in place of 1 or multiple characters 180 | 181 | With these basic commands you have everything you need to finish the rest of this module. Over time you will want to do additional things, and a good starting point to learn more are the [Software Carpentry course](http://software-carpentry.org/). As with everything else, the more you work in a command line environment the easier it gets. Avoid the temptation to fall back to your graphical user interface to create folders or move files, constant trial and error is worth it in the long run. 182 | 183 | 184 | ## A resequencing workflow 185 | 186 | As discussed we will be calling variants in a subset of the genome of [NA12878](https://catalog.coriell.org/0/sections/search/Sample_Detail.aspx?Ref=NA12878&product=DNA), annotating and filtering these for comparison with existing standards. To start with we need data to work with. 
The general workflow follows Erik Garrison’s excellent [FreeBayes tutorial](http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html) and you are encouraged to work through it at the end of the course to learn more about the involved methods. 187 | 188 | ### Obtaining read data 189 | 190 | To call variants we need an sequence data in [FASTQ format](http://en.wikipedia.org/wiki/FASTQ_format). We also need a [reference genome](https://en.wikipedia.org/wiki/Reference_genome) in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format). As mentioned before, the CEU hapmap sample `NA12878` is a good starting point as it is widely used to assess sequencing quality and workflow accuracy. It is also publicly available and comes with a 'truth' set that is constantly being improved by the [Genome in a Bottle consortium](http://www.genomeinabottle.org/). Sequencing data is available from a number of sources including [Illumina's Platinum Genome collection](http://www.illumina.com/platinumgenomes/), but for this course we will be using a low-coverage sequence data set (at ~5X genomic coverage) generated by the [1000 Genomes Project](http://www.1000genomes.org/). Specifically, we will be using reads just from chromosome 20 -- i.e., these reads have previously been aligned to the human genome and will be already of reasonably good quality, something you cannot expect for regular sample data. 191 | 192 | Before we can grab the read and reference data there is one final administrative step we need to do. We told your host operating system where to find the folder that is to be used to share data between itself and the VM; now we need to connect to it (mount it) from inside UNIX. Back to the command line you will add yourself to a new user group that allows access to the folder: 193 | 194 | sudo adduser vagrant vboxsf 195 | 196 | Then navigate to the standard UNIX directory where you can 'mount' filesystems and create a directory for your shared folder: 197 | 198 | cd /mnt/ 199 | sudo mkdir ngs 200 | 201 | Finally, connect the shared folder to the newly created directory: 202 | 203 | sudo mount -t vboxsf SHARENAME -o rw,uid=1000,gid=1000,umask=0000,dmode=777 ngs 204 | 205 | where `SHARENAME` is the folder share name you picked when setting up the VirtualBox shared folders (do not include the full path; the folder name alone will suffice). Try copying something into that shared folder on your desktop or laptop, then check if you can see it from the terminal: 206 | 207 | ls -alih ngs/ 208 | 209 | We will also require a reference genome for the alignment. We could align to the whole human genome, but since we are focusing on reads from chromosome 20 we will just get a copy of this chromosome. You have two options for getting this chromosome — either directly from the source at UCSC with the `wget` command, or by retrieving it from your home directory. To minimize the number of downloads we pre-packaged all files required for your home directory. Let's move into the ngs directory and create a new directory top copy in the data: 210 | 211 | cd ngs 212 | mkdir data 213 | cp ~/reference/chr20.fa data/ 214 | 215 | If you are curious, this is how you would have gotten the data from UCSC: 216 | 217 | wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr20.fa.gz 218 | gunzip chr20.fa.gz 219 | 220 | Take a look at your reference chromosome using `less` or `head`. If this is the reference genome, why are you seeing so many `N` nucleotides in the sequence? 
221 | 222 | head chr20.fa 223 | 224 | Next, grab the sequencing data. This would normally have been provided by a collaborator or your sequencing facility: 225 | 226 | cp ~/sequence/reads* data/ 227 | 228 | You should have two files in FASTQ format in your directory now — a single sample sequenced in paired-end mode. We can check by moving into the data directory and listing all files: 229 | 230 | cd data 231 | ls -alih reads* 232 | 233 | ### Quality Controls 234 | 235 | Reads from the FASTQ file need to be mapped to a reference genome in order to identify systematic differences between the sequenced sample and the human reference genome. However, before we can delve into read mapping, we first need to make sure that our preliminary data is of sufficiently high quality. This involves several steps: 236 | 237 | 1. Obtaining summary quality statistics for the reads and reviewing diagnostic graphs 238 | 2. Filtering out genetic contaminants (primers, vectors, adaptors) 239 | 3. Trimming or filtering low-quality reads 240 | 4. Recalculating quality statistics and review diagnostic plots on filtered data 241 | 242 | This tends to be an interactive process, and you will have to make your own decisions on what you consider acceptable quality. For the most part sequencing data tends to be good enough that it won’t need any filtering or trimming as modern aligners will _soft-clip_ reads that do not perfectly align to the reference genome — we will show you examples of this during a later module. 243 | 244 | 245 | #### Exploring the FASTQ file 246 | 247 | Go back to where you downloaded the sequencing data in FASTQ format and take a look at the contents of your file: 248 | 249 | ls -alih 250 | head reads.end1.fq 251 | 252 | The reads are just for human chromosome 20 which amounts to read data from about 2% of the human genome, and sequenced at only 5X coverage. Current standard for whole genome sequencing is closer to 30-40X, so you can expect _actual_ WGS read data files to be about 400 times larger than your test data. 253 | 254 | > What additional information does FASTQ contain over FASTA? Pick the first sequence in your FASTQ file and note the identifier. What is the quality score of it's first and last nucleotide, respectively? 255 | 256 | For a quick assessment of read quality you will want to stick to standard tools such as [FASTQC](http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) in most cases. FASTQC will generate a report in HTML format which you can open with any web browser. 257 | 258 | fastqc --version # record the version number! 259 | fastqc reads.end1.fq 260 | fastqc reads.end2.fq 261 | 262 | In order to look at the FastQC output you will need to copy the html report file to the directory you mounted from your host environment previously if you are not already working out of `/mnt/ngs/`. When you copy over remember to add `sudo` to the command. Navigate to the shared folder on your host OS and take a look at the HTML reports with a web browser (`reads.end1_fastqc`). 263 | 264 | > Take a look at the overall quality of your data. What is the number of reads? Are there differences in paired read 1 vs 2 in overall quality? Can you spot the adapter sequence still being present? Take a look at examples of sequencing gone wrong from a [Core conference call](http://bioinfo-core.org/index.php/9th_Discussion-28_October_2010). 
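If you want to sanity-check the read count that FASTQC reports, one way to count reads directly on the command line (assuming standard four-line FASTQ records) is:

    echo $(( $(wc -l < reads.end1.fq) / 4 ))    # number of lines divided by four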
265 | 266 | 267 | #### Error sources in sequencing 268 | 269 | Each sequencing technology comes with its own inherent source of errors which not only lead to an overall variation of quality but sometimes quite specific error models that can be corrected for. We will guide you through some examples as part of this presentation, but most errors can either be fixed by a simple two stage process (filtering out contamination and removing low quality bases) or, actually, ignored. Modern aligners ‘soft-clip’ the part of reads that they cannot successfully align, and these unaligned parts of a read will be ignored during the variant calling step. 270 | 271 | 272 | #### Screen for adapter sequences 273 | 274 | For the purposes of this module we assume that you might want to trim adapter sequences from your reads with a tool such as [cutadapt](http://cutadapt.readthedocs.org/en/latest/guide.html). 275 | 276 | As you are working through your data manually on the command line it is important that you keep track of what you did. A straightforward approach to do this is to copy and paste all commands that you typed into the terminal into a seperate text document. Make sure to also keep track of _where_ you are in your Unix enviromnent (the absolute directory path). For all tools you use you should also keep track of the _version_ as updates to your software might change the results. Most, if not all, software can be tested for their version with the `-v` or `--version` command switch: 277 | 278 | cutadapt --version 279 | 280 | Keep track of where the output ends up if not mentioned explicitly in the command itself. It can also be helpful to keep track of any output your software generates in addition to the final result. 281 | 282 | For our data set we will trim off a standard adapter from the 3'-ends of our reads. Cutadapt can handle paired-end reads in one pass (see the documentation for details). While you can run cutadapt on the paired reads separately it is highly recommended to process both files at the same time so cutadapt can check for problems in the input data. The `-p` / `--paired-output` flag ensures that cutadapt checks for proper pairing of your data and will raise an error if read names in the files do not match. It also ensures that the paired files remain synchronized if the filtering and trimming ends up in one read being discarded -- cutadapt will remove the matching paired read from the second file, too. 283 | 284 | cutadapt is very flexible and can handle multiple adapters in one pass (in the same or different position), take a look at the excellent documentation. It is even possible to trim more than one adapter from a given read though at this point you might want to go back to the experimental design. The cutadapt commands for adaptor trimming and quality trimming are provided below, but do not type them into the command line just yet: 285 | 286 | # **Do not run this code.** 287 | cutadapt --format=fastq -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATC -o fq1_tmp.fq -p fq2_tmp.fq reads.end1.fq reads.end2.fq > cutadapt.log 288 | 289 | cutadapt --quality-cutoff=5 --format=fastq -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATC -o fq1_trimmed.fq -p fq2_trimmed.fq fq1_tmp.fq fq2_tmp.fq > cutadapt2.log 290 | 291 | rm fq1_tmp.fq fq2_tmp.fq 292 | 293 | Since the adapter sequences are prone to errors while typing them in we have provided you with a small shell script in your home directory (`cut.sh`) that you can apply. To run the shell script, type in the commands below: 294 | 295 | mv ~/cut.sh . 
296 | 297 | # Take a look at the contents of the script 298 | less cut.sh 299 | 300 | # Run the script 301 | sh cut.sh > cutadapt.log 302 | 303 | We have used a pipe to re-direct cutadapt's output to a logfile. This will allow you to go back to your recorded logfiles to explore additional information, e.g., how many adapters were removed. Different tools have different ways of reporting log messages and you might have to experiment a bit to figure out what output to capture: you can redirect standard output with the `>` symbol which is equivalent to `1>` (standard out); other tools might require you to use `2>` to re-direct the standard error instead. 304 | 305 | > Explore the CutAdapt output, making use of the [manual](http://cutadapt.readthedocs.org/en/latest/guide.html#cutadapt-s-output) if something doesn’t seem to make sense. 306 | 307 | 308 | ### Trimming and filtering by quality 309 | 310 | After reviewing the quality diagnostics from FASTQC decide on whether you want to trim lower quality parts of your read and filter reads that have overall lower quality. This step is optional in that aligners and subsequent variant calling steps can be configured to discard reads aligning poorly or to multiple genomic locations, but it can be good practice. For read data sets of particularly poor quality a trimming and filtering step will also reduce the overall number of reads to be aligned, saving both processing time and storage space. 311 | 312 | There are a number of tools out there that specialize in read trimming, but luckily cutadapt can _also_ handle quality-based read trimming with the `-q` (or `--trim-qualities`) parameter. This can be done before or after adapter removal, and expects your FASTQ data to be in standard Sanger FASTQ format. Cutadapt uses the same approach as aligners such as bwa: it removes all bases starting from the _end_ of the read where the quality is smaller than a provided threshold while allowing some good-quality bases among the bad quality ones. Let’s use a quality cutoff of 20 for now: 313 | 314 | cutadapt -q 20 -o final1.fastq -p final2.fastq fq1_trimmed.fq fq2_trimmed.fq > cutadapt_qual.log 315 | less cutadapt_qual.log 316 | 317 | After you have filtered your data generate another `FASTQC` quality report and compare the report to your previous one. 318 | 319 | fastqc final1.fastq 320 | 321 | > Do you have an improvement in read quality? What has changed and why? Keep in mind that there are no universal recipes on how to best filter. Your approach will depend on the use case. As mentioned before, most aligners are perfectly capable of handling sequencing errors or stretches of low sequence quality — the same poor quality sequences might wreak havoc on an attempt to assemble a genome _de novo_. Experiment! 322 | 323 | Move or copy the final FASTQ files do a separate directory for the next stage, read alignment. Keep track of where you move files at all times. 324 | 325 | 326 | ### Read alignment 327 | 328 | Next we are going to look at the steps we need to take once we have a clean, filtered FASTQ file that is ready for alignment. This alignment process consists of choosing an appropriate reference genome to map your reads against, and performing the read alignment using one of several alignment tools such as [NovoAlign](http://www.novocraft.com/main/page.php?s=novoalign) or [BWA-mem](https://github.com/lh3/bwa). 
The resulting output file can then be checked again for the quality of alignment as all subsequent steps rely on the reads having been placed at the correct location of the reference genome, so it pays to spend some extra time to verify the alignment quality. 329 | 330 | To avoid excessive runtimes during later steps we are not aligning the sequences against the whole human genome, but will just use chromosome 20 of human genome build 19 (hg19) which you downloaded before. For our example we will be using [bwa-mem](https://github.com/lh3/bwa) which has become one of the standard tools for aligning Illumina reads >75bp to large genomes. 331 | 332 | > Have a look at the [bwa options page](http://bio-bwa.sourceforge.net/bwa.shtml). While you will be running bwa-mem with the default parameters in this module your use case might require a change of parameters. This is best done in combination with a good benchmarking set (simulated or otherwise) to assess the impact of any parameter changes you introduce. 333 | 334 | For the actual alignment we will need our chr20 reference sequence and the trimmed read data in FASTQ format. Create a new directory and copy or move the relevant files over: 335 | 336 | cd .. 337 | mkdir alignment 338 | mv data/final*.fastq alignment/ 339 | mv data/chr20.fa alignment/ 340 | cd alignment 341 | 342 | Before we can align reads, we must index the genome sequence: 343 | 344 | bwa index chr20.fa 345 | 346 | With the index in place we can align the reads using BWA's _Maximal Exact Match_ algorithm (bwa-mem, see the [manual](http://bio-bwa.sourceforge.net/bwa.shtml) for bwa options). 347 | 348 | bwa mem -M chr20.fa final1.fastq final2.fastq 2> bwa.err > na12878.sam 349 | ls -alih 350 | 351 | > Take a look at the output file. Note it’s size relative to FASTQ. How long did it take to run? Now extrapolate to how long you would expect this tool to run when mapping to the entire genome (approximately). What about using whole genome data instead of whole exome? 352 | 353 | 354 | #### SAM/BAM 355 | 356 | Many aligners produce output in ["SAM" format](http://samtools.sourceforge.net/) (Sequence Alignment/Map format, see [publication from Heng Li](http://bioinformatics.oxfordjournals.org/content/early/2009/06/08/bioinformatics.btp352) for more details). 357 | 358 | less na12878.sam 359 | 360 | > Explore the SAM format. What key information is contained? What is in the header? 361 | 362 | Do read up on some of the available SAM information (e.g., from the original publication). One topic that will come up frequently is the need for _read groups_ to be associated with a file. From the Broad’s GATK documentation page: 363 | 364 | > "Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. A read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample." 
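Our simple bwa-mem command above did not attach a read group. If a downstream tool insists on one, a common way to add it is at alignment time with bwa's `-R` option; the ID and SM values below are placeholders rather than values used in this module:

    bwa mem -M -R '@RG\tID:lane1\tSM:NA12878\tPL:illumina' chr20.fa final1.fastq final2.fastq > na12878_rg.sam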
365 | 366 | 367 | #### Mark duplicates 368 | 369 | We now have paired end reads aligned to chromosome 20 and are almost ready for variant calling, but we need to remove duplicate reads first. These originate mostly from library preparation methods (unless sequenced very deeply) and bias subsequent variant calling. We will be using [samblaster](https://github.com/GregoryFaust/samblaster) for this step. From samblaster's description: 370 | 371 | > "samblaster is a fast and flexible program for marking duplicates in read-id grouped paired-end SAM files. It can also optionally output discordant read pairs and/or split read mappings to separate SAM files [..]". 372 | 373 | This latter feature comes in handy when looking for structural variation, but here we are mostly interested in its ability to flag duplicate reads. It expects paired-end data with a sequence header, grouped by read-id, that is, all reads with the same read-id are in adjacent lines -- the default output for most aligners such as bwa-mem. samblaster can either discard duplicates or (the preferred option) simply mark them with a specific flag in the SAM file. 374 | 375 | In order to be called a 'duplicate', reads need to match on the sequence name, strand, and where the 5' end of the read would end up on the genomic sequence if the read is fully aligned, i.e., when ignoring any clipped reads: 376 | 377 | samblaster --version 378 | samblaster -i na12878.sam -o na12878_marked.sam 379 | 380 | Lastly, to speed up processing we will need to use SAMtools to convert the SAM file to a BAM file, allowing subsequent methods and viewers to navigate the data more easily. 381 | 382 | samtools --version 383 | samtools view -Sb -o na12878.bam na12878_marked.sam 384 | 385 | As a sidenote, samblaster and many other tools can read from standard input and write to standard out and can thus be easily inserted into a very simple 'pipeline'. The code below is a pipeline example of the last few commands that we just ran. Since we have already run this and have our bam file generated, you **do not need to run this command**: 386 | 387 | # Do not run, example only 388 | bwa mem -M chr20.fa final1.fastq final2.fastq | samblaster | samtools view -Sb - > na12878.bam 389 | 390 | This runs the bwa-mem alignment, pipes the resulting file directly into samblaster which passes the results on to samtools for conversion into a BAM file. It avoids writing multiple large files to disk and speeds up the conversion quite a bit. 391 | 392 | As final preparation steps we sort the BAM file by genomic position, something that SAMtools' `sort` subcommand handles. It asks for a _prefix_ for the final name and appends '.bam' to it: 393 | 394 | samtools sort na12878.bam na12878_sorted 395 | 396 | To speed up access to the BAM file (and as a requirement of downstream tools) we index the BAM file, once again using samtools: 397 | 398 | samtools index na12878_sorted.bam 399 | 400 | 401 | #### Assess the alignment 402 | 403 | Let's take a look at how bwa ended up aligning our reads to the reference chromosome using the Broad’s Integrative Genomics Viewer, IGV. You can access IGV via the [Broad's website](http://www.broadinstitute.org/software/igv/log-in) using the 'Java Web Start' button, but you will have to register first. Start IGV which will take a while to load and may ask for permissions; as it is a Java application you may also be asked to install Java first. 404 | 405 | In the meantime, prepare the data for viewing.
You will need the alignment in BAM format, the corresponding index file (*.bai) and the chromosome 20 reference sequence which you will also have to index: 406 | 407 | samtools faidx chr20.fa 408 | 409 | This assumes you are still running all commands in the `/mnt/ngs/` directory, i.e., the folder that is shared with your host operating system. All the files for the IGV exercise should be in the `/mnt/ngs/alignment/` directory within that shared directory. 410 | 411 | Import the two files, chr20.fa and na12878_sorted.bam, into IGV, starting with the reference chromosome ('Genomes', 'Load Genome from file') and followed by the alignment ('File', 'Load from file'). 412 | 413 | Some of the concepts you should explore include: 414 | 415 | * Expanded / collapsed views 416 | * Color alignments (reads) by different attributes 417 | * View reads as pairs 418 | * Finding regions with no coverage or very high coverage 419 | * Finding variants (red lines) 420 | * Exploring the tool tips when you hover over an alignment 421 | 422 | Note how we do not have any annotation for chr20 -- IGV does not know about it and would require manual annotation. Switch genomes: `Genome - Load from Server - Human - human hg19`. Re-load the BAM file if needed, then jump to any gene directly by entering it into the search bar (e.g., BMP2). 423 | 424 | 425 | ### Calling Variants 426 | 427 | We have our sequence data, we cleaned up the reads, aligned them to the genome, sorted the whole thing and flagged duplicates. Which is to say we are now finally ready to find sequence variants, i.e., regions where the sequenced sample differs from the human reference genome. 428 | 429 | Some of the more popular tools for calling variants include samtools, the [GATK suite](https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq) and FreeBayes. While it can be useful to work through the GATK Best Practices we will be using [FreeBayes](https://github.com/ekg/freebayes) in this module as it is just as sensitive and precise, but has no license restrictions. 430 | 431 | FreeBayes uses a Bayesian approach to identify SNPs, InDels and more complex events as long as they are shorter than individual reads. It is haplotype based, that is, it calls variants based on the reads aligned to a given genomic region, not at individual genomic positions. In brief, it looks at read alignments from an individual (or a group of individuals) to find the most likely combination of genotypes at each reference position and produces a variant call file (VCF) which we will study in more detail later. It can also make use of priors (i.e., known variants in the population) and adjust for known copy numbers. Like GATK it includes a re-alignment step that left-aligns InDels and minimizes alignment inconsistencies between individual reads. 432 | 433 | In principle FreeBayes only needs a reference in FASTA format and the BAM-formatted alignment file with reads sorted by position and with read-groups attached. Let's start with a clean directory: 434 | 435 | cd .. 436 | mkdir variants 437 | mv alignment/chr20.fa variants/ 438 | mv alignment/na12878_sorted.* variants/ 439 | 440 | If you are getting an error message with the `mv` command it most likely means that you are still running IGV which is keeping a lock on these files. Just copy (`cp`) the files instead of moving them in this case.
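One small housekeeping note: the commands that follow assume you are working inside the new `variants` directory. While you are there, a quick sanity check of the BAM you are about to call variants on does not hurt; a minimal, optional sketch using samtools (the exact numbers you see will depend on your data):

    # Change into the new directory (the following steps assume you are in variants/)
    cd variants
    # Optional sanity check: summarizes mapped reads, properly paired reads and marked duplicates
    samtools flagstat na12878_sorted.bam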
441 | 442 | To see _all_ options that FreeBayes offers take a look at the manual or run: 443 | 444 | freebayes --help 445 | 446 | For now, we will simply call variants with the default parameters which should only take a couple of minutes for our small data set: 447 | 448 | freebayes -f chr20.fa na12878_sorted.bam > na12878.vcf 449 | 450 | > Remember the 400x size reduction. How long might this have taken for all of hg19? 451 | 452 | Like other methods we looked at, FreeBayes follows standard Unix conventions and can be linked to other tools with pipes. FreeBayes allows you to only look at specific parts of the genome (with the `--region` parameter), to look at multiple samples jointly (by passing in multiple BAM files which are processed in parallel), to consider different ploidies, or to re-call at specific sites. The GitHub page has many examples worth exploring. 453 | 454 | 455 | #### Understanding VCFs 456 | 457 | The output from FreeBayes (and other variant callers) is a VCF file (in standard 4.1 format), a matrix with variants as rows and participants as columns. It can handle allelic variants for a whole population of samples or simply describe a single sample as in our case. 458 | 459 | Now take a look at the results FreeBayes generated for the NA12878 data set: 460 | 461 | less -S na12878.vcf 462 | 463 | You will see the header which describes the format, when the file was created, the FreeBayes version along with the command line parameters used and some additional column information: 464 | 465 | ``` 466 | ##fileformat=VCFv4.1 467 | ##fileDate=20150228 468 | ##source=freeBayes v0.9.14-15-gc6f49c0-dirty 469 | ##reference=chr20.fa 470 | ##phasing=none 471 | ##commandline="freebayes -f chr20.fa na12878_sorted.bam" 472 | ##INFO= 473 | ##INFO= 474 | ``` 475 | 476 | Followed by the variant information: 477 | 478 | ``` 479 | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown 480 | chr20 61275 . T C 0.000359121 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=7.35324;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=40;NS=1;NUMALT=1;ODDS=9.85073;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=4;QR=40;RO=1;RPP=7.35324;RPPR=5.18177;RUN=1;SAF=0;SAP=7.35324;SAR=2;SRF=1;SRP=5.18177;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 0/0:3:1:40:2:4:0,-0.0459687,-3.62 481 | chr20 61289 . A C 0.00273171 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=3;CIGAR=1X;DP=4;DPB=4;DPRA=0;EPP=3.73412;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=7.94838;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=15;QR=40;RO=1;RPP=9.52472;RPPR=5.18177;RUN=1;SAF=1;SAP=3.73412;SAR=2;SRF=0;SRP=5.18177;SRR=1;TYPE=snp GT:DP:RO:QR:AO:QA:GL 0/0:4:1:40:3:15:-0.79794,0,-3.39794 482 | ``` 483 | 484 | The first columns are relatively straightforward and represent the information we have about a predicted variation. CHROM and POS provide the contig information and position where the variation occurs. ID is the dbSNP rs identifier (after we annotated the file) or a `.` if the variant does not have a record in dbSNP based on its position. REF and ALT represent the reference allele and the alternate allele observed in the sample, always on the forward strand. 485 | 486 | QUAL then is the Phred-scaled probability that the observed variant exists at this site. It is again on a scale of -10 * log10(1-p), so a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. Ideally you would need nothing else to filter out bad variant calls, but in reality we still need to filter on multiple other metrics,
which we will describe in the next module. If the FILTER field is a `.` then no filter has been applied, otherwise it will be set to either PASS or show the (quality) filters this variant failed. 487 | 488 | The last columns contain the genotypes and can be a bit more tricky to decode. In brief, we have: 489 | 490 | * GT: The genotype of this sample. For a diploid organism the two alleles are encoded with a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, and so on. With a single ALT allele (by far the more common case), GT will be either 0/0 (homozygous reference), 0/1 (heterozygous), or 1/1 (homozygous for the alternate allele). 491 | * GQ: the Phred-scaled confidence for the genotype 492 | * AD, DP: the depth per allele (AD) and the total read depth (DP) for this sample 493 | * PL: the (Phred-scaled) likelihoods of the possible genotypes 494 | 495 | Let’s look at an example from the FreeBayes tutorial: 496 | 497 | ``` 498 | chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 499 | ``` 500 | 501 | To use Eric’s description of this particular entry: 502 | 503 | > At this site, the called genotype is GT = 0/1, which is C/T. The confidence indicated by GQ = 25.92 isn't so good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). 504 | > The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (=homozygous with the variant allele) since PL(1/1) = 26, which corresponds to 10^(-2.6), or 0.0025, but either way, it's clear that the subject is definitely not "hom-ref" (=homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number. 505 | 506 | The Broad's [VCF guide](https://www.broadinstitute.org/gatk/guide/article?id=1268) has more information and lists a number of metrics presented in the INFO field to get you started. 507 | 508 | 509 | #### Filtering VCFs 510 | 511 | By default FreeBayes does almost no filtering, only removing low-confidence alignments or alleles that are supported only by low-quality bases in the reads. It expects the user to subsequently flag or remove variants that have a low probability of being true. A method from the [vcflib package](https://github.com/ekg/vcflib), `vcffilter`, enables us to quickly subset our VCF based on the various quality attributes: 512 | 513 | vcffilter -f "QUAL > 20" na12878.vcf > na12878_q20.vcf 514 | 515 | This example removes any site with a QUAL value of 20 or less -- QUAL is again a Phred score as described in our FASTQ survey, and a value of 20 corresponds to an error probability of 0.01, i.e., a probability of a polymorphism at this site >= 0.99. 516 | 517 | Another nifty tool from the [bcftools](http://samtools.github.io/bcftools/bcftools.html) package allows us to aggregate our VCF data for chromosome 20 and to compare the impact of different filtering steps: 518 | 519 | bcftools stats na12878_q20.vcf > vcf.log 520 | 521 | bcftools will complain about the `chr20` contig not being defined.
You can fix this by (block) compressing the file with `bgzip` which then allows tools such as `tabix` to index it: 522 | 523 | bgzip na12878_q20.vcf 524 | tabix na12878_q20.vcf.gz 525 | bcftools stats na12878_q20.vcf.gz > vcf.log 526 | less vcf.log 527 | 528 | Take a look at the output. The log contains: 529 | 530 | * Summary stats at the top (sample, number of events) 531 | * Summaries for SNPs and indels, including a breakdown of singletons (i.e., variant calls with the minimum allele frequency) 532 | * Quality breakdown of the calls 533 | * Indel size distributions 534 | 535 | For example, note the ts/tv ratio (the transition/transversion ratio), which tends to be around 2-2.1 for the human genome, although it varies between different genomic regions: 536 | 537 | vcffilter -f "QUAL > 20" na12878.vcf | bcftools stats | grep "TSTV" 538 | 539 | Again, we use pipes to first create a VCF subset that we then aggregate, finally using grep to pick out the statistic of interest. Compare this to the variants called with lower confidence: 540 | 541 | vcffilter -f "QUAL < 20" na12878.vcf | bcftools stats | grep "TSTV" 542 | 543 | Filters in the Q20-Q30 range have been shown to work reasonably well for most samples, but might need to be combined with filters on other statistics, such as excluding regions with very low coverage or, perhaps surprisingly at first glance, regions of exceptionally high coverage -- these are frequently repeat regions attracting many mismapped reads. Other good filters include those that test for strand biases, that is, excluding variants found almost exclusively on one strand, which indicates a read duplication problem, or those that test for allelic frequency which, at least for germline mutations, should be close to 50% or 100% given sufficient coverage. 544 | 545 | For most purposes 'hard' filters for variant calls work well enough (as opposed to filters learned from the variant calls in combination with a truth set). You can find different filtering criteria in the literature and translate them into command line arguments that vcffilter can use. For example, following recommendations based on [Meynert et al](http://www.ncbi.nlm.nih.gov/pubmed/23773188) you could use: 546 | 547 | (AF[0] <= 0.5 && (DP < 4 || (DP < 13 && QUAL < 10))) || 548 | (AF[0] > 0.5 && (DP < 4 && QUAL < 50)) 549 | 550 | You will recognize the `QUAL` filter. This construct distinguishes between heterozygous calls -- AF (allele frequency) at 0.5 or below -- and homozygous calls, checks for the combined depth across samples (DP) and uses different quality filters for different depth levels. 551 | 552 | 553 | #### Benchmarks 554 | 555 | When trying to figure out which possible disease-causing variants to focus on, it becomes crucial to identify likely false positive calls (but also to get a general idea of the false negative rate). One way to get to these metrics is to run data for which 'gold standards' exist. As mentioned before, NA12878 is one of the best studied genomes and has become such a gold standard, in part through efforts by NIST's Genome in a Bottle consortium which has created and validated variant calls from a large number of different sequencing technologies and variant calling workflows. These variant calls are [available for download](https://sites.stanford.edu/abms/content/giab-reference-materials-and-data) along with supplementary data such as a list of regions for which we think we can or cannot make highly confident variant calls -- e.g., most repetitive regions are excluded from benchmarking.
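Besides the visual comparison in IGV described below, bcftools can also give you a quick numerical comparison: when `bcftools stats` is given two indexed VCFs it reports counts for the records unique to each file and those shared by both. A sketch, assuming you have already retrieved and indexed the GiaB chromosome 20 calls as shown in the 'Visually exploring the VCF' section below:

    # Sketch: compare our filtered calls against the GiaB truth set for chr20
    # (requires both files to be bgzip-compressed and tabix-indexed)
    bcftools stats na12878_q20.vcf.gz na12878.chr20.giab.vcf.gz > giab_comparison.log
    less giab_comparison.log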
556 | 557 | These benchmarks are invaluable for testing. It is good practice to re-run these whenever you change your workflow or try to decide what filtering criteria to use. Brad Chapman has a few real world examples on his blog on how to use the GiaB set to [optimize workflows](http://bcb.io/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/), or to use the somatic variant benchmark set from the ICGC-TCGA DREAM challenge to [assess cancer variant callers](https://github.com/chapmanb/bcbb/blob/master/posts/cancer_validation.org). 558 | 559 | If you are curious, the [FreeBayes tutorial](http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html) has more details on how to generate the statistics for our data set. If you run the numbers for our data you'll notice that we only have a sensitivity of around 70% which is to be expected given the very shallow coverage, but it seems we can keep the false discovery rate below 5% at least. 560 | 561 | For now we will just retrieve the GiaB 'truth' set for chromosome 20 of the hg19 genome build, and contrast it with our own calls in IGV. 562 | 563 | 564 | #### Visually exploring the VCF 565 | 566 | You already have your variant calls in bgzip'ed format (`na12878_q20.vcf.gz`) along with an index created by tabix. Let's get the GiaB gold standard which we already pre-filtered to only include data for chromosome 20 and create an index for it: 567 | 568 | cp ~/giab/na12878.chr20.giab.vcf.gz . 569 | tabix na12878.chr20.giab.vcf.gz 570 | 571 | Go back to IGV which should still be running (if not, revisit the Alignment Assessment module to start IGV and import both the indexed reference chromosome and your aligned reads in BAM format). Load both your own VCF file as well as the GiaB set and check for good matches between variant calls and the reads. Try to find homozygous and heterozygous variants, and check for cases where the reads indicate a difference to the reference genome, but no variant was called. Try to get an impression of how well your variant calls track the GiaB standard. By and large you'll find a few missing calls, but very few false positives. 572 | 573 | You can also upload your filtered VCF to the [GeT-RM browser](http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/browse/), a CDC project to establish reference materials for resequencing studies which collaborates with the GiaB project. 574 | 575 | 576 | #### Annotating a VCF file 577 | 578 | During the next session we will annotate the VCF file and use this annotation to select a small number of 'novel' variants that might be of interest based on their functional annotation. 579 | 580 | Let's start by testing whether the variants we called are present in [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/), the database of short genetic variations. Normally you would get the full database from a download site at NCBI or use the Broad's [resource bundles](https://www.broadinstitute.org/gatk/guide/article.php?id=1213), but the full data set is > 2GB. Instead, we prepared a smaller subset containing just 'high confidence' entries (dbsnp v138, excluding sites after v129) limited to chr20. 
If you are curious why we are using the 'before 129' version, a number of articles from [MassGenomics](http://massgenomics.org/2012/01/the-current-state-of-dbsnp.html) and FinchTalk [here](http://finchtalk.geospiza.com/2011/01/dbsnp-or-is-it.html) and [here](http://finchtalk.geospiza.com/2011/03/flavors-of-snps.html) provide lots of context; in brief, this excludes the majority of variants detected by next-gen sequencing, many of which have not been properly annotated just yet. 581 | 582 | cd .. 583 | mkdir gemini 584 | cp variants/na12878_q20* gemini/ 585 | cd gemini 586 | cp ~/gemini/dbsnp.138.chr20.vcf.gz . 587 | tabix dbsnp.138.chr20.vcf.gz 588 | 589 | We will once again use bcftools, this time to associate the variants we called with the entries in dbSNP via the [annotate command](http://samtools.github.io/bcftools/bcftools.html#annotate): 590 | 591 | bcftools annotate -c ID -a dbsnp.138.chr20.vcf.gz na12878_q20.vcf.gz > na12878_annot.vcf 592 | 593 | Explore the file -- for a majority of the variants the previous 'unknown IDs' (the '.' annotation) have been replaced by an `rs` identifier that you can look up in the dbSNP database. This is not terribly surprising: NA12878 is one of the best-sequenced genomes, and short of true sequencing errors all variants are bound to be in public databases. An easy way to confirm that is to look at the newly generated file (scroll using the space bar): 594 | 595 | less na12878_annot.vcf 596 | 597 | You will observe some rows which still contain a '.' where other rows have been replaced with a dbSNP 'rs' identifier. We can also check this using bcftools once again (this step is optional): 598 | 599 | bcftools view -i '%ID = "."' na12878_annot.vcf | bcftools stats 600 | 601 | Another typical annotation involves assessing the _impact_ of the identified variants to distinguish between potentially harmless substitutions and more severe changes that cause loss of function (truncations, changes to splice events, etc.). A number of frameworks such as [ANNOVAR](http://www.openbioinformatics.org/annovar/) and [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) tackle this; here we will be using another popular framework, [snpEff](http://snpeff.sourceforge.net/SnpEff_manual.html). The manual is more or less required reading to get the most out of snpEff, but in brief, snpEff takes predicted variants as input and annotates these with their likely effects based on external databases. 602 | 603 | Normally you would have to download the databases matching your genome prior to annotation (or have snpEff do it automatically for you), but we included a pre-installed database for hg19 with the VM: 604 | 605 | snpEff -Xmx2G -i vcf -o vcf -dataDir ~/reference/snpeff hg19 na12878_annot.vcf > na12878_annot_snpEff.vcf 606 | 607 | We use a `Java` parameter (`-Xmx2G`) to define the available memory. You might have to tweak this depending on your setup to use just `1G`. Note that for full-sized data sets snpEff will require a minimum of 4GB of free RAM. 608 | 609 | Take a look at the output: 610 | 611 | less na12878_annot_snpEff.vcf 612 | 613 | As you can see snpEff added a fair amount of additional information into the 'ANN' info field. These are described in the [snpEff annotation section](http://snpeff.sourceforge.net/SnpEff_manual.html#ann) of the manual.
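If you would rather look at these annotations in a simple tab-separated form, bcftools can extract individual fields from the VCF. A minimal, optional sketch (the choice of fields is just an example):

    # Sketch: pull out position, quality and the snpEff ANN string as a table
    bcftools query -f '%CHROM\t%POS\t%QUAL\t%INFO/ANN\n' na12878_annot_snpEff.vcf | less -S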
As a first pass, let's see how many 'HIGH' impact variants we have found in chr20: 614 | 615 | cat na12878_annot_snpEff.vcf | grep HIGH | wc -l 616 | 617 | That's a total of six high-impact variants in just one chromosome of a healthy individual. snpEff creates HTML summaries as part of its output, so navigate to the mounted directory on your host OS and open the `snpEff_summary` file with a web browser. 618 | 619 | > Take a quick look through the results. Note the number of variants and what kinds of base changes you see. Note how there are no variant calls in centromeric regions. 620 | 621 | 622 | ### Prioritizing variants with GEMINI 623 | 624 | As you'll quickly notice, handling variant annotation in this format is quite cumbersome. Frameworks such as [GEMINI](http://gemini.readthedocs.org/en/latest/) have been developed to support an interactive exploration of variant information in the context of genomic annotations. 625 | 626 | GEMINI (GEnome MINIng) can import VCF files (and an optional [PED file](http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped)), automatically annotating them with genome annotations from sources such as ENCODE tracks, OMIM, dbSNP, etc., and storing them in a relational database that can then be queried. 627 | 628 | 629 | #### Loading a GEMINI database 630 | 631 | The annotated VCF file (in this case annotated by VEP) is loaded into a database (.db) together with external annotations, and additional population genetics statistics that support downstream analyses are computed along the way. As the annotation files required for this take multiple GBs of download we start with a pre-loaded database. The code for loading the database is provided below, but you can **skip this step** as we already created the database for you: 632 | 633 | gemini load -v dominant.vcf -t VEP -p dominant.ped --cores 4 dominant.db 634 | 635 | The database that was generated for this exercise, `dominant.db`, is located in your home directory. Move it over to your current directory: 636 | 637 | mv ~/gemini/dominant.db . 638 | 639 | The database contains variant call information for a single family of three individuals (trio). In this family, both mother and son are affected by a condition called hypobetalipoproteinemia, an autosomal dominant disorder. We will be using this database to look for variants that are likely to be associated with the disorder towards the end of this section. First, we will start with some simple queries and filters. 640 | 641 | 642 | #### Querying the database 643 | 644 | When we 'query', we are asking the database to report data that matches the requirements we provide. The query is formulated using the Structured Query Language (SQL). As we go through a few examples you will start to become familiar with the language, but if you are interested in more details [sqlzoo](http://sqlzoo.net/wiki/SQL_Tutorial) has online tutorials describing the basics. 645 | 646 | To query GEMINI, we use the `gemini query` command. Following the command we provide the query. Data is stored in tables within the database and so our query needs to specify the table and the fields/columns within that table. Read more about the database schema on the GEMINI [website](http://gemini.readthedocs.org/en/latest/content/database_schema.html). 647 | 648 | Let's start by `select`ing the columns chrom, start, end, gene `from` the variants table. Pipe `head` to the end of the command so only the first few lines are printed to screen. What information is displayed?
649 | 650 | gemini query -q "select chrom, start, end, gene from variants" dominant.db | head 651 | 652 | Adding in the `where` clause allows us to select specific rows in the table. Which variants in the table are indels? This time we will re-direct the results to a file with `>`. 653 | 654 | gemini query -q "select chrom, start, end, gene from variants where type='indel' " dominant.db > all_indels.txt 655 | 656 | All indels from the variants table (and the columns specified) are written to file. Try a `wc -l` on the filename; this is a Unix command that returns the number of lines in the file. How many indels are there? Rather than printing rows from the table, you can also ask GEMINI to report the number of lines that match your query using `count()`. Since the `count()` operation does not require a specific field, we will use the wildcard character `*` to indicate any field. 657 | 658 | gemini query -q "select count(*) from variants where type='indel' " dominant.db 659 | 660 | The number returned should match the value returned from `wc -l`. Let's try a few more queries using the `count()` operation. How many variants are SNPs? 661 | 662 | gemini query -q "select count(*) from variants where type='snp'" dominant.db 663 | 664 | For some fields the value is not numeric or character, but boolean. TRUE is equivalent to the value of 1, and FALSE is equivalent to the value of 0. Let's query a field that has boolean values. How many variants are exonic? Not exonic? 665 | 666 | gemini query -q "select count(*) from variants where is_exonic = 1" dominant.db 667 | gemini query -q "select count(*) from variants where is_exonic = 0" dominant.db 668 | 669 | The `count()` operation can also be combined with `group by` so rather than counting all instances, GEMINI will give us a breakdown of numbers per category. The impact field has multiple categories (e.g., nonsynonymous, stop-gain, etc.). How many variants are there for each type of variant impact? 670 | 671 | gemini query -q "select impact, count(*) from variants group by impact" dominant.db 672 | 673 | Multiple conditions can also be combined by using `and` within the `where` clause. For example, how many of the coding variants are SNPs? 674 | 675 | gemini query -q "select count(*) from variants where is_coding = 1 and type='snp' " dominant.db 676 | 677 | How many variants are rare _and_ in a disease-associated gene? 678 | 679 | gemini query -q "select count(*) from variants where clinvar_disease_name is not NULL and aaf_esp_ea <= 0.01" dominant.db 680 | 681 | List those genes by changing the `count()` operation to the appropriate field name: 682 | 683 | gemini query -q "select gene from variants where clinvar_disease_name is not NULL and aaf_esp_ea <= 0.01" dominant.db 684 | 685 | The above examples illustrate _ad hoc_ queries that do not account for or filter upon the genotypes of individual samples. Time to make use of that information. 686 | 687 | 688 | #### Querying genotype information 689 | 690 | Genotype information (genotype, depth, and genotype quality) for each variant is stored in GEMINI using a slightly different format, and so the syntax for accessing it is also slightly different. To retrieve the alleles for a given sample one would add `gts.subjectID` to the `select` statement. For all other information the prefixes (followed by subject ID) are as follows: `gt_types`, `gt_depths`, `gt_quals`. Try the following query and add in `--header` to keep track of what each column refers to.
We'll also pipe to `head` so only the first few lines get written to screen. 691 | 692 | For the rare variants, let's get the genotype for subject 4805 and the depth and quality of aligned sequence so that we can assess the confidence in the genotype: 693 | 694 | gemini query -q "select gene, ref, alt, gts.4805, gt_depths.4805, gt_quals.4805 from variants where aaf_esp_ea <= 0.01" --header dominant.db | head 695 | 696 | If we wanted to display information for _all samples_, rather than typing out each subjectID we could just use the wildcard character (`*`). There are many flavours of the wildcard operator that can be applied to make more complex queries (e.g., any, all, none), but these are beyond the scope of this course. We encourage you to read the [documentation](http://gemini.readthedocs.org/en/latest/content/querying.html#selecting-sample-genotypes-based-on-wildcards) for more detail. 697 | 698 | Often we want to focus only on variants where a given sample has a specific genotype (e.g., looking for homozygous variants in family trios). In GEMINI we cannot directly do this in the query, but the `gemini query` tool has an option called `--gt-filter` that allows one to specify filters to apply to the returned rows. [The filter](http://gemini.readthedocs.org/en/latest/content/querying.html#gt-filter-filtering-on-genotypes) can be applied to any of the genotype information stored. The wildcard can be combined with the filter using the syntax: 699 | 700 | --gt-filter (COLUMN).(SAMPLE_WILDCARD).(SAMPLE_WILDCARD_RULE).(RULE_ENFORCEMENT) 701 | 702 | See an example below where we report genotypes of variants in subject 4805 that have high quality (aligned depth >=50) genotypes in all samples: 703 | 704 | gemini query -q "select gene, ref, alt, gts.4805 from variants" --gt-filter "(gt_depths).(*).(>=50).(all)" --header dominant.db | head 705 | 706 | #### Disease gene hunting 707 | 708 | GEMINI also has a number of [built-in tools](http://gemini.readthedocs.org/en/latest/content/tools.html#) which incorporate the pedigree structure provided in the form of a PED file. Using this information we can go a step further and query within a single family to identify disease-causing or disease-associated genetic variants reliably from the broader background of variants. In our example we will make use of the `autosomal_dominant` tool to query the trio of samples that we have been working with thus far. This tool is useful for identifying variants that meet an autosomal dominant inheritance pattern. The reported variants will be restricted to those having the potential to impact the function of protein coding transcripts. 709 | 710 | Our PED file indicates that within our trio, both mother and son are affected. Since only one of the parents is affected, the tool will report variants where both the affected child and the affected parent are heterozygous. We will query and limit the attributes returned by using the `--columns` option. 711 | 712 | gemini autosomal_dominant dominant.db --columns "chrom, ref, alt, gene, impact" | head 713 | 714 | The first few columns that are returned include family information, genotype and phenotype for each individual. All other columns are the same fields we have been using above in our examples. We can filter results using the `--filter` option which serves as the `where` clause that we had been using previously.
For example, we could search for only variants of high impact: 715 | 716 | gemini autosomal_dominant dominant.db --columns "chrom, ref, alt, gene, impact" \ 717 | --filter "impact_severity = 'HIGH'" | head 718 | 719 | In order to eliminate less confident genotypes, we can add the `-d` option to enforce a minimum sequence depth for each sample (the default is 0). Additionally, if we had multiple families in our dataset, we could specify to GEMINI the minimum number of families for the variant to be present in using `--min-kindreds`, or we could select the specific families we want to query using `--families`. 720 | 721 | 722 | #### Is this useful? 723 | 724 | Functional annotation of human genome variation is [a hard problem](http://massgenomics.org/2012/06/interpretation-of-human-genomes.html), but results have been quite impressive. Dan Koboldt at WashU went through the 2011 literature and collected a large list of [disease-causing gene discoveries in disorders](http://massgenomics.org/2011/12/disease-causing-mutations-discovered-by-ngs-in-2011.html) and [cancer](http://massgenomics.org/2012/01/cancer-genome-and-exome-sequencing-in-2011.html), frequently with [relevance to the clinic](http://massgenomics.org/2010/04/why-we-sequence-cancer-genomes.html). In some cases such as for studies of Cystic Fibrosis even [moderate sample sizes](http://www.ncbi.nlm.nih.gov/pubmed/22772370?dopt=Abstract) can be useful. With the advent of several nationwide genomic programs we will see more and more patients being sequenced, and particularly for rare diseases and [developmental disorders](https://www.sanger.ac.uk/about/press/2014/141224.html) there are plenty of success stories. Results are by no means guaranteed, though -- for example, here are just the [most likely reasons why you cannot find a causative mutation in your data](http://massgenomics.org/2012/07/6-causes-of-elusive-mendelian-disease-genes.html). 725 | 726 | In addition, sequencing errors and systematic changes to sample materials as part of the DNA extraction and library preparation process are a serious problem. Nick Loman has a good summary of just [some of the known error sources](http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/) in a blog post triggered by a publication from the Broad Institute on [artifactual mutations due to oxidative damage](http://nar.oxfordjournals.org/content/early/2013/01/08/nar.gks1443.full.pdf?keytype=ref&ijkey=suYBLqdsrc7kH7G) -- in this case even analysis of the sample data with different sequencing technologies and bioinformatic workflows would not have made a difference. 727 | 728 | Still, it is hard to argue with the success of sequencing approaches at least for rare diseases. The ongoing [‘Idiopathic Diseases of Man’](http://www.nature.com/gim/journal/vaop/ncurrent/full/gim201521a.html) program has studied more than 100 cases so far, leading to a likely diagnosis in 60% of cases, and for 20% of cases the diagnosis has already been confirmed. And even just having a diagnosis is a [huge event for affected families](http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2). 729 | 730 | 731 | ## Recap and next steps 732 | 733 | That’s it! If you made it this far you have taken raw sequencing data all the way from the sequencing facility to annotated and prioritized variants and even identified a disease-related gene in an independent case study.
While there is tons more to learn, this should put you in an excellent starting position to identify the relevant methods and literature. Here are a few more pointers to wrap up the module. 734 | 735 | 736 | ### Calling variants in a population 737 | 738 | FreeBayes can run on individual samples or a collection of samples from many different individuals from the same family or general population. It leverages information found across the whole data set to improve confidence in genotype calls in individual samples. In short, if your study has data from multiple individuals it is almost always a good idea to run FreeBayes on all of them at the same time. 739 | 740 | ### A word on data management & reproducibility 741 | 742 | Data management is a very important aspect of any large-scale analysis like variant detection. There are various tools and best-practices descriptions available to help with this, e.g. [git](http://git-scm.com/) for version control and [this PLOS paper](http://www.ploscompbiol.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pcbi.1000424&representation=PDF) for ideas on best practices in computational biology. 743 | 744 | Data management ties in with the concept of reproducibility in that the better your bookkeeping, the better you will be able to describe your methods and therefore make them more reproducible. 745 | 746 | ### Workflow systems 747 | 748 | As your data analysis needs grow more complex you will need to move away from typing commands in a shell environment. The standard bioinformatics behaviour is to write lots of shell scripts that pipe the output of one workflow step into the input of the next script, but that tends to get messy fast. Logfiles get messed up, it becomes difficult to follow naming conventions, and it is unclear how or where to restart a run that failed halfway through the process. 749 | 750 | At the very least, consider frameworks such as [bpipe](https://github.com/ssadedin/bpipe) which come with all kinds of goodies: automatic renaming of files, log file generation, the ability to resume failed runs and interfaces to most cluster resource managers. 751 | 752 | Beyond that, frameworks such as [SpeedSeq](https://github.com/cc2qe/speedseq) and [bcbio](https://bcbio-nextgen.readthedocs.org/) provide additional flexibility: they install tools and references for you and simply require high level configuration files to drive the best practice analysis of your DNA- and RNA-Seq data. This helps tremendously when it comes to making your analysis reproducible and keeping tools and workflows current. Frameworks such as bcbio also come with support for just about every cluster scheduler as well as the ability to deploy the whole workflow on Amazon's AWS environment. 753 | __________________ 754 | This module was partially sponsored by the Harvard Medical School Tools and Technology (TnT) Committee and Harvard NeuroDiscovery Center (HNDC). 755 | __________________ 756 | --------------------------------------------------------------------------------