├── LICENSE.md
├── README.md
├── VMbuild.README
├── _config.yml
├── questionbox.png
├── task1
│   ├── README.md
│   ├── e-coli-genome.fasta.gz
│   ├── e-coli-h20-genome.fasta.gz
│   ├── e-coli-k12-genome.fasta.gz
│   └── gcloud-download.png
├── task1b
│   └── README.md
├── task2
│   ├── README.md
│   ├── mt_barcodes.txt
│   ├── mt_reads.fastq.gz
│   └── tablet-coverage-plot.png
├── task3
│   ├── 16Ssearch.png
│   ├── README.md
│   ├── installingArtemis.md
│   ├── mysteryGenome.fna.gz
│   ├── ntsearch.png
│   └── uniprot2go.py
├── task4
│   ├── README.md
│   ├── act.png
│   ├── installingArtemis.md
│   └── l-terrestris.genome.fa
├── task5
│   └── README.md
├── task6
│   ├── README.md
│   ├── ecoli-rel606.fa.gz
│   └── example-pileup.png
├── task7
│   ├── README.md
│   ├── ftp-list.txt
│   └── runSalmon.bash
├── task8
│   ├── README.md
│   ├── barplot1.png
│   ├── barplot2.png
│   ├── bubbleplot.png
│   ├── bubbleplot1.png
│   └── pheatmap.png
└── task9
    ├── README.md
    ├── SNP-heatmap.png
    ├── alleleFreq.png
    └── igsr_samples.tsv
/LICENSE.md:
--------------------------------------------------------------------------------
1 | 
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # learn-genomics-in-linux
2 |
3 | This repository contains a tutorial on how to use command-line genomics/bioinformatics tools in Linux.
4 | It has been designed for BIOL 469 (Genomics) and BIOL 614 (Bioinformatics Tools and Techniques) @ the University of Waterloo.
5 |
6 | It is divided into a series of tasks:
7 |
8 | * [Task 1](task1/) - Learning the linux command line
9 | * [Task 1b](task1b/) - Command-line BLAST
10 | * [Task 2](task2/) - Genome Assembly
11 | * [Task 3](task3/) - Genome Annotation
12 | * [Task 4](task4/) - Comparative genomics: synteny comparison between two genomes
13 | * [Task 5](task5/) - Comparative genomics: gene set comparison between two genomes
14 | * [Task 6](task6/) - Resequencing: variant calling from NGS data
15 | * [Task 7](task7/) - Transcriptomics and detection of differentially expressed genes
16 | * [Task 8](task8/) - 16S rRNA amplicon sequencing analysis using Kraken2+Bracken
17 | * TBD - ChIP-seq data analysis
18 | * TBD - GWAS
19 | * TBD - Metagenomics: taxonomic and functional profiling
20 |
21 |
22 | # Requirements & Software Installation
23 |
24 | The main requirement is that you have access to a Linux-based OS such as Ubuntu. If you have access through a remote server that has been supplied to you, then you are set. If not, one option is to install Linux as a virtual machine on your local system, or to use the Google Compute Engine (e.g., with [free credits](https://cloud.google.com/free/)).
25 |
26 |
27 | Most of the programs we will use are text-based and can be run directly in the shell; however, some graphical programs (e.g., Tablet, Artemis, fastqc) will be used as well. You can install these on your own local machine.
28 |
29 | Detailed software requirements will be listed at the beginning of each Task. Alternatively, if you wish to install all the programs for the course beforehand, please install the software listed below:
30 |
31 | * [tablet](https://ics.hutton.ac.uk/tablet/)
32 | * [artemis and act](http://sanger-pathogens.github.io/Artemis/Artemis/)
33 | * [bandage](http://rrwick.github.io/Bandage/)
34 | * [R](https://www.r-project.org/)
35 |
36 | We have also documented installation instructions for building a bioinformatics system (Ubuntu 18.04) [here](https://github.com/doxeylab/learn-genomics-in-unix/blob/master/VMbuild.README).
37 |
38 |
39 |
40 | # Contact
41 |
42 | If you have any questions, please contact acdoxey at uwaterloo dot ca.
43 |
44 | Enjoy
45 |
--------------------------------------------------------------------------------
/VMbuild.README:
--------------------------------------------------------------------------------
1 | ### BUILD INSTRUCTIONS FOR GOOGLE COMPUTE ENGINE GENOMICS SERVER
2 |
3 | # Set up a 16-core Ubuntu 18.04 VM with 500 Gb of SSD
4 |
5 | # Logged in via terminal (supplied my ssh public key to the gcloud meta ssh keys page in the console)
6 |
7 | # update package list
8 | sudo apt update
9 |
10 | # install bioinformatics software
11 | sudo apt install fastqc velvet abyss fasttree prodigal barrnap bcftools
12 |
13 | # comment out following line (/etc/java-X-openjdk/accessibility.properties) to prevent Java issues
14 | # assistive_technologies=org.GNOME.Accessibility.AtkWrapper
15 |
16 | # install fastx toolkit
17 | wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
18 | tar xvjf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
19 | cd bin
20 | sudo mv * /usr/bin/.
21 |
22 | # install artemis (optional)
23 | wget ftp://ftp.sanger.ac.uk/pub/resources/software/artemis/artemis.tar.gz
24 | tar zxf artemis.tar.gz
25 |
26 | # install prokka
27 | sudo apt-get install libdatetime-perl libxml-simple-perl libdigest-md5-perl git default-jre bioperl
28 | sudo apt install ncbi-tools-bin
29 | sudo cpan Bio::Perl
30 | sudo cpan Bio::SearchIO::hmmer3
31 | git clone https://github.com/tseemann/prokka.git $HOME/prokka
32 | $HOME/prokka/bin/prokka --setupdb
33 |
34 | # then copy the prokka folder to /usr/bin
35 |
36 | # install tablet (optional)
37 | wget https://ics.hutton.ac.uk/resources/tablet/installers/tablet_linux_x64_1_17_08_17.sh
38 | sh tablet_linux_x64_1_17_08_17.sh
39 |
40 | # install pip (for python packages)
41 | # note: this setup required Python 2.7
42 | # sudo apt install python-pip # OLD
43 | # pip install getopt pysqlw # OLD
44 | # on newer systems, install pip for Python 2.7 manually:
45 | # sudo apt install python2.7
46 | # wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
47 | # sudo python2.7 get-pip.py
48 | # python2.7 -m pip install pysqlw
49 |
50 |
51 | # installing uniprot2go.py (Doxey Lab script)
52 | wget https://github.com/doxeylab/learn-genomics-in-unix/raw/master/task3/uniprot2go.py
53 | chmod +x uniprot2go.py
54 | sudo mv uniprot2go.py /usr/bin
55 |
56 | # building the uniprot-to-go SQL database
57 | sudo apt install sqlite3
58 | #ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ - all GO data is here
59 | wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gpa.gz
60 | zcat goa_uniprot_all.gpa.gz | awk -F'\t' '{print $2","$4}' >uniprot-vs-go.csv
61 | sqlite3 uniprot-vs-go-db.sl3
62 | sqlite> create table unitogo (uniprotID text, goTerm text);
63 | sqlite> .mode csv
64 | sqlite> .import uniprot-vs-go.csv unitogo
65 | sqlite> CREATE INDEX unigoindex ON unitogo(uniprotID);
66 | sqlite> VACUUM;
67 | sqlite> .quit
68 | # now move uniprot-vs-go-db.sl3 to /data/uniprot2go/uniprot-vs-go-db.sl3
69 |
70 | # add paths to system-wide bashrc files
71 | # add lines to /etc/bash.bashrc
72 | export PATH="$PATH:/usr/bin/prokka/bin/"
73 | # export PATH="$PATH:/usr/bin/Tablet/"
74 | # export PATH="$PATH:/usr/bin/artemis/"
75 |
76 |
77 | # hiding users and home directories
78 |
79 | sudo chmod 751 /home
80 | sudo rm /usr/bin/who
81 | sudo rm /usr/bin/w
82 | sudo rm /usr/bin/users
83 |
84 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-hacker
--------------------------------------------------------------------------------
/questionbox.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/questionbox.png
--------------------------------------------------------------------------------
/task1/README.md:
--------------------------------------------------------------------------------
1 | # Task1 - Learning the Linux shell
2 |
3 | This task will introduce students to the Linux command-line (shell environment).
4 |
5 | ### Requirements
6 |
7 | * Access to a Linux-based OS running BASH
8 |
9 | ---
10 |
11 |
12 | ## What is the Linux Shell?
13 |
14 | The shell is a command-line interface and programming language for interacting with a [UNIX](https://en.wikipedia.org/wiki/Unix_shell)-like operating system.
15 |
16 | There are several different shell languages. In this course we will use a popular one called [BASH](https://en.wikipedia.org/wiki/Bash_(Unix_shell)).
17 |
18 | We will be learning basic commands, but BASH is actually a language that can perform complex programming tasks.
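
For a taste of BASH as a programming language, here is a minimal sketch (a loop plus a conditional) that you can paste directly into the shell:

```
# loop over three numbers and print a message for each one greater than 1
for n in 1 2 3; do
  if [ "$n" -gt 1 ]; then
    echo "$n is greater than 1"
  fi
done
```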
19 |
20 |
21 | ## Getting Started
22 |
23 | Before you can begin with the coding exercises, you must have access to a linux machine.
24 | You can either use your own local system or a remote server that has been set up for you.
25 |
26 | ### Accessing the remote server
27 |
28 |
29 | You can access the remote server through a terminal (full instructions on LEARN) like this:
30 |
31 | ```
32 | ssh yourUserName@genomics1.private.uwaterloo.ca
33 | ```
34 |
35 |
36 | When you are done, you can leave your session by typing...
37 |
38 | ```
39 | exit
40 | ```
41 |
42 | ## Once you are logged in: Learning the command-line
43 |
44 | If you have logged in correctly, you should see a welcome screen.
45 |
46 | You are now in a unix command-line environment. For more information and instructions:
47 |
48 | * [CodeAcademy](https://www.codecademy.com/learn/learn-the-command-line) - Learn the command line
49 | * [TeachingUnix](https://info-ee.surrey.ac.uk/Teaching/Unix/) - Another Unix Tutorial
50 | * [BasicCommands](http://mally.stanford.edu/~sr/computing/basic-unix.html) - List of common commands
51 |
52 | ## A Linux Shell primer
53 |
54 | ### Navigating files/folders
55 |
56 | When you log in, by default you start in your home directory.
57 |
58 | Type...
59 |
60 | ```
61 | pwd
62 | ```
63 |
64 | And this will print to the screen your current location (e.g., /home/username)
65 |
66 | You can always get back to your home folder by typing
67 |
68 | ```
69 | cd
70 | ```
71 |
72 | To view the contents of your current folder type
73 |
74 | ```
75 | ls
76 | ```
77 |
78 | To make the folder "task1", type
79 |
80 | ```
81 | mkdir task1
82 | ```
83 |
84 | Change directory into 'task1' folder
85 |
86 | ```
87 | cd task1
88 | ```
89 |
90 | And now create a file called file.txt
91 |
92 | ```
93 | >file.txt
94 | ```
95 |
96 | Open the file with `nano`. This is one built-in text editor. There are others.
97 |
98 | ```
99 | nano file.txt
100 | ```
101 |
102 | Now enter a few lines of text, then press Ctrl-O (followed by Enter) to save, and Ctrl-X to exit
103 |
104 | To print to the screen the contents of your file
105 |
106 | ```
107 | cat file.txt
108 | ```
109 |
110 | Other ways of viewing your file
111 |
112 | ```
113 | less file.txt #type q to exit
114 | more file.txt
115 | head file.txt
116 | head -n 10 file.txt # first 10 lines of your file
117 | tail file.txt
118 | tail -n 10 file.txt # last 10 lines of your file
119 | ```
120 |
121 | Size of your file
122 |
123 | ```
124 | du file.txt
125 | du -h file.txt # in human-readable units (K, M, G, etc.)
126 | ```
127 |
128 | To count the number of words and lines in your file
129 |
130 | ```
131 | wc file.txt # lines, words, and bytes
132 | wc -l file.txt # lines
133 | wc -m file.txt # characters
134 | ```
135 |
136 | To copy the file to a new file
137 |
138 | ```
139 | cp file.txt newfile.txt
140 | ```
141 |
142 | You can also 'move' a file to a new location or rename it using `mv`.
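
A small sketch of `mv` in action; the file and folder names below are hypothetical:

```
cp file.txt extra.txt    # make a spare copy to experiment with
mv extra.txt renamed.txt # rename it
mkdir backup
mv renamed.txt backup/   # move it into the 'backup' folder
```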
143 |
144 | To combine both files together into a third file
145 |
146 | ```
147 | cat file.txt newfile.txt > thirdfile.txt
148 | ```
149 |
150 | The '>' redirects the output of the command on its left to the file named on its right.
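
Relatedly, `>>` appends to a file instead of overwriting it; a quick sketch (notes.txt is a hypothetical file):

```
echo "first line" > notes.txt   # '>' creates or overwrites notes.txt
echo "second line" >> notes.txt # '>>' appends to it
cat notes.txt                   # prints both lines
```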
151 |
152 | Delete the file (note: be careful since there is no Trash Bin)
153 |
154 | ```
155 | rm file.txt
156 | ```
157 |
158 | Print contents of all .txt files in current folder. * acts as a wildcard
159 |
160 | ```
161 | cat *.txt
162 | ```
163 |
164 | Delete all .txt files in the current folder
165 |
166 | ```
167 | rm *.txt
168 | ```
169 |
170 | Delete all files in the current folder
171 |
172 | ```
173 | rm *
174 | ```
175 |
176 | Move back to the previous folder
177 |
178 | ```
179 | cd ..
180 | ```
181 |
182 | And delete the folder 'task1'
183 |
184 | ```
185 | rmdir task1
186 | ```
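
Note that `rmdir` only removes empty folders. If 'task1' had still contained files, you would need `rm -r` (treat it with the same caution as `rm`):

```
rm -r task1 # recursively deletes the folder and everything inside it
```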
187 |
188 | ### Getting help on linux commands and program usage
189 |
190 | For most commands, you can get more information on their usage by typing `man <command>`.
191 |
192 | e.g., try
193 |
194 | ```
195 | man ls
196 | #type 'q' to quit
197 | #Note: this line and the line above are interpreted as comments since they start with the "#" character. They will not be executed as a command.
198 | ```
199 |
200 | `man` works for some bioinformatics tools, but not all.
201 |
202 | e.g., for help on the `blastp` tool, type
203 |
204 | ```
205 | blastp -h # or
206 | blastp --help
207 | ```
208 |
209 | ### Additional operations
210 |
211 | #### Pattern finding with grep
212 |
213 | ```
214 | grep "word" file.txt # prints lines in file.txt containing "word"
215 |
216 | grep -o "word" file.txt # prints each occurrence of "word" on its own line
217 |
218 | grep -c "word" file.txt # counts the number of lines containing "word" in file.txt
219 |
220 | ```
221 |
222 | Note: Be careful when using `grep` to analyze files containing nucleic acid or protein sequence data.
223 |
224 | e.g., your FASTA file may be separated into multiple lines like this:
225 |
226 | ```
227 | >myFastaSequence
228 | ATCGACGTTATCGACTAGCTAT
229 | TCGGCGCGGTATTAGCGATTCG
230 | TAATATCGGCGCGATATATCGA
231 | ```
232 |
233 | instead of this:
234 |
235 | ```
236 | >myFastaSequence
237 | ATCGACGTTATCGACTAGCTATTCGGCGCGGTATTAGCGATTCGTAATATCGGCGCGATATATCGA
238 | ```
239 |
240 | Therefore, `grep` may miss some words that span multiple lines.
241 |
242 | Fortunately, there is a useful tool called `compseq` that can be used to examine the [k-mer](https://en.wikipedia.org/wiki/K-mer) composition of a FASTA file.
243 | It can be run like this:
244 |
245 | ```
246 | compseq file.fasta
247 |
248 | #or
249 |
250 | compseq -reverse file.fasta #also counts occurrences on the reverse complement of the sequence
251 |
252 | ```
253 |
254 |
255 |
256 | #### Piping commands
257 |
258 | We can also chain together multiple commands like this using the `|` (pipe) operator.
259 |
260 | ```
261 | grep "word" file.txt | wc -l # will count the number of lines containing the word "word"
262 |
263 | # or alternatively
264 |
265 | cat file.txt | grep "word" | wc -l # does the same thing as above
266 |
267 | # if we want to count ALL the occurrences of "word" in the file (allowing multiple per line), we can do
268 |
269 | grep -o "word" file.txt | wc -l
270 |
271 | ```
272 |
273 | #### Copying a file to and from a remote server
274 |
275 | To copy a file from your local machine to the server (run this on your local machine):
276 | ```
277 | scp /path/to/file.txt yourUserName@genomics1.private.uwaterloo.ca:/path/to/destination/
278 | ```
279 |
280 |
281 | To copy a file from the server to your local machine (run this on your local machine):
282 | ```
283 | scp yourUserName@genomics1.private.uwaterloo.ca:/path/to/file.txt /path/to/location/
284 | ```
285 |
286 |
287 |
291 | Remember, your current path can be found using `pwd`. A useful command for printing out the path to your file is:
292 |
293 | ```
294 | realpath file.txt
295 | ```
296 |
297 |
298 | #### Downloading a file off the internet
299 |
300 | ```
301 | wget <URL>   # e.g., wget https://example.com/file.txt
302 | ```
303 |
304 | #### File compression/uncompression
305 |
306 | This is done using programs such as `tar`, `gzip`, and `gunzip`.
307 | e.g.,
308 | ```
309 | gzip file.txt # to compress it
310 | gunzip file.txt.gz # to uncompress it
311 | ```
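
For whole folders, `tar` can bundle and gzip-compress in one step; a minimal sketch with hypothetical names:

```
tar -czf mydata.tar.gz mydata/ # create (-c) a gzipped (-z) archive file (-f) from the folder 'mydata'
tar -tzf mydata.tar.gz         # list the archive contents without extracting
tar -xzf mydata.tar.gz         # extract (-x) the archive again
```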
312 |
313 | ### More tips
314 |
315 | #### Use tab to autocomplete
316 |
317 | Use tab for autocompletion! This will speed up your command-line work dramatically.
318 | More here: [tab-autocomplete](https://www.howtogeek.com/195207/use-tab-completion-to-type-commands-faster-on-any-operating-system/)
319 |
320 | #### Use Ctrl-C to interrupt or end a process
321 |
322 | If you need to interrupt a command or process that you have started, press Ctrl-C.
323 |
324 | #### Other tips for becoming a linux power user
325 |
326 | [Linux Tips](https://www.howtogeek.com/110150/become-a-linux-terminal-power-user-with-these-8-tricks/)
327 |
328 |
329 | ---
330 |
331 |
332 | # ASSIGNMENT QUESTIONS
333 |
334 | PLEASE COMPLETE ASSIGNMENT 1 ON LEARN.
335 | This can be found under Quizzes.
336 |
337 | You will be asked to answer the following questions.
338 |
339 | Hint: remember to use `man` if you want to explore added functionality of commands. Also, the program `compseq` may be useful to answer some of these questions.
340 |
341 | Download and uncompress this file containing the genome sequence of E. coli H20: https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task1/e-coli-genome.fasta.gz
342 |
343 |  Q1 - What is the size of the uncompressed file in megabytes (round to one decimal place)?
344 |
345 |  Q2 - What is the header (first) line of the file?
346 |
347 |  Q3 - How many characters are in this file (header plus genome)?
348 |
349 |  Q4 - What are the last five bases in the genome?
350 |
351 |  Q5 - What is the length of the genome (# bases)?
352 |
353 |  Q6 - What base (A, C, G, or T) is most common in the file?
354 |
355 |  Q7 - What is the GC content of the genome?
356 |
357 |  Q8 - What is the most common trinucleotide in the file?
358 |
359 |  Q9 - How many times does the word "AATGAGAGG" occur in the genome sequence? Do not use compseq to answer this one.
360 |
361 |  Q10 - What is the answer to the above question if you also include matches on the reverse complement of the genome sequence? Again, do not use compseq to answer this one.
362 |
363 |
364 | #### Congratulations. You are now finished Task 1.
365 |
366 |
--------------------------------------------------------------------------------
/task1/e-coli-genome.fasta.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/e-coli-genome.fasta.gz
--------------------------------------------------------------------------------
/task1/e-coli-h20-genome.fasta.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/e-coli-h20-genome.fasta.gz
--------------------------------------------------------------------------------
/task1/e-coli-k12-genome.fasta.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/e-coli-k12-genome.fasta.gz
--------------------------------------------------------------------------------
/task1/gcloud-download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/gcloud-download.png
--------------------------------------------------------------------------------
/task1b/README.md:
--------------------------------------------------------------------------------
1 | # Task1b - Command-line BLAST
2 |
3 | In this task, you will learn how to run BLAST in the command-line. You will download a genome and proteome and search them for a gene and protein of interest, respectively.
4 |
5 | ### Requirements
6 |
7 | * Access to a Linux-based OS running BASH
8 | * [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
9 |
10 | ---
11 |
12 | ## Getting Started
13 |
14 | Log in to your Linux environment as you did in task1, either through the browser or via `ssh`.
15 |
16 | Create a new project folder for this task
17 |
18 | ```
19 | mkdir blastTask #creates folder
20 | cd blastTask #enters into folder
21 | ```
22 |
23 | ## Retrieving genomic data from the NCBI
24 |
25 | There are several ways to download data. Two common tools are `curl` and `wget`.
26 | You can also simply copy and paste sequence data into a file using `nano`, `pico`, or another command-line text editor; more advanced ones are `vim` and `emacs`.
27 |
28 | The following exercise will download a genome (DNA sequence data) and proteome (translated AA sequence data) from the NCBI.
29 | The NCBI houses its genomic data within an FTP directory - [here](https://tinyurl.com/cvh8n5ce)
30 |
31 | We will be working with the genome of [Prochlorococcus marinus](https://en.wikipedia.org/wiki/Prochlorococcus), which is an abundant marine microbe and possibly the most abundant bacterial genus on earth. First, explore its FTP directory [here](https://tinyurl.com/yc6xv4mh)
32 |
33 | Within this folder, there are a number of files. It is important that you familiarize yourself with these files and their contents.
34 | Files within a GenBank genomic FTP directory include:
35 |
36 | * ...genomic.fna.gz file -> this will uncompress into a .fna (fasta nucleic acid) file
37 | * ...protein.faa.gz -> .faa (fasta amino acid) file
38 |
39 | It should be clear what these two files contain.
40 |
41 | There is also another file called:
42 |
43 | * ...genomic.gff.gz -> .gff file (generic feature format)
44 |
45 |
46 | Download these files, uncompress them, and explore them (with `less` for example).
47 |
48 | ```
49 | wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/925/GCA_000007925.1_ASM792v1/GCA_000007925.1_ASM792v1_genomic.fna.gz
50 |
51 | wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/925/GCA_000007925.1_ASM792v1/GCA_000007925.1_ASM792v1_genomic.gff.gz
52 |
53 | wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/925/GCA_000007925.1_ASM792v1/GCA_000007925.1_ASM792v1_protein.faa.gz
54 |
55 | ```
56 |
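As in Task 1, the downloaded files can then be uncompressed with `gunzip`:

```
gunzip GCA_000007925.1_ASM792v1_genomic.fna.gz
gunzip GCA_000007925.1_ASM792v1_genomic.gff.gz
gunzip GCA_000007925.1_ASM792v1_protein.faa.gz

# or all at once:
# gunzip GCA_000007925.1_ASM792v1_*.gz
```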
57 |  Q1 - Examine the contents of the three files you have just downloaded. What information does each contain?
58 |
59 |
60 | ## Command-line BLAST
61 |
62 | ### Setting up your query sequence(s)
63 |
64 | Next, you are going to do a BLAST search against the genome (.fna) and proteome (.faa) that you have downloaded.
65 |
66 | In this case, the genome and proteome will be set up as BLAST DATABASES that you are searching against.
67 | The BLAST QUERY sequences (a gene and a protein) can be anything you like (examples below).
68 |
69 | * e.g. a query gene sequence: E. coli 16S ribosomal RNA - copy and paste this into a new text file
70 |
71 | ```
72 | >J01859.1 Escherichia coli 16S ribosomal RNA, complete sequence
73 | AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
74 | ```
75 |
76 | * e.g. query protein sequence:
77 |
78 | ```
79 | >sp|B7LA79|RL7_ECO55 50S ribosomal protein L7/L12 OS=Escherichia coli (strain 55989 / EAEC) OX=585055 GN=rplL PE=3 SV=1
80 | MSITKDQIIEAVAAMSVMDVVELISAMEEKFGVSAAAAVAVAAGPVEAAEEKTEFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
81 | ```
82 |
83 | Note: here is a quick way to download the above query protein sequence from Uniprot and rename it in one command
84 |
85 | ```
86 | curl https://rest.uniprot.org/uniprotkb/B7LA79.fasta > e.coli.l7.faa
87 | ```
88 |
89 |
90 | ### Formatting the genome and proteome for BLAST
91 |
92 | When doing a BLAST search, the query can be in FASTA format, but the database must first be formatted for BLAST. This is done with BLAST's `makeblastdb` command, which builds an 'indexed' database: sequences are chopped into their constituent k-mer fragments and stored in a form that facilitates rapid database searching.
93 |
94 | Let's set up a BLAST database for the proteome
95 |
96 | ```
97 | makeblastdb -in GCA_000007925.1_ASM792v1_protein.faa -dbtype 'prot'
98 | ```
99 |
100 | ... and now the genome as well
101 |
102 | ```
103 | makeblastdb -in GCA_000007925.1_ASM792v1_genomic.fna -dbtype 'nucl'
104 | ```
105 |
106 | You can see that the `-dbtype` parameter defines whether the input FASTA file is for protein or nucleotide sequences.
107 |
108 | `makeblastdb` and other BLAST tools have many additional parameters that can be customized to a user's needs.
109 | Let's explore some more.
110 |
111 | ```
112 | # to look at command usage and parameter options
113 | makeblastdb -help
114 | ```
115 |
116 | #### Advanced: retrieving specific entries and regions from your BLAST database
117 |
118 | One of the useful parameters here is the `-parse_seqids` flag. If this option is set, it becomes very easy to retrieve specific sequences from the database using their name or id. e.g.,
119 |
120 | ```
121 | makeblastdb -in GCA_000007925.1_ASM792v1_genomic.fna -dbtype 'nucl' -parse_seqids
122 | makeblastdb -in GCA_000007925.1_ASM792v1_protein.faa -dbtype 'prot' -parse_seqids
123 | ```
124 |
125 | And now if you want to print out to the screen the sequence of protein `AAP99047.1`, you can use the `blastdbcmd` program like this:
126 |
127 | ```
128 | blastdbcmd -entry AAP99047.1 -db GCA_000007925.1_ASM792v1_protein.faa
129 | ```
130 | This will output:
131 | >\>AAP99047.1 DNA polymerase III beta subunit [Prochlorococcus marinus subsp. marinus str. CCMP1375]
132 | MKLVCSQIELNTALQLVSRAVATRPSHPVLANVLLTADAGTGKLSLTGFDLNLGIQTSLSASIESSGAITVPSKLFGEII
133 | SKLSSESSITLSTDDSSEQVNLKSKSGNYQVRAMSADDFPDLPMVENGAFLKVNANSFAVSLKSTLFASSTDEAKQILTG
134 | VNLCFEGNSLKSAATDGHRLAVLDLQNVIASETNPEINNLSEKLEVTLPSRSLRELERFLSGCKSDSEISCFYDQGQFVF
135 | ISSGQIITTRTLDGNYPNYNQLIPDQFSNQLVLDKKYFIAALERIAVLAEQHNNVVKISTNKELQILNISADAQDLGSGS
136 | ESIPIKYDSEDIQIAFNSRYLLEGLKIIETNTILLKFNAPTTPAIFTPNDETNFVYLVMPVQIRS
137 |
138 | If you want a specific region (e.g., the first 10 amino acids) from this entry, you can use the `-range` parameter
139 |
140 | ```
141 | blastdbcmd -entry AAP99047.1 -db GCA_000007925.1_ASM792v1_protein.faa -range 1-10
142 | ```
143 | This will output:
144 | >\>AAP99047.1 DNA polymerase III beta subunit [Prochlorococcus marinus subsp. marinus str. CCMP1375]
145 | MKLVCSQIEL
146 |
147 |
148 | ### Performing a BLAST search
149 |
150 | There are several different flavors of BLAST. Each is run as a separate command:
151 |
152 | * `blastp` - protein query vs protein database
153 | * `blastn` - nucleotide query vs nucleotide database
154 | * `blastx` - nucleotide query (translated) vs protein database
155 | * `tblastn` - protein query vs nucleotide (translated) database
156 | * `tblastx` - nucleotide query (translated) vs nucleotide database (translated)
157 |
158 | To run a `blastp` search using the protein query (defined by `-query` parameter) and protein database (defined by `-db` parameter) you have set up, do the following:
159 |
160 | ```
161 | blastp -query e.coli.l7.faa -db GCA_000007925.1_ASM792v1_protein.faa
162 | ```
163 |
164 |  Q2 - How many significant (E < 0.001) hits did you get?
165 |
166 |  Q3 - What is the sequence identity (percentage) of your top BLAST hit?
167 |
168 | * Repeat the same BLAST search you did for Q2 but using the genomic sequence as the database.
169 |
170 |  Q4 - Compare your result to the previous search. Which of the following statements is most correct:
171 |
172 | ```
173 | * There were no significant BLAST matches.
174 | * BLAST detected the same protein as the top hit. However, the alignment was shorter.
175 | * BLAST detected the same protein as the top hit. However, the alignment was not significant.
176 | * BLAST detected a different protein as the top hit.
177 | ```
178 |
179 |
180 | * Suppose you have sequenced the following fragment of DNA:
181 |
182 | ```
183 | ACTGGCATTGATAGAACAACCATTTATTCGAGATAGTTCAATTACTGTAGAGCAAGTTGTAAAACA
184 | ```
185 |
186 |  Q5 - Search for this fragment of DNA in the genome of Prochlorococcus marinus subsp. marinus str. CCMP1375. What did you find?
187 |
188 | ```
189 | * I found an exact match to the sequence.
190 | * I did not find a good match to the sequence.
191 | * I found a good match to the sequence with 1 mutation.
192 | * I found a good match to the sequence with 2 mutations.
193 |
194 | ```
195 |
196 |  Q6 - What is the likely function of this fragment of DNA? (use any method that you like to answer this question).
197 |
198 | ```
199 | * It is impossible to say.
200 | * It is part of a gene encoding Translation elongation factor Ts.
201 | * It is a segment of a protein.
202 | * It is a non-coding sequence.
203 | ```
204 |
205 |
206 | # ASSIGNMENT QUESTIONS
207 |
208 | * Complete questions 1-6 above and submit your answers on LEARN.
209 |
210 |
211 | #### Congratulations. You are now finished Task 1b.
212 |
213 |
--------------------------------------------------------------------------------
/task2/README.md:
--------------------------------------------------------------------------------
1 | # Task 2 - Genome Assembly
2 |
3 | In this lab, you will download raw sequencing data, perform genome assembly, visualize and analyze your assemblies, and compare the assembled genome sequence to the database using BLAST.
4 |
5 | ### Requirements
6 |
7 | * Access to a linux-based OS running BASH
8 | * [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
9 | * [fastx toolkit](http://hannonlab.cshl.edu/fastx_toolkit/)
10 | * [velvet](https://github.com/dzerbino/velvet/blob/master/Manual.pdf)
11 | * [abyss](https://github.com/bcgsc/abyss)
12 | * [tablet](https://ics.hutton.ac.uk/tablet/) (download this graphical software onto your own machine)
13 | * [bandage](http://rrwick.github.io/Bandage/) (optional; download this graphical software onto your own machine)
14 |
15 |
16 | ## Installation
17 |
18 | Please install the software on your local machine. Once locally installed, you can download results off the linux server and locally visualize them on your own system.
19 |
21 | All of the software is available for Mac/Windows/Linux.
21 |
22 | ---
23 |
24 | ## Getting Started
25 |
26 | Login to your Linux environment as you did in task1, and create a new folder for your task2 work.
27 |
28 | ```
29 | mkdir task2 #creates folder
30 | cd task2 #enters into folder
31 | ```
32 |
33 | ## Retrieving the raw data
34 |
35 | Download the raw sequencing data from https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/mt_reads.fastq.gz into your folder and then uncompress it.
36 |
37 | Explore the file using `less`.
38 |
39 | Next, consider how you would return the sequence of the nth read using only `head` and `tail`, or `grep`. Note that each read consists of a fixed number of lines and that the first line of a read starts with a specific character.
40 |
41 |
42 |  Q1) What is the sequence of the third read in the file? Make sure to remove all spaces.
43 |
44 |
45 |  Q2) How many reads are in the file? Hint: the command `grep "^X"` reports all lines starting with the character `X`.
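One caveat worth knowing before you count: in FASTQ, the quality line can also begin with `@`, so a naive line count can over-count reads. A toy demonstration (file contents fabricated for illustration):

```shell
# second read's quality string happens to start with '@'
printf '@r1\nAAAA\n+\n@III\n@r2\nCCCC\n+\nIIII\n' > toy.fastq

# naive count over-counts
grep -c "^@" toy.fastq             # prints 3, but there are only 2 reads

# counting every 4th line (the header lines) is robust
awk 'NR % 4 == 1' toy.fastq | wc -l
```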
46 |
47 |
48 |
49 | ## Data preprocessing
50 |
51 | Before we can assemble a genome, we need to:
52 |
53 | 1) Assess the quality of the sequencing data
54 | 2) Demultiplex the data
55 | 3) Trim barcodes
56 | 4) Filter out low-quality reads (this is called quality filtering).
57 |
58 | ### Quality assessment
59 | For a quick quality report, you can use the program `fastqc`.
60 | The command below will analyze the mt_reads.fastq file and produce an .html results file.
61 |
62 | ```
63 | fastqc mt_reads.fastq
64 | ```
65 |
66 | Transfer the 'fastqc_report.html' file to your local machine and open it in a web browser. Tip: find the path to your file with `realpath yourFile.txt`
67 |
68 | Explore and inspect the FastQC report for mt_reads.fastq.
69 |
70 |
71 |  Q3) Which of the following statements is correct?
72 | * The reads passed all of the quality control measures
73 | * The reads failed all of the quality control measures
74 | * The reads passed some of the quality control measures but failed others
75 |
76 |
77 |  Q4) The per-base sequence quality is lowest at the \_\_\_\_\_ of the reads.
78 |
79 |  Q5) Most of the reads were assigned a quality (Phred) score of \_\_\_\_\_ .
80 |
81 |  Q6) Examine the per-base sequence content. The base composition is unusual/unexpected for position \_\_\_\_\_ to position \_\_\_\_\_ of the reads.
82 |
83 |  Q7) This unexpected composition may be due to the inclusion of \_\_\_\_\_ .
84 |
85 |
86 |
87 |
88 | ### Splitting the barcodes (demultiplexing)
89 |
90 | Sequencing data may be barcoded. In this case, the data contains two different samples, each with a unique barcode.
91 | This allows us to split the data by sample. Sometimes, sequencing data can have tens or hundreds of barcodes. See [multiplexing](https://www.illumina.com/science/technology/next-generation-sequencing/multiplex-sequencing.html).
92 |
93 | We will use a standard script from the `fastx toolkit` to split the data by its known barcodes (defined already for you in the file downloaded below).
94 |
95 | ```
96 | #first download the barcodes file
97 | wget https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/mt_barcodes.txt
98 |
99 | #now split
100 | fastx_barcode_splitter.pl
101 | ```
102 |
136 | To build the graph, `velvet` requires a value of k to define the k-mers (sequence fragments of length k) to be used in constructing the graph.
137 |
138 | Read more:
139 | [velvet](https://en.wikipedia.org/wiki/Velvet_assembler),
140 | [de bruijn graphs](https://en.wikipedia.org/wiki/De_Bruijn_graph),
141 | [de novo assemblers](https://en.wikipedia.org/wiki/De_novo_sequence_assemblers).
142 |
143 |
144 | The commands below will compute the graph. The first parameter is the folder name (you choose) and the second parameter is the value of k. So below, we are assembling the genome from the trimmed and quality-filtered reads using a k-mer value of 21.
145 |
146 | ```
147 | velveth out_21 21 -short -fastq qual_trim_mt1.fastq
148 | ```
149 |
150 | Next, to compute the actual contig sequences from the graph, run the following:
151 |
152 | ```
153 | velvetg out_21/ -scaffolding no -read_trkg yes -amos_file yes
154 | ```
155 |
156 |  Q10) How many nodes are there in the graph that was produced?
157 |
158 | Inspect the contigs.fa file that has been produced (will be in out_21 folder).
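A quick way to count the contigs in any FASTA file is to count the header lines (each starts with `>`). A toy example with made-up contigs:

```shell
printf '>c1\nACGT\n>c2\nGGCC\n' > toy_contigs.fa
grep -c "^>" toy_contigs.fa        # prints 2
```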
159 |
160 |
161 |  Q11) How many contigs do you get using k=21?
162 |
163 |  Q12) How many contigs do you get using k=31?
164 |
165 |
166 | ## Assembly visualization
167 |
168 | ### Assembly visualization with Tablet
169 |
170 | Velvet has the option of keeping track of where the reads map to the assembly using the `-read_trkg` flag. This will produce a `velvet_asm.afg` file.
171 |
172 | Transfer this file (from your k=31 assembly) to your local machine.
173 | Then open it in `tablet`. Tablet is a great program to explore how reads map to assemblies and genomes.
174 |
175 |
176 |  Q13) What is the average contig length?
177 |
178 |  Q14) What is the N50 value?
179 |
180 |  Q15) Examine the read coverage across the longest contig. Does the coverage distribution match that shown [here](https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/tablet-coverage-plot.png)?
181 |
182 |
183 | Explore `tablet` more on your own. We will be using it later in the course.
184 |
185 | ### Assembly visualization with Bandage (optional)
186 |
187 | Velvet and other de Bruijn assemblers produce a graph that can be visualized. `bandage` is an excellent tool for this purpose.
188 |
189 | If you are interested, locate the `LastGraph` file produced by `velvet` and transfer it to your local machine.
190 |
191 | Open this file in the `Bandage` application, and explore further.
192 |
193 |
194 | ### Generating an improved assembly with ABYSS
195 |
196 | As you can see from the results above, `velvet` (with the parameters we chose) did not yield a high-quality assembly. It is too fragmented.
197 |
198 | Often in genomics it is useful to try multiple parameters and different assemblers. Let's assemble this genome again with a different assembler: the popular [abyss](https://github.com/bcgsc/abyss) assembler, keeping k = 21. See [here](https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats#stats) for information on the statistics reported by an ABySS assembly.
199 |
200 |
201 | ```
202 | abyss-pe k=21 in='qual_trim_mt1.fastq' name=abyss-assembly
203 | ```
204 |
205 |  Q16) How long (# bases) is your assembly?
206 |
207 |  Q17) What is the N50 value (# bases)?
208 |
209 |  Q18) How many contigs did abyss generate?
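As a refresher on what N50 means, here is a pure-shell sketch over a hypothetical list of contig lengths: sort the lengths in descending order, then report the length at which the running sum first reaches half the total assembly size.

```shell
# hypothetical contig lengths, one per line (total = 300)
printf '80\n70\n50\n40\n30\n20\n10\n' | sort -rn | awk '
  { len[NR] = $1; total += $1 }
  END {
    for (i = 1; i <= NR; i++) {
      run += len[i]
      if (2 * run >= total) { print len[i]; exit }
    }
  }'
# prints 70, since 80 + 70 = 150 >= 300/2
```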
210 |
211 |
212 |
213 | ### What is the taxonomic source of your genome? Explore with BLAST
214 |
215 | You still do not know the source of this genome. Is it eukaryotic? bacterial? Is it a nuclear genome, a plasmid, or something else?
216 |
217 | To investigate this question, do a BLAST search using the online [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) tool. Use the Abyss assembly as your query.
218 |
219 |
220 |  Q19) Based on the BLAST result, what is the most likely source of this DNA sequence?
221 |
222 |
223 |
224 | # ASSIGNMENT QUESTIONS
225 |
226 |
227 | Please answer questions 1-19 above on LEARN under Quizzes.
228 |
229 |
230 | #
231 |
232 | Congratulations. You have now completed Task 2.
233 |
--------------------------------------------------------------------------------
/task2/mt_barcodes.txt:
--------------------------------------------------------------------------------
1 | #The barcode assignments in the fastq file: mt_reads.fastq
2 | mt1 ATCTACCA
3 | mt2 AACCATAA
4 |
--------------------------------------------------------------------------------
/task2/mt_reads.fastq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task2/mt_reads.fastq.gz
--------------------------------------------------------------------------------
/task2/tablet-coverage-plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task2/tablet-coverage-plot.png
--------------------------------------------------------------------------------
/task3/16Ssearch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task3/16Ssearch.png
--------------------------------------------------------------------------------
/task3/README.md:
--------------------------------------------------------------------------------
1 | # Task 3 - Genome Annotation
2 |
3 | This task is a tutorial on genome annotation using `prokka` and other tools.
4 |
5 | You will learn how to perform basic genome annotation, and also how to extract specific regions of interest from your genome sequence.
6 |
7 | ### Requirements
8 |
9 | Graphical software you need to download onto your own machine indicated by (*)
10 |
11 | * Access to a linux-based OS running BASH
12 | * [BLAST](http://blast.ncbi.nlm.nih.gov/)
13 | * [Prokka](https://github.com/tseemann/prokka)
14 | * [Artemis](http://sanger-pathogens.github.io/Artemis/Artemis/) *
15 | Note: for Artemis, you may need to first install the JRE (Java Runtime Environment) and/or JDK (Java Development Kit) on your system. Please see [here](installingArtemis.md) for further instructions.
16 | * uniprot2go.py script located [here](https://github.com/doxeylab/learn-genomics-in-linux/blob/master/task3/uniprot2go.py)
17 | -- already installed in /usr/bin on the remote server.
18 |
19 | ## Installation
20 |
21 | Please install the graphical software on your local machine.
22 |
23 | ---
24 |
25 | ## Getting Started
26 |
27 | * Login to your linux environment as you did in task1.
28 |
29 | * Create a new folder for task3.
30 |
31 | ```
32 | mkdir task3 #creates folder
33 | cd task3 #enters into folder
34 | ```
35 |
36 | ## Retrieving the raw data
37 |
38 | * Copy the genome you assembled with `abyss` from task2.
39 |
40 | ```
41 | cp ../task2/abyss-assembly-contigs.fa .
42 | ```
43 |
44 | ## Exploring the assembly using `artemis`
45 |
46 |
47 | Now that we have generated a good quality assembly, let's explore the genome sequence itself and do some very basic annotation using `artemis`.
48 |
49 | Visualize the genome you have produced (using `abyss`) with the `artemis` application. Note: you will need to have `artemis` installed on your local machine.
50 | You will also need to download your contigs to your local machine. Open your contigs.fa file in `artemis`.
51 |
52 | - What are the black vertical lines that appear in the sequence window?
53 | - How do you produce a GC plot of the genome?
54 | - Why would a researcher create a GC plot?
55 | - How do you mark open reading frames (ORFs)?
56 |
57 |
58 |  Q1) How many ORFs are there of length >= 100 amino acids?
59 |
60 |  Q2) How many ORFs are there of length >= 50 amino acids?
61 |
62 | ## Annotating your genome from Task2 using `prokka`
63 |
64 | By marking the ORFs in your genome (given a min size threshold), you have essentially performed a simple gene finding algorithm. However, there are more advanced ways of gene-finding that take additional criteria into account.
65 |
66 | A popular genome annotation tool for prokaryotic genomes is [`prokka`](https://github.com/tseemann/prokka).
67 | `prokka` automates a series of genome annotation tools and is simple to run. It has been installed for you on the server.
68 |
69 | * Run `prokka` using the following command.
70 |
71 | ```
72 | prokka abyss-assembly-contigs.fa
73 | ```
74 | Note: This will generate a folder called PROKKA-XXXXXXXX where XXXXXXXX is the current date. It will be different for you than in the examples below.
75 |
76 | * Now, locate and download the .gbk file that was produced and view it in `artemis`.
77 |
78 |  Q3) You will notice that there are vertical black lines in the middle of predicted ORFs. What do these lines represent?
79 |
80 |  Q4) Re-start `artemis` and change your `artemis` 'Options' to better reflect the source of this genome. Which source did you choose (e.g., "Standard", "Vertebrate Mitochondrial", etc.)?
81 |
82 | When `prokka` is run without any parameters, it selects 'bacteria' as the default taxonomy.
83 |
84 | Look at the `--kingdom` options in `prokka -h` and re-run `prokka` to use the correct annotation mode. You will also need to use `--outdir` to specify a folder for your new results.
85 |
86 | * Again, open your .gbk file in `artemis`.
87 |
88 |  Q5) Has anything changed in this genome annotation? Examine the CDSs, tRNAs, and rRNAs, and their annotations.
89 |
90 |
91 | ## Annotation of an E. coli genome using `prokka`
92 |
93 | Next, let's perform genome annotation on a larger scale.
94 |
95 | * Download (or copy from your task1 folder) the E. coli H20 genome below from task1 and annotate it using `prokka`
96 |
97 | ```
98 | https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task1/e-coli-h20-genome.fasta.gz
99 | ```
100 |
101 | Next, explore the files produced by `prokka`. Start with the .txt file.
102 |
103 |  Q6) How many genes, rRNAs, and tRNAs were predicted? What is the size of the genome in Mb?
104 |
105 | `prokka` also annotates genes based on [COGs](https://www.ncbi.nlm.nih.gov/COG/) and also [E.C.](https://enzyme.expasy.org/) (enzyme commission) numbers. This information can be found in the .tbl and .tsv files.
106 |
107 | Column 6 of this .tsv file lists the COGs. To print out only column 6, you can use the `cut` command as follows (replace "yourPROKKAoutput"):
108 |
109 | ```
110 | cut -f6 yourPROKKAoutput.tsv
111 | ```
112 |
113 | Using commands such as `cut`, `sort`, `grep`, `uniq`, and `wc`, answer the following two questions (Q7 and Q8).
114 |
115 | e.g., the line below will count the number of unique entries in column 3 of file.txt
116 |
117 | ```
118 | cut -f3 file.txt | sort | uniq | wc -l
119 | ```
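One wrinkle: the COG and E.C. columns are empty for many genes, and blank lines would distort your counts. `grep -v "^$"` drops empty lines first. A toy example (fabricated column contents):

```shell
printf 'COG0001\n\nCOG0002\nCOG0001\n\n' > toy_col.txt
grep -v "^$" toy_col.txt | sort | uniq | wc -l    # prints 2 unique non-empty entries
```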
120 |
121 |
122 |  Q7) How many genes were annotated with COGs?
123 |
124 |
125 |  Q8) How many unique enzymatic activities (E.C. numbers) were assigned to the E. coli genome? Note: `1.-.-.-` and `1.1.1.17` would count as two separate E.C. numbers.
126 |
127 |
128 |
129 | ## Assigning GO terms
130 |
131 | Next, we will be assigning Gene Ontology ([GO](http://geneontology.org/)) terms to your predicted genes/proteins from the E. coli H20 genome.
132 |
133 | `prokka` identifies homologs of your proteins within the UniProtKB database. Since there are already pre-computed GO terms for all proteins in UniProtKB, we can map these GO terms over using the following commands:
134 |
135 | ```
136 | #extract the predicted proteins that have been mapped to entries in UniProt
137 | cat yourPROKKAoutput.gff | grep -o "UniProtKB.*;" | awk -F'[:;=]' '{print $4" "$2}' >uniProts.txt
138 |
139 | #assign GO annotations from a uniprot-GO database table
140 | python2.7 /usr/bin/uniprot2go.py -i uniProts.txt -d /fsys1/data/uniprot2go/uniprot-vs-go-db.sl3 >go.annotations
141 | ```
142 |
143 | This will generate a `go.annotations` file, which contains your predicted functional annotations.
144 |
145 | This one-liner will extract column 3 (GO terms), and list the top 20 according to their frequency in your proteome.
146 |
147 | ```
148 | cat go.annotations | awk '{print $3}' | tr "," "\n" | sort | uniq -c | sort -n -r | head -20
149 | ```
150 |
151 | Now, there is a lot you can explore using your predicted GO terms for your genome.
152 | e.g., Suppose you want to find all the predicted DNA binding proteins. Look [here](http://amigo.geneontology.org/amigo) to find the GO accession ID for "DNA binding".
153 |
154 |  Q9) How many proteins were annotated with the GO term for "DNA binding"?
155 |
156 | ## After annotation: Extracting genes and regions of interest
157 |
158 | Once your genome has been assembled and annotated, you may be interested in identifying and extracting specific genes or regions of interest.
159 |
160 | ### Extracting genes of interest
161 | For example, suppose you are interested in the "trpE" gene from E. coli. You can see whether this gene exists in the predictions like this:
162 |
163 | ```
164 | grep "trpE" yourPROKKAoutput.tsv
165 | ```
166 |
167 | This will output
168 | >CGLDHGDC_02664 CDS 1563 trpE 4.1.3.27 COG0147 Anthranilate synthase component 1
169 |
170 | ... which tells you that "trpE" has been assigned to the gene labeled "CGLDHGDC_02664".
171 |
172 | You can then extract this gene sequence from the gene predictions file (.ffn) like this:
173 |
174 | ```
175 | # index the .ffn file so we can extract from it
176 | makeblastdb -in yourPROKKAoutput.ffn -dbtype 'nucl' -parse_seqids
177 |
178 | blastdbcmd -entry CGLDHGDC_02664 -db yourPROKKAoutput.ffn
179 | ```
180 |
181 | This will output:
182 |
183 | >\>CGLDHGDC_02664 Anthranilate synthase component 1
184 | ATGCAAACACAAAAACCGACTCTCGAACTGCTAACCTGCGAAGGCGCTTATCGCGACAATCCCACCGCGCTTTTTCACCA
185 | GTTGTGTGGGGATCGTCCGGCAACGCTGCTGCTGGAATCCGCAGATATCGACAGCAAAGATGATTTAAAAAGCCTGCTGC
186 | TGGTAGACAGTGCGCTGCGCATTACAGCTTTAGGTGACACTGTCACAATCCAGGCACTTTCCGGCAACGGCGAAGCCCTG
187 | CTGACACTACTGGATAACGCCCTGCCTGCGGGTGTGGAAAATGAACAATTACCAAACTGCCGTGTGCTGCGCTTCCCCCC
188 | ...
189 |
190 | Note: in this example we searched the annotations with the text query "trpE". However, the best way to find your gene of interest is a BLAST search, since the gene may not be labeled correctly in your annotations.
191 |
192 | ### Extracting regions of interest
193 |
194 | Next, suppose you are interested in extracting the promoter of the "trp operon". See [here](https://en.wikipedia.org/wiki/Trp_operon) for some background information. The trp operon regulatory sequences (operator and promoter) can be found upstream of the trpE gene. These regions are not in the annotations files so you will need to locate them yourself.
195 |
196 | First, let's see where the trpE gene is located in the genome:
197 |
198 | ```
199 | grep "trpE" yourPROKKAoutput.gff
200 | ```
201 |
202 | This will output:
203 |
204 | >CP069692.1 Prodigal:002006 CDS 2777877 2779439 . + 0 ID=CGLDHGDC_02664;eC_number=4.1.3.27;Name=trpE;db_xref=COG:COG0147;gene=trpE;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P00895;locus_tag=CGLDHGDC_02664;product=Anthranilate synthase component 1
205 |
206 | This tells us that trpE is located in entry "CP069692.1" at chromosome position "2777877 to 2779439" and encoded on the plus (+) strand.
207 |
208 | To extract the sequence for these coordinates, we can use `blastdbcmd` against the genome as follows:
209 |
210 | ```
211 | # index the genome so we can extract regions from it
212 | makeblastdb -in yourPROKKAoutput.fna -dbtype 'nucl' -parse_seqids
213 |
214 | blastdbcmd -entry CP069692.1 -db yourPROKKAoutput.fna -range 2777877-2779439 -strand plus
215 | ```
216 |
217 | This should produce a FASTA sequence output of the gene identical to that in the above example.
218 |
219 | But you are not interested in the gene sequence; you actually want the upstream regulatory region. Suppose you want to identify the 30-nucleotide long region upstream (before but not including the start codon) of the trpE coding sequence. By modifying the code above, answer the following question.
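A hint on the coordinate arithmetic, using a hypothetical plus-strand gene starting at position 1000 (not the real trpE coordinates): the 30 nucleotides immediately upstream span positions start−30 through start−1.

```shell
START=1000    # hypothetical CDS start position on the plus strand
echo "$((START - 30))-$((START - 1))"    # prints 970-999, the range for blastdbcmd -range
```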
220 |
221 |  Q10) What is the 30-nucleotide long sequence immediately upstream of the TrpE coding sequence?
222 |
223 |
224 | ### Extracting the rRNAs predicted by barrnap
225 |
226 | Sometimes you may be interested in extracting multiple genes or regions at once. E.g., suppose you want to extract all of the regions corresponding to predicted 16S rRNA sequences. In `prokka`, rRNA genes are predicted for you using the `barrnap` tool.
227 |
228 | Here is a two-liner to extract the 16S rRNAs predicted by `barrnap`.
229 |
230 | ```
231 | cat yourPROKKAoutput.gff | grep "barrnap" | awk '{ if ($7 == "-") {print $1" "$4"-"$5" minus"} else {print $1" "$4"-"$5" plus"} }' > rRNAs.txt
232 | blastdbcmd -db yourPROKKAoutput.fna -entry_batch rRNAs.txt > rRNAs.fa
233 | ```
234 |
235 | Now, to predict taxonomy, we can BLAST these rRNA sequences against the NCBI nucleotide database, for example using [web-BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn). Note that there may be multiple rRNAs and some of them may be partial sequences.
236 |
237 | 
238 |
239 |
240 | ## Analyzing a mystery genome of unknown source
241 |
242 | And now for something a little more difficult.
243 |
244 | * Download this "mystery" genome of unknown source.
245 |
246 | https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task3/mysteryGenome.fna.gz
247 |
248 |  Q11) Based on 16S rRNA sequences, what is the taxonomic origin of this genome (genus and species)?
249 | e.g., "Escherichia coli"
250 |
251 |
252 | ---
253 |
254 | # ASSIGNMENT QUESTIONS
255 |
256 |
257 | Please answer questions 1-11 above on LEARN under Quizzes.
258 |
259 |
260 | #
261 |
262 | Congratulations. You have now completed Task 3.
263 |
264 |
265 |
--------------------------------------------------------------------------------
/task3/installingArtemis.md:
--------------------------------------------------------------------------------
1 | # Installing Artemis/Java
2 |
3 | Artemis Installation Instructions
4 |
5 | ## Part 1 - Install the Java Development Kit (JDK):
6 | 1. Go to the [Temurin by Adoptium website](https://adoptium.net/en-GB/temurin/releases/?version=11&os=windows&arch=x64&package=jdk) and select your operating system version (most modern systems are 64-bit/x64).
7 | 2. Select the "**JDK 11-LTS**" version from the dropdown menu.
8 | - For Windows: Download the **.msi** file and run it once downloaded.
9 | - For MacOS: Download the **.pkg** file and run it once downloaded.
10 |
11 | ## Part 2 - Install Artemis Tools:
12 | 1. Go to the [Sanger Pathogens website](https://sanger-pathogens.github.io/Artemis/) and scroll down the page until you reach the **"Software Availability"** section.
13 | 2. Under the **"Download"** heading, select your operating system version from the list and download the file.
14 | - For Windows: Download the **.zip** file.
15 | - For MacOS: Download the **.dmg** file.
16 | 3. Unzip the file to an appropriate local directory and an Artemis folder will be created containing the tools.
17 | - For Windows: Recommended to unzip the file to the **"C:\"** drive.
18 | - For MacOS: Recommended to unzip the file to the "**Applications**" folder.
19 | The first time you run one of the tools, it will ask you to "Set Working Directory". Set this to the directory containing the data files you plan to open.
20 |
21 | #### Helpful resource:
22 | [The Artemis Manual](https://sanger-pathogens.github.io/Artemis/Artemis/artemis-manual.html)
23 |
24 | ## Additional Steps For MacOS:
25 | You may run into the error message "This application requires that Java 9 or later be installed on your computer. Please download and install the latest version of Java from www.java.com and try again." when opening any of the tools. If so, please follow the instructions below to open Artemis.
26 | 1. Open the "**Artemis**" folder.
27 | 2. Right click on the "**Artemis**" icon and select "Show Package Content".
28 | 3. Double-click on the "**Contents**" folder.
29 | 4. Double-click on the "art" executable file (it has the black terminal icon).
30 | 5. A new window will open and you will be prompted to set the working directory.
31 | 6. Once confirmed it is working, you can follow the additional steps below to create a shortcut for each of the Artemis tools.
32 | - Go through steps 1-3 again until the "**Contents**" folder is open.
33 | - Right-click on the "art" executable file and from the dropdown menu, click "**Make Alias**".
34 | - A new shortcut called "**art alias**" will be made in the folder. You can drag this file to your Desktop for example for easy access.
35 | - From the Desktop (or wherever you put the file), double-click on "**art alias**" and the program should open without errors.
36 | - You can repeat these steps for ACT, BamView, and Circular-Plot.
37 |
38 |
--------------------------------------------------------------------------------
/task3/mysteryGenome.fna.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task3/mysteryGenome.fna.gz
--------------------------------------------------------------------------------
/task3/ntsearch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task3/ntsearch.png
--------------------------------------------------------------------------------
/task3/uniprot2go.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import pysqlw
4 | import sys, getopt
5 |
6 | sqlfile = ''
7 | uniprotlist = ''
8 |
9 | try:
10 |     opts, args = getopt.getopt(sys.argv[1:],"hi:d:",["uniprotlist=","sqlfile="])
11 | except getopt.GetoptError:
12 |     print 'uniprot2go.py -i <uniprotlist> -d <sqlfile>'
13 |     sys.exit(2)
14 | if len(sys.argv) <= 1:
15 |     print('uniprot2go.py -i <uniprotlist> -d <sqlfile>')
16 |     sys.exit()
17 | for opt, arg in opts:
18 |     if opt == '-h':
19 |         print 'uniprot2go.py -i <uniprotlist> -d <sqlfile>'
20 |         sys.exit()
21 |     elif opt in ("-i", "--uniprotlist"):
22 |         uniprotlist = arg
23 |     elif opt in ("-d", "--sqlfile"):
24 |         sqlfile = arg
25 |
26 |
27 | #print 'DB = ', sqlfile
28 | #print 'UNIPROTLIST = ', uniprotlist
29 |
30 |
31 | p = pysqlw.pysqlw(db_type="sqlite", db_path=sqlfile)
32 |
33 | filepath = uniprotlist
34 |
35 | with open(filepath) as fp:
36 |     line = fp.readline() #first line must be a header line
37 |     while line:
38 |         line = fp.readline()
39 |         line = line.strip()
40 |         splitline = line.split()
41 |         if splitline:
42 |             uni = splitline[1]
43 |             # print(line)
44 |             # print(uni)
45 |             rows = p.where('uniprotid',uni).get('unitogo')
46 |             goTerms = []
47 |             for b in rows:
48 |                 #print(b['goTerm'])
49 |                 goTerms.append(b['goTerm'])
50 |             goTerms = set(goTerms)
51 |             #goTerms.reverse()
52 |             print line, ','.join(goTerms)
53 |
54 | p.close();
55 |
--------------------------------------------------------------------------------
/task4/README.md:
--------------------------------------------------------------------------------
1 | # Task 4 - Synteny comparison of genomes
2 |
3 | This task is a tutorial on structural comparison of genomes using synteny mapping.
4 |
5 | ### Requirements
6 |
7 | * Access to a linux-based OS running BASH
8 | * [BLAST](http://blast.ncbi.nlm.nih.gov/)
9 | * [Artemis](http://sanger-pathogens.github.io/Artemis/Artemis/) * download this graphical software onto your own machine. Again, see [here](installingArtemis.md) for further instructions
10 | * [Mauve](http://darlinglab.org/mauve/download.html) (optional) * download this graphical software onto your own machine
11 |
12 | ## Installation
13 |
14 | Please install the graphical software on your local machine.
15 |
16 | All software used is available for Mac/Windows/Linux.
17 |
18 | ---
19 |
20 | ## Getting Started
21 |
22 | * Login to your linux environment and create a new folder for task4.
23 |
24 | ```
25 | mkdir task4 #creates folder
26 | cd task4 #enters into folder
27 | ```
28 |
29 | ## Retrieving the raw data
30 |
31 | * Copy the genome from task2 you assembled with `abyss`
32 |
33 | ```
34 | cp ../task2/abyss-assembly-contigs.fa .
35 | ```
36 |
37 | * You will be comparing this genome to another related genome from L. terrestris. Download this genome.
38 |
39 | ```
40 | wget https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task4/l-terrestris.genome.fa
41 | ```
42 |
43 | * Make BLAST databases for both.
44 |
45 | ```
46 | makeblastdb -in abyss-assembly-contigs.fa -dbtype nucl
47 | makeblastdb -in l-terrestris.genome.fa -dbtype nucl
48 | ```
49 |
50 | Now BLAST one genome against the other with the following command. Note that you are using BLAST's `-outfmt 6` parameter, which outputs the BLAST result as a table (written here to `blastresults.tab`). You will be using this table to visualize the synteny between these two genomes.
51 |
52 | ```
53 | blastn -outfmt 6 -db abyss-assembly-contigs.fa -query l-terrestris.genome.fa >blastresults.tab
54 | ```
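The 12 tab-separated columns of `-outfmt 6` are: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, and bit score. A fabricated one-line example shows how to pull a column out with `cut`:

```shell
# one made-up -outfmt 6 line (not real results)
printf 'q1\ts1\t98.50\t500\t5\t1\t1\t500\t100\t599\t1e-100\t900\n' > toy.tab
cut -f3 toy.tab    # prints the percent identity, 98.50
```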
55 |
56 | Now, download to your local machine the following files:
57 |
58 | * abyss-assembly-contigs.fa
59 | * l-terrestris.genome.fa
60 | * blastresults.tab
61 |
62 | Open the `act` program that is packaged with `artemis` and input these three files.
63 |
64 |  Q1) Paste a screenshot of your result. (3 marks)
65 |
66 |  Q2) Describe the synteny pattern that you are observing. Do you think genomic rearrangements have taken place or is there a strong pattern of shared synteny between both genomes? (2 marks) See [shared synteny](https://en.wikipedia.org/wiki/Synteny#Shared_synteny).
67 |
68 | To help you with this question, consider two genome sequences composed of four genes A-D. One genome has gene order A,B,C,D and the second genome has gene order A,C,B,D. There has clearly been a genomic rearrangement here because C and B have switched places.
69 |
70 | But now suppose the genomes are (A,B,C,D) and (C,D,A,B). If these are linear chromosomes, then a rearrangement has taken place, but what if they are circular?
71 |
72 | And lastly, suppose we compare (A,B,C,D) to its reverse complement, which will appear in the order (D,C,B,A). This may look like an inversion in `artemis`, but one of the two strands just needs to be flipped so that we are comparing the genomes in the same orientation.
73 |
74 | ---
75 |
76 | ## Working with your own dataset
77 |
78 | Next, find two related genomes (e.g., different strains of same species) from the [NCBI Genome Database](https://www.ncbi.nlm.nih.gov/datasets/genome/).
79 |
80 | * Repeat the analyses above to perform a structural genome comparison.
81 |
82 |  Q3) Paste a screenshot of your result. (3 marks)
83 |
84 |  Q4) Describe the synteny patterns that you are observing. (2 marks)
85 |
86 |
87 | ## Multiple genome alignment with Mauve -- Bonus (+1)
88 |
89 | This is for bonus marks.
90 |
91 | Want to try aligning/comparing more than two genomes?
92 |
93 | * Download/install [Mauve](http://darlinglab.org/mauve/download.html) to your local machine.
94 |
95 | * Select three or more genomes of interest.
96 |
97 | * Open the sequences in `Mauve` and align them.
98 |
99 | * Visualize the multiple alignment.
100 |
101 |  Bonus) Paste a screenshot of your result.
102 |
103 |
104 | ---
105 |
106 |
107 | # ASSIGNMENT QUESTIONS
108 |
109 | The questions for this task are indicated by the lines starting with  above.
110 | Please submit the code you used (when required) as well as the answers to the questions. Submit your assignment to a dropbox on LEARN as a .docx, .txt, or .pdf file.
111 |
112 |
--------------------------------------------------------------------------------
/task4/act.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task4/act.png
--------------------------------------------------------------------------------
/task4/installingArtemis.md:
--------------------------------------------------------------------------------
1 | # Installing Artemis/Java
2 |
3 | Artemis Installation Instructions
4 |
5 | ## Part 1 - Install the Java Development Kit (JDK):
6 | 1. Go to the [Temurin by Adoptium website](https://adoptium.net/en-GB/temurin/releases/?version=11&os=windows&arch=x64&package=jdk) and select your operating system version (most modern systems are 64-bit/x64).
7 | 2. Select the "**JDK 11-LTS**" version from the dropdown menu.
8 | - For Windows: Download the **.msi** file and run it once downloaded.
9 | - For MacOS: Download the **.pkg** file and run it once downloaded.
10 |
11 | ## Part 2 - Install Artemis Tools:
12 | 1. Go to the [Sanger Pathogens website](https://sanger-pathogens.github.io/Artemis/) and scroll down the page until you reach the **"Software Availability"** section.
13 | 2. Under the **"Download"** heading, select your operating system version from the list and download the file.
14 | - For Windows: Download the **.zip** file.
15 | - For MacOS: Download the **.dmg** file.
16 | 3. Unzip the file to an appropriate local directory and an Artemis folder will be created containing the tools.
17 | - For Windows: Recommended to unzip the file to the **"C:\"** drive.
18 | - For MacOS: Recommended to unzip the file to the "**Applications**" folder.
19 | The first time you run one of the tools, it will ask you to "Set Working Directory". Ask your instructor what to set this to; the working directory should be the one that contains the data files you will be analyzing.
20 |
21 | #### Helpful resource:
22 | [The Artemis Manual](https://sanger-pathogens.github.io/Artemis/Artemis/artemis-manual.html)
23 |
24 | ## Additional Steps For MacOS:
25 | You may run into the error message "This application requires that Java 9 or later be installed on your computer. Please download and install the latest version of Java from www.java.com and try again." when opening any of the tools. If so, please follow the instructions below to open Artemis.
26 | 1. Open the "**Artemis**" folder.
27 | 2. Right click on the "**Artemis**" icon and select "Show Package Content".
28 | 3. Double-click on the "**Contents**" folder.
29 | 4. Double-click on the "art" executable file (it has the black terminal icon).
30 | 5. A new window will open and you will be prompted to set the working directory.
31 | 6. Once confirmed it is working, you can follow the additional steps below to create a shortcut for each of the Artemis tools.
32 | - Go through steps 1-3 again until the "**Contents**" folder is open.
33 | - Right-click on the "art" executable file and from the dropdown menu, click "**Make Alias**".
34 | - A new shortcut called "**art alias**" will be made in the folder. You can drag this file to your Desktop, for example, for easy access.
35 | - From the Desktop (or wherever you put the file), double-click on "**art alias**" and the program should open without errors.
36 | - You can repeat these steps for ACT, BamView, and Circular-Plot.
37 |
38 |
--------------------------------------------------------------------------------
/task4/l-terrestris.genome.fa:
--------------------------------------------------------------------------------
1 | >U24570.1 Lumbricus terrestris mitochondrion, complete genome
2 | ATGCGATGATTCTACTCAACTAATCACAAAGATATTGGAACTTTATACTTCATTCTTGGGGTATGGGCTG
3 | GCATGGTGGGAGCCGGAATAAGACTTCTTATCCGTATTGAGCTAAGACAACCTGGTGCATTCCTAGGAAG
4 | TGACCAATTATACAATACAATCGTTACTGCGCACNNNTTTGTTATAATTTTCTTCCTAGTGATACCAGTC
5 | TTCATTGGCGGGTTTGGGAACTGACTTCTTCCCCTAATACTGGGCGCTCCTGATATAGCATTCCCACGCC
6 | TTAATAACATAAGATTTTGACTTCTACCCCCCTCTCTTATTCTCCTAGTTTCCTCAGCTGCCGTAGAGAA
7 | GGGAGCCGGAACAGGCTGAACAGTGTACCCCCCTCTTGCCAGAAATCTCGCCCATGCTGGGCCATCTGTA
8 | GATTTAGCTATTTTTTCCCTTCATTTAGCAGGTGCGTCATCTATTCTAGGGGCTATTAATTTTATTACCA
9 | CTGTAATCAACATACGCTGAAGTGGGTTACGACTAGAACGAATCCCTCTGTTTGTCTGAGCTGTATTAAT
10 | TACAGTAGTTCTCCTCCTCCTATCCCTTCCTGTACTTGCCGGAGCAATCACAATACTCCTAACAGATCGA
11 | AATCTTAATACCTCATTTTTCGACCCCGCTGGTGGAGGGGATCCAATTTTATATCAACACCTTTTCTGAT
12 | TCTTTGGTCACCCAGAAGTATATATTCTTATTCTTCCTGGGTTTGGGGCCATTTCCCACATTGTTAGACA
13 | CTATACAGCTAAACTTGAGCCATTTGGAGCCTTAGGGATAATTTATGCAATACTAGGAATCGCAGTTTTA
14 | GGATTTATTGTCTGAGCACACCACATATTTACCGTTGGCTTAGATGTGGACACCCGGGCATATTTCACAG
15 | CAGTAACCATAATTATCGCAGTACCGACAGGTATTAAAGTATTTAGTTGATTAGCTACCATTCACGGGTC
16 | AAAGATCAAATATGAAACACCAGTGTTATGGGCCTTAGGATTTATCTTTTTATTTACAACGGGAGGTTTA
17 | ACTGGAATTATTTTATCTAACTCCTCCCTAGATATTATTCTTCATGACACATACTATGTAGTAGCACACT
18 | TCCACTACGTGTTGAGAATGGGCGCCGTATTTGCAATCTTTGCTGCCTTTACTCATTGATTCCCCCTACT
19 | AACAGGGCTTACCCTACACCACCGATGAGCCAATGCACAATTCTTCCTCATATTCTTAGGGGTAAACACT
20 | ACATTCTTTCCCCAACACTTCCTAGGATTGAGCGGTATACCTCGGCGATATTCTGACTACCCTGATGCTT
21 | TTATAAAATGAAACGTCGTGTCATCATTTGGGTCTCTCCTGTCCTTTGTAGCATTAATACTGTTTATTTT
22 | TATTTTATGGGAAGCCTTCGCCTCACAGCGAAGAGTAATCTCAAGGCCCCACATATCATCAGCTCTTGAA
23 | TGGTCTGACCCTATTCTACCTCTAGATTTTCATAATTTAAGGGAGACCGGAATCATTACATACCCTAAAT
24 | TAAGTGAATGCCAAAGGCACCTATTTGTTAATTAGGCCCGTGCTATTTATAGCTCACTTAGAATGCCAAA
25 | CTGAGGTCAAGTAATATTTCAAGACGCCGCATCATCTGTCATGCTCCAACTAGTATCCTTTCACGACCAT
26 | GCTTTATTAGTCCTAACTCTAGTCCTAACAGTGGTCGGCTATGCTCTCCTAGCACTCATATTAAACAAAC
27 | AAGTAAACCGTTACATTATAGAAGCTCAAACAGTAGAAACAATCTGAACTATTTTACCAGCTCTTATTCT
28 | CCTAGTTCTAGCCCTACCATCTTTACGCATTCTTTATATTACAGACGAGGTGAGACAACCATCTATTACT
29 | GTGAAGACTATTGGGCATCAATGATATTGAAGATACGAATATACTGATTTCTTAAATGTAGAAATAGATT
30 | CATATATGCTACCAACCTCAGACCTACTACCGGGGGACTATCGACTCCTAGAAGTAGATAATCGTATGGT
31 | AGTACCTATACAATTAGAAATTCGAATACTAATCACTGCTGCAGATGTGATTCACTCATGAACAGTTCCA
32 | GCTCTCGGGGTAAAAGTGGATGCCGTGCCTGGACGTCTAAACCAAATTGGATTTACAACTACACAACCAG
33 | GGGTATTTTATGGTCAGTGCTCAGAAATCTGCGGTGCTAATCACTCATTTATGCCAATTGCAGTGGAAGC
34 | TATTAACACTAAATCCTTCATAAGATGAGTCTCCAATTTTAAACCTTAGAAATACTAGTTAATCTATAAC
35 | AATGCCTTGTCAAGACATAATTACCTCTGGTGTATTTCTATGCCTCACCTATCTCCCATAAGATGAATTA
36 | CTTCTATATTAATATTCTGAATTTCCGTATCAATTCTTTTTTCCACCCTATGATGATCCAACAATTATTT
37 | ATTCAGTTCAAAAATAACTAATTGCGCCCCCAAATCCCTTACACCTTGAAATTGATTATTACGAGATGGT
38 | CGAGATTAAACATTAAACTGTAAATTTAATAACGGGCTACCACCCTCTTGTATGTTTTTAGTATATTTTG
39 | TACATTTACCTTCCAAGTAAAAAGATTGTTAGTAAAAAACATAAATGATTCGTCAACCCTTCCACCTCGT
40 | TGAGTACAGCCCATGGCCTCTAACCTCATCTATTGGAGCTTTTACCCTAGCTATCGGATTAGCTAGGTGG
41 | TTTCATAACCATGGATTCTTATGCCTAACCCTAGCCGCTTTTCTTATCATCGTTTCCATAATTCAGTGAT
42 | GGCGAGATGTCGTGCGAGAGGGCACATACATGGGTCATCATACCAGCTTAGTAACTACCGGCTTACGTTG
43 | GGGTATAATTTTATTTATTACTTCAGAAGTGATATTTTTCCTCGCCTTCTTTTGAGCCTTTTTCCACAGA
44 | AGCTTATCCCCCACACCAGAAATTGGCTGTTCCTGACCTCCAACAGGAATCCACCCCTTAAACCCATTCA
45 | GAGTCCCCCTGCTAAACACTGCTGTTCTTCTAGCCTCAGGGGTTACAGTAACCTGAGCTCATCACAGACT
46 | AATAAGAGGTAAGCGTATTGATGCCACTCAAGCACTAATTCTAACTGTCTGCTTAGGTGCCTACTTCACT
47 | TTCCTCCAAGCTGGCGAATACATAGCCGCCCCATTTTCTATTGCTGATAGGGTGTATGGCACTACATTCT
48 | TTGTGGCAACTGGATTTCATGGGCTTCACGTTCTAATCGGGTCATCTTTCCTAGCTATTTGCTTAGCGCG
49 | TACATGGTCCCACCACTTCTCTGCTGGGCATCACTTTGGATTCGAAGCCGCTGCCTGATACTGACACTTT
50 | GTAGATGTGGTGTGAATCTGCCTATATCTATGTATTTACTGATGAGGCTCCTATATAACTTAGTGTATGG
51 | TGCACGAAGGTTTTTGAAACCTAAAGCCTAAGTTCAAATCTTACAGTTATAAATGATTTTAACCTCTTTC
52 | ATATTAATGATAATCGCCACAACTTTTACCCTATATCTAGCTTCCACCCCTATTGTCTTAGGTGTAAATA
53 | TTCTCATAATAGCCCTACTCTTAGCCTCCACATTTGCGTCCTTTATAAGCTCCTGATTTGCATTTTTAAT
54 | TTTTCTAATCTACATCGGCGGCATGTTAGTCATATTTGCCTACTTTCTAGCTTTAACCCCCAACCAACAA
55 | ATTTCAAACTTCAATATTATACCATATGCTCTAATCACATTATTAACATTTTCGGCACTAACATACACCA
56 | CCAACATTAAAATCCCCACTTTTTCTGATATTAGTCAAGGAAACTCAATTTTGTATATATCCAGAACTGC
57 | ACCATTCCTCATCCTTCTCGCCCTAATCCTCCTCCTTACGATAGTTATTGTAGTAAAATTAACCAGACGG
58 | TCAAGGGGCCCTCTCCGCCCATTCTCCCCATATGTTCAAACCTATCCGAACCACACACCCGGCAATTAAA
59 | ATTATTAATAGTACCCTAATTGATCTTCCCGCGCCTAATAATATCTCCATTTGATGAAACTACGGATCAC
60 | TTCTGGGCCTATGCCTTGTAATCCAAGTTTTAACAGGCCTATTTCTAAGAATACACTACGTACCCAACAT
61 | TGAAATAGCTTTCTCATCAGTAGCCCTCATTTCTCGAGACGTGAACTACGGCTGACTTCTTCGGTCTATT
62 | CACGCTAATGGAGCATCTATATTTTTTCTATTTATCTATCTCCATGCGGGCCGAGGTCTATATTATGGCT
63 | CGTATAACCTCAGTGAAACTTGAAATATTGGGGTAATTTTATTTCTTCTCACCATAGCCACTGCATTCAT
64 | AGGTTATGTTCTGCCCTGGGGACAAATGAGATTCTGGGGAGCGACAGTAATTACTAACTTATTCTCAGCA
65 | ATTCCCTACATCGGGAAAACTTTGGTAGAGTGAATTTGAGGTGGGTTTGCAGTAGATAACGCTACCCTAA
66 | ACCGATTTTTCGCATTCCATTTCATTCTCCCGTTTGCTATTATAGGGGCGACTATCCTACACATCATATT
67 | TCTTCACGAGTCAGGATCTAACAACCCCATTGGTCTAAATGCAGACTCCGACCGAATCCCGTTTCATCCC
68 | TATTATTCTATTAAAGACACCCTCGGTTATACGTTAGCAATTTCAGCTCTATCTTTAATAGTTTTATTCG
69 | AGCCTAACTTATTTACCGACCCTGAGAACTTTTTAATAGCAAACCCTCTTGTAACACCTATTCATATTAA
70 | ACCTGAATGATATTTTCTATGAATATATGCCATTCTACGTTCAATTCCTAATAAGCTAGGGGGGGTGATA
71 | GCACTATTCGCAGCCATCGTTATTCTATTTATTCCACCGCTAACAAGTGTCATAAATAAGCGGAGCCTCT
72 | CATTTTACCCCCTAAATAAGACAATATTCTGAGGCCTTGTAGCATCCTGAGCTATTCTTACATGAATTGG
73 | GGGTCGACCTGTAGAAGACCCCTTTATCATCATCGGGCAAGTATTTACATCCTTGTACTTTATCTACTTT
74 | ATTTCCAGTCCTACCATCTCTAAACTCTGGGACGATTCTATTATTATCTCCTTAGAAAACACGTACCAAC
75 | TCAAAAAGATATACCTCCCCGAATATAAATTATAAGCCCTTAACAAAGACTTTAAGTTAAACAAACTAAA
76 | AACCTTCAAAGTTTTCATTGGGAGTATCCAAGTCTTGCCATGATACCAGATATTTTTTCCTCCTTTGACC
77 | CCTATATATTTAATACCCTGTTCCCACTCAACTCTCTATTCTTAGTAACAAACACAGCTATCATTCTGAT
78 | AATTCAGTCGTCATTTTGAGTTTTAAACGCTCGAACCTCAGCATTTAAGTCTCCAGTCAATGATACAATT
79 | TTCACTCAACTATCCCGCACATCTACCACACACCTCAAAGGTCTATCAACCCCATTATCCACCATCTTCT
80 | TTATACTAGTAATAATCAATCTCATGGGATTAATTCCATACATGTTTAGAACATCTAGACACCTAGTATT
81 | CACCCTTTCCCTAGGGTTCCCCATCTGACTAAGTCTTATAATCTCTACGTTCGCTCACAGCCCCAAAAAG
82 | AGAACAGCTCACTTTCTCCCTGACGGGGCCCCAGACTGATTAAACCCGTTCTTAGTTCTAATCGAAACAA
83 | CTAGAGTTTTCGTCCGACCTCTAACACTATCTTTCCGTTTAGCCGCTAACATAAGAGCAGGGCACATCGT
84 | CTTAAGACTTATAGGAATCTACTGTGCCGCCGCATGATTTTCAAGTGTTTCAAGAACAGCACTCCTAATC
85 | TTAACTGCCATCGGATATATTCTATTTGAAGTAGCAATTTGTTTAATTCAAGCTTATATTTTCTGCCTAC
86 | TCCTATCCCTATACTCAGATGATCACGCCCATTAAAACAATAAGCATTAAAAAAAATGCGCGCCGATTTC
87 | GACTCGGCGAGAGCACAAAGCATTGTTTTTTTACTTAGTTTATACTATACTCTATATATATATACGCATT
88 | TGTGTACTCTGATTGGGGGGGGGGGGTAATTTCACAAAAAGCTATAATCCGAAAAGGCCCGACCGGGCGA
89 | GAAAAAAAAAAAAAAAAAAAGAAAAAGTGGTGTTTTTAGGTTCTAATCCTTTAGAATGATGCCAATTTCG
90 | GAAAAACTCGACAGGGACTTTTTAAATTTCGGTCCTTGCTAATATGGGCACGACGTATATTTGCGGTATT
91 | TACATAAGAAACGGCCTGTATCGAGCAAAATTTACAGTCTGTCGGGGGAAAAAATTTTAACCTAAAAAAT
92 | TGTTCGGCGTGGGGCCTTTTTTTTTTCAGTTTTTAACATTAAAAATTTTCTCGGAGTTCTAATCATAAAG
93 | GTAGGTTACAAAAACCCCCGAATTGTGGTTCCGGAAACGTCAAAAGACCCTTTTTCATGCTTCGGATATT
94 | TAATACTAGACTTGTGGCCAGTAAACTAATATGGGTTATCTTTACTGGGATGCTGGCGCCCACCCTATAC
95 | ATAGTGCACTGTAATTCCACCTTACTTCTAGAGTGAAATCTTTTCTCTATTTCCTCTACCCCCATAATAA
96 | TAACTATTATCCTAGACCCCCTGGGACTGATATTTTCTTGCACCGTAGTAATAATTTCAGCCAATATTCT
97 | AAAATTCTCAACTATCTATATGAAGGAAGATAAATTTATCAACCGTTTTACAGTCCTAGTGCTGCTCTTT
98 | GTCTTATCTATAAACATACTAATCTTCTTTCCCCACTTAATTATCCTACTACTTGGTTGAGACGGCTTGG
99 | GAATTGTATCCTTTATCCTAGTCATTTACTACCAAAATCCAAAATCTTTGGCAGCTGGTATAATCACAGC
100 | TCTCACTAATCGTATTGGGGATGTTATACTCCTCTTGGCTATCGCGTGAACTCTAAACCAGGGTCACTGA
101 | AATATTTTACATATGTGGGCNGTCGACGAAAACATATATCAGGCATTAGTTATCATTATCGCAGCTATAA
102 | CTAAAAGAGCCCAGATGCCGTTTTCCAGGTGGCTCCCAGCAGCTATAGCTGCACCTACCCCAGTCTCAGC
103 | CTTAGTGCACTCATCAACCTTAGTTACCGCCGGAGTATTCTTATTAATCCGATTTTATAACTTTCTATCT
104 | TCTGTGTGATGATTCACTACCTTTCTACTTTTTGTAGCTGTTAGTACTACTTTAATAGCCGGGTTGAGAG
105 | CCTCTTCTGAATGCGACATAAAAAAAATTATTGCTTTGTCAACCCTTAGACAACTTGGAATAATGATAGC
106 | TGCTATAGGCTTAGGGATGGCCCATATAGCCTTTTTCCATATAGTAACCCACGCTATATTTAAGGCTCTT
107 | CTCTTTGTGTGCGCCGGAAGATTTATTCACAGACATATGCACAGTCAAGATCTTCGTTGAATAGGTAATC
108 | TCACTAAACAAATACCTACTACCACCTCATGCTTAATTATAGCAAATCTAGCTCTTTGTGGGTTCCCCTT
109 | TATGTCAGGTTTTTACTCTAAGGATATAATTGTGGAAGCTTCGCTCTACTACCCCCATAACTCACTTATA
110 | ATTAATCTAATCTTATTTGCAGTCGGTTTAACTGCATTCTACTCAACTCGATTTACCATGTGCGTAGTCC
111 | TTTCTCCCAATAACTGTGGTCCTTATATACATTTGGAGGAGAGCAACTCCCTCACATCTCCTATACTGCT
112 | TCTAGCTTCAATATCAGTTATTTCCGGGTCAGCTCTTACATGAATTCTGCCGTTAAAACAAGAAATAATG
113 | ATAATCCCCCTTGACCAAAAGCTTAAAACCTTAATATTAGTCACTTTGGGTGCACTTATATCCTGGTTCT
114 | TTCTAACAACGACAAATATAACTAAAACATGCCTATACATTCGTCACCCAATTATTAACTACTTCTCATG
115 | CACTATGTGGTTTCTAGTCCCCCTTTCATCTCAATTTATAATAAAACTCCCAATATATGTATCACACAAC
116 | TACTTAAAACTGACCGATCAGTCATGGTTGGAGTTACTCGGGGGGCAAGGTATTAATAACGTATCAAGTA
117 | AAGCCTCCAATATCTATCTGGCATCCTTAAAATCTACACCTATGAACTACCTAATAATGTCGTCGATACT
118 | ACTACTAGTCGCCACCTTAGTCGCAATTTAGTCTAGATAGCTTAAAATAAAGCATGTTATTGAAGATAGC
119 | AATATGGGAGTTCCTCCGGACAGTGTATGTGGTGTAAGTCAACACATTAGCTTTTCATGCTAATAATATA
120 | CATCCGTATTATACACAGATAGTAGTTTAAGTAAAACTCTGACCTTGGGTGTCAAAAATCACTTCGGTGT
121 | TATCTGAGAATTGAAAGCTAATTAACAGCATCGATCTTGTAAATCGAAGATAGAGACTACCTCTCAATTC
122 | TATGTACAGTTCTATTATAAGTTTAGTCTTTCTCCTTCCAATCGTCGCTGTTGTAAATCTAATCTCAAAT
123 | CAATCACACTTTTTAATAACTCTTCTATCACTTGAAGGTATCACACTGAGACTGGTTCTATTTGTTCCAA
124 | TCTCTCTCTCTATTATAAGTGCCTCTAATGTTAGAATTAGGGTCATTTTATTGACTTTCGGGGCATGTGA
125 | GGCCAGCTTAGGACTAAGCCTCATGGTATTAATATCCCGATCCTACGGAACTGATATATTAAACTCACTT
126 | ACAGCAAATAAATGTTAAAGCTCCAAATAGTATTAATATCTCTGCTCCTACTCCCACTCATTGTAAATCT
127 | GTACCCCTGAATTATCGCTCTGACTCTTAGAGCTTTATTACTACCCACCTGTTTCAATTTGGTAAACAGA
128 | GCATCCTACTCAATATTTACAGAATACATATCCTCTGATATAATGTCATTTACACTCTCCGCTCTAACAA
129 | TCTGAGTCACTGTAATAATAATCCTCGCAAGAACTAAAATTATGCATTTAAATATGTACCCCAAAATATT
130 | TATGACAAACTTAGTTATTTTGCTAATTATTCTAATTAATTGCTTCTTATCCCCCAATCTAATTATATTT
131 | TATATTTGATTTGAAGCATCCTTAATTCCAACTATAGTGCTAATCATGACTTGGGGCTATCAGCCAGAAC
132 | GATCTCAAGCAAGAATATATTTAATAATCTATACAGTCGCTGCCTCCCTCCCAATGCTTATAGTGCTATG
133 | TAAAATTTTTATCGTGTCCAAAACAGCTATGATACCCATATTCATAAACATGGAGTTCCCTATAGACTAC
134 | CCATCTATGGCCCTAGCCTGAGTATTAACACTGGGGGGCTTTCTAGTAAAACTCCCTATATTTACAGTGC
135 | ACCTCTGACTTCCTAAAGCACACGTAGAAGCCCCAATCGCAGGGTCTATAATTTTAGCTGCAATTCTTCT
136 | AAAACTTGGGGGTTACGGCATTCTTCGCATACTAAGATTATTTCACTATATAGCTAAATCAACCTCAAGA
137 | CTTCTTTCTAGGGTAGCTTTAGTCGGGGCAGTCTCAACAAGATTAATCTGTCTCCGCCAATCAGACCTAA
138 | AATCCCTAATTGCTTACTCATCTGTTGGACATATGGGTCTAATAGTCGCGGGCGCTTTAATAAGCTCTAA
139 | TTGGGGGTTCCAAGCAGCTCTAGCTATAATAATTGCCCATGGGCTGAGCTCATCCGCCCTATTTGTAATA
140 | GCAAATATAAACTATGAATTAACCCATACTCGAAGCCTATTCTTAATAAAGGGCTTGTTAGTTTTAGCAC
141 | CGACGCTCACTATATGGTGATTCCTGTTTACAGCTAGAAATATAGCAGCCCCCCCTTCCATTAACCTACT
142 | CAGAGAGATTATGTTAATTACATCTATTTTAAAAATATCTACTTCAGCTTTTATTCTTCTAGGTCTAACA
143 | AGATTCTTTACAGCTGCTTATTGTTTGTACATGTATACTTCTATACACCACGGACCCTTAATACTAACCT
144 | CTAACCCAATCCCTCAATTCAAAGTAAAAGACCTAACTCTTATAACTATACACTTAGTTCCTACAATTCT
145 | TATTATCTTTAAGCCTGAACTAATCACAAGATGGTCCTGATGGCATAGTTAAACAATAACATTAAATTGC
146 | AAATTTGATATTATACTAATAGTATTACCATCTAGTAAGATAAGCTATTCAAGCTAGTGGGTTCATACCC
147 | CGAAAAAGAGATACTCTCTCTTACTATCAGTTTGATTCTGGCTGACAATTTAGCTCTCTTAAACTAATAT
148 | ACATGTAAGTCTAAGCTACCCCGCACCACATATAAACCCTAAATGGAGAATAACTATATTAGACATATCT
149 | CATATGGTAAAGCGTCACAGTAAGAAAATCTACCTTATTGCAAAAGCAAGAGCTGGTTATTAAGATCAGA
150 | GTTGGCAATATTCGTGCCAGCTGCCGCGGTTAGACGATAAACTCAAGCTAATTCATATAAGACTAATTGC
151 | AAGGCGATCTAAAAAATAACTCAAAGTCTAATTTATATAATCCGAGACCCGTAAACGCCTATTTACCGTA
152 | AAACCATAGACTAAAACACGGATTAGATACCCGTCTATTTATGGAGTAACTAAAGTCGAAAAATACGAAC
153 | TACAGTTTAAAACTTAAAGATTTTGGCGGTGTCTTATCAACCCAGGGGAACCTGTCTCATAACTCGATAA
154 | CCCACGACACTCTCACCCTCCCTAGACTAAACAGCTTGTGTACTGCCGTCGTAAGCACACCTCTAAAAGC
155 | CAAGGAAGTGTGCAATAATGATTGTCTCACCCACGTCAGGTCAAAGTGCAGCCCATGGACGGAGATGATG
156 | GGTTACACCTAAACAAAGATACGGAATACAGCATTAAGAGCTGTGTAAAGGAGGACTTGAGTGTAACAGT
157 | ATTACAAAATTAAAGTGAATCTGAATCTAAGACATGCACACATCGCCCGTCACTCTCGCCTAAAGGCGAG
158 | ATAAGTCGTAACAAAGTAGGTGTAACGGAAGTTGCCCCTGTCGAAGTATAGCATATATAATGCCTTTTAC
159 | TTACACTAAAAATAAAACATTTGTTTACTTCGCTGCCTATGTTTATCTTATAAATAAAAACTAATAAAAA
160 | CACTTAAACTGATAATTTCATAATAAATCTTTACAATAGTACTATAGAGGAAGTAGTCAACATAATAAAG
161 | TAGTGGTTTATACACGTACCTTGTGCATCATGGTTTTACAAGCCTCAAATTAATAATATTACCCGAATTC
162 | TAAGCGAGCTGTCCCTTCATAGCTAAGAGCCCACCACTAGTTGTAGCATCAACTTTGGAAAATGGGGGGA
163 | TAGGGGCTACATACCAATCGCGCTAGAAAATCTCTGGTTTTCAGTAAAATTTACCAATAAAACATGTAGC
164 | GTCCACCTACACTGCAAGACTACAGAGGATAAGCCCTGTATTCAAAAACTAGATATGCCCTCCTTCCAGT
165 | ATAGGCCTAAAAACAGCCACTAATAGTACCTCACCGTAAACACCATTAAATTAATTCTATACCCTGTTCA
166 | AATAAACTGAAATTTTTGACAAACCTTAAATCTTAAAATATTATGTAAAAATAAGTATTAATTTTATAAC
167 | CTAAGTTACAGCTACCATGTATGTATTATATATTTACAACTTATAAAGGAACTTGGCAAATTCTTATTTC
168 | GACTGTTTAACAAAAACATTGCTCTCAGTAACCTTAATTAAGAGTAACTCCTGCCCAGTGAGTAATTCAA
169 | CGGCCGCGGTATCCTAACCGTGCAAAGGTAGCATAATCACTTGCCCATTAATTGTGGGCTAGAATGAAGG
170 | ATAAACGAAATAAATACTGTCTCTATAAGCCGCTTAAAAATACCCTCTAACCGAAGAGTGTTAGATAGCG
171 | TCGAAGGACAAGAAGACCCTATAGAGCTTAATTTAAATAAATATGAAAAAATTTACTAAAATTCGGTTGG
172 | GGCGACCAGGGAATTACCCATCATCCCTAAACAAAAGATAAATGTATCAAAACACTGACCCTTCTACAAG
173 | ATCATTAAAACAAGCTACCTTAGGGATAACAGGCTAATCTCACTAGAGAGTCCTTATCAATAGTGAGGAT
174 | TGGCCCCTCGATGTTGGCTTAGGGAATCTCTATGACGCAAAAGTCATATAAAGATGGTTTGTTCAACCAA
175 | TAACACCCTACATGAGCTGAGTTCAGACCGCGTAAGCCAGGTTAGTTTCTATCCTCGATCACTTTATCTA
176 | TTTATAGTACGAAAGGACCTAATTAGAGTAATATTTACACACAGGGAATAAATATAAACCATTCTAAGTA
177 | AAATAAATCATAAACTATGAGTAAGTTGGCAGAATAGTGCGACCGACTTAGGATCGGTTCATGGGTAAGC
178 | CCACCTACTATGCAACTTAGTTCATTTAGAATAACCAACCTGCACTTGGTAGGAGAGATATCTCAGTAGC
179 | GGTTTGATGTTTCGGAAATACTGGACCTTGAACGTCCACTTAAGGTGTTCGACTCACCTTCAAATCACCA
180 | AGATGGCAGAATAGTGCCATAGGTTTAAACCCTATTCATGAGTAGTCATACTCTCTTGTTACATGAATAT
181 | CCCGTTTTTTACATCTGTGTTAATAAGATTGGTACTTGCGCTCCTAGCAATAGCCTTCTACACACTAATA
182 | GAGCGAAAATTCCTTGGGTACTTCCACCTACGAAAAGGGCCTAATAAGGTAGGGCTAATGGGGCTTCCGC
183 | AACCATTTGCTGACGCAATTAAACTTTTTGTAAAGGAGCAAGCTAAACCTAACCCCTCGAATCAAACCCC
184 | GTTTCTATTTGCCCCTACCATAGGATTAGTTTTAGCTCTCTTAATGTGAGTAATCTACCCCCATTCCCAT
185 | CAATCATTTTTCATTCAATTTAGGGTCCTATATTTTTTATGTGTATCAAGAATAAACGTATATACGACCT
186 | TTCTCGCTGGCTGAAGATCCAACTCTAAATATGCTCTACTGGGAGCTTTACGGGGGGTTGCTCAAACTAT
187 | TTCATACGAGGTTAGAATATCCTTAATTCTTCTCAGATCCTTAATTATTTTATCAACTATAGATTTCACT
188 | AAAATATTCTCGTATTCCTGAATCCTATTTATATTTATTCCCTTGGCAGTAGTATGGTTCATTACTAATC
189 | TAGCAGAGACTAATCGAACCCCGTTCGATTTTGCAGAGGGCGAGTCAGAACTGGTATCCGGGTTTAATGT
190 | TGAGTACAGGGCTGGCCTTTTTGCCTTAATCTTCATAGCAGAGTATGCAAATATCTTAATTATGAGCCTA
191 | TTCACAAGTGTTATTTTTATAAGGACCTGCGCTAGGGGTATAGCCAGAGATCTAGTCCTAATTCTTCAAA
192 | CTATTACCTTAGCCATGCTCTTTGTGTGGGTTCGAGCAACATACCCCCGAATACGTTACGACCATCTAAT
193 | AAACCTCACATGAAAAAGATTTCTCCCCCTATCTCTAGCCCTATTAATGATATCTATTCCAATCGCAATA
194 | ATGCTGTGGTACAGCGCCGGATGAACGGATAACTCTGATGACGTTAATTAAGGAACAAGCTTCCCTGTAT
195 | CTAACTAGAGAGCTTGTAAATAGCACTTGACTTTTAATCAAGAGATAGTATAATTATTTCTAGTTAATGA
196 | TCCTTACAGCCTTATCCTCAGCCATTGCACTATTAGTCCCTATTATTATTTTGGGGGCAGCATGAGTTCT
197 | AGCTTCACGATCTACAGAAGATCGAGAAAAGTCATCTCCATTCGAGTGTGGGTTTGACCCAAAAAGAACC
198 | GCACGGATCCCATTCTCAACCCGATTTTTCTTATTGGCCATTATCTTTATCGTATTTGATATCGAGATTG
199 | TTCTTTTGATACCCCTACCCACAATCTTACACACGAGAGATGTATTTACCACTGTAACTACGTCTGTCCT
200 | ATTTCTAATAATTCTTTTAATTGGTTTAATCCATGAGTGAAAGGAAGGATCTCTAGACTGATCTTCCTAG
201 | ATTAACAGGCGAAAATAAGAACTTCTAATTCTTACACGGGGGTTCAACTCCTCCTTAATCTTATGAAATA
202 | CATCAAATCACCTACAATAGCTCTTGCTATATCCACGCTAATTATAAGGACGCTCATAGCCGTATCAAGA
203 | GCCAATTGAATATTCCTTTGAGGGGCTATAGAACTTAATCTCTTAAGATTTATTCCTATTATAATACAAT
204 | CTAACAACAATCAAGAAACAGAGGGAGCTGTGAAATACTTTCTAGCTCAAGCACTGGGATCAGCCTTACT
205 | TCTAATGTCAAGAACATCTATATGAATAACATTCTCCATAATCTCAAACTTTATACCTTTAACTCTTATG
206 | GCCGCGATTATACTAAAATTGGGCAGAGTTCCCTGTCATTTCTGATACCCGTCAGTAATAGCTTCAATTT
207 | CATGAGTATCATGCCTAATCTTATCCTCCTGACAAAAGCTGGCCCCCTTATCTATTTTAGCCTTTTTACT
208 | CCCTCAAAAAAACATAAACTTTATACTATCAATAGCTGCAATAAATGCGCTGTTGGGGGGGGTGATCGGA
209 | ATAAATCAAACTCAACTACGGACCATTATAGCATACTCCTCAATTGGGCATATCGGTTGAATGATAAGAT
210 | TAGCTGCCGTATATAAGCCGAGTTCTTGCATTATATATTTTGTAGTCTACTGCATTTTAATTACCCCTCT
211 | ATTTATAACCATGGGCTATCTAAACATATTCTCTACTAAACACATAAGAAAACTTTCCTCCTATAGAAGA
212 | ACTGTCCACATAGCTTTATTGATAGTTCTTCTATCATTGGGAGGACTACCCCCTTTTACAGGATTCATGC
213 | CAAAACTTATAACCATTATATTGTTAATGCAATCCATAAAAATTATTCTACTTATCCTAATCGCGGGGTC
214 | TATTATAAACCTATTTTTCTATTTAAATATTATTATCTCTTCTATGCCCCTACCCCCACATCTAAAAAAT
215 | GTCGACTCCACTGATATTAAATGTTCATTAAAATTTGTTATTCCAATCTGTACCCTGTCATTAGGGTTGA
216 | GACCTTTTATTATACTAT
217 |
218 |
--------------------------------------------------------------------------------
/task5/README.md:
--------------------------------------------------------------------------------
1 | # Task5 - Comparative genomics - gene set comparison
2 |
3 | This task is a tutorial on comparative genomics with a focus on gene set comparison.
4 |
5 | You are going to download, annotate and compare the genomes of an enterohemorrhagic E. coli (strain [O157:H7](https://en.wikipedia.org/wiki/Escherichia_coli_O157:H7)) versus a non-pathogenic E. coli (strain [K12](https://en.wikipedia.org/wiki/Escherichia_coli_in_molecular_biology#K-12)). You are then going to identify genes and gene duplications that are unique to each organism and biologically interpret your results.
6 |
7 |
8 | ### Requirements
9 |
10 | * Access to a linux-based OS running BASH
11 | * [BLAST](http://blast.ncbi.nlm.nih.gov/)
12 | * [prokka](https://github.com/tseemann/prokka)
13 |
14 |
15 | ## Getting Started
16 |
17 | * Login to your linux environment and create a new folder for your task5.
18 | * Work on your assignment in the folder you created.
19 |
20 |
21 |
22 | ## Retrieving the raw data
23 |
24 | You will be comparing two genomes of E. coli - strain K12 (non-pathogenic lab strain) and O157H7 (pathogenic E. coli associated with disease outbreaks).
25 |
26 | * Download both of these genomes using `wget`. Check the man pages for `wget` or use the --help flag to determine how to save the files with the following file names.
27 | * Name the O157H7 genome O157H7.fna
28 | * Name the K12 genome K12.fna
29 |
30 | ```
31 | ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_O157H7_EDL933_uid259/AE005174.fna
32 | ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__DH10B_uid20079/CP000948.fna
33 | ```
34 |
35 |  Q1 - Look within the ftp directories for these bacterial genome projects. What do the other files contain (e.g., go to: https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_O157H7_EDL933_uid259)?
36 |
37 | ## Annotating both genomes
38 |
39 | * Next, annotate both genomes using `prokka`.
40 |
41 | ```
42 | #this may take a while so be patient... Remember, these are full bacterial genomes as opposed to small mitochondrial contigs.
43 | prokka O157H7.fna --outdir O157H7 --norrna --notrna
44 | prokka K12.fna --outdir K12 --norrna --notrna
45 | ```
46 |
47 | ## Generating gene lists
48 |
49 | * Next, make text files of the predicted gene lists for each genome. Have a look back at Task1 for how to redirect program output to a file.
50 | * Redirect the output for O157H7 to a file named genelist_O157H7.txt inside the O157H7 folder.
51 | * Redirect the output for K12 to a file named genelist_K12.txt inside the K12 folder.
52 |
53 | ```
54 | #generate a gene list text file by grepping the gene names from the .tbl file
55 | cat PROKKA*.tbl | awk '{if ($1 == "gene") {print $2}}' | awk -F '_' '{print $1}' | sort
56 | ```
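For example, the same pipeline can be redirected into the requested file; a minimal sketch, assuming you run it from inside each prokka output folder (shown here for O157H7):

```shell
# run inside the O157H7 prokka output folder;
# writes the sorted gene list to the file requested above
cat PROKKA*.tbl | awk '{if ($1 == "gene") {print $2}}' | awk -F '_' '{print $1}' | sort > genelist_O157H7.txt
```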
57 |
58 |
59 |  Q2 - How many genes are present in each genome?
60 |
61 |
62 | ## Comparing gene lists
63 |
64 | Now, let's compare these lists to find genes that are common to both, duplicated or unique to either.
65 |
66 | A lot of this work can be done simply using the command-line tools `comm` and `uniq`.
67 | Explore the command-line usage and options of these commands using `man`.
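One behaviour worth knowing up front: `uniq` collapses only *adjacent* duplicate lines, which is why lists are normally sorted first. A toy example with made-up gene names:

```shell
# without the sort, the two geneA lines would not be adjacent
# and uniq would keep both of them
printf 'geneB\ngeneA\ngeneA\ngeneC\n' | sort | uniq
# prints:
# geneA
# geneB
# geneC
```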
68 |
69 | ### Comparison including gene duplicates
70 |
71 | We can compare both gene lists like this in your task5 folder:
72 |
73 | ```
74 | comm O157H7/genelist_O157H7.txt K12/genelist_K12.txt >geneListComparison.txt
75 | ```
76 |
77 | Examine the output of `geneListComparison.txt` using `less`.
78 |
79 |  Q3 - What do the genes in column 1, column 2, and column 3 represent?
80 |
81 | Now, suppose we want to output the genes in column 1 (skipping blank lines). We can do so like this:
82 |
83 | ```
84 | cat geneListComparison.txt | awk -F '\t' '{print $1}' | grep -v -e '^$'
85 | ```
86 |
87 | ... and we can then count them by piping this command to `wc -l`.
88 |
89 | ```
90 | cat geneListComparison.txt | awk -F '\t' '{print $1}' | grep -v -e '^$' | wc -l
91 | ```
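To see what each stage of that pipeline does, you can feed it a small tab-separated toy input (made-up gene names, mimicking `comm`-style output):

```shell
# awk keeps only the first tab-separated field;
# grep -v '^$' then drops the lines whose first field was empty
printf 'geneA\t\n\tgeneB\ngeneC\t\n' | awk -F '\t' '{print $1}' | grep -v -e '^$'
# prints:
# geneA
# geneC
```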
92 |
93 | * Analyze the core versus variable gene content for these two strains.
94 |
95 |  Q4 - How many genes are only in the O157H7 genome? Only in the K12 genome? In both?
96 |
97 |
98 | ### Comparison without gene duplicates (finding unique genes)
99 | Construct a simple Venn diagram illustrating the number of shared vs genome-specific genes. In this Venn diagram, if a gene G is duplicated (2 copies) in genome A but present as a single copy in genome B, then gene G will be counted among the genes unique to genome A. This is misleading, since genome B also has a copy of gene G.
100 |
101 | Let's account for this by "removing gene redundancy". This can be done by first filtering each initial gene list using the tool `uniq`. This works here because the gene lists were already sorted when they were generated.
102 |
103 | ```
104 | cd O157H7
105 | uniq genelist_O157H7.txt > unique_genelist_O157H7.txt
106 |
107 | cd ../K12
108 | uniq genelist_K12.txt > unique_genelist_K12.txt
109 |
110 | cd ..
111 | ```
112 |
113 | Now, when we compare these lists using `comm`, we will only be comparing single copies of each gene. Therefore, the result of `comm` should show us only those genes that are unique to genome 1, unique to genome 2, or shared between both.
114 |
115 | ```
116 | comm O157H7/unique_genelist_O157H7.txt K12/unique_genelist_K12.txt > uniqueGeneListComparison.txt
117 | ```
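As an aside, `comm` can also suppress columns directly with its -1/-2/-3 flags, which avoids awk/grep post-processing. A toy sketch with made-up gene names:

```shell
printf 'geneA\ngeneB\ngeneC\n' > listA.txt
printf 'geneB\ngeneC\ngeneD\n' > listB.txt
# -12 suppresses columns 1 and 2, leaving only lines common to both sorted files
comm -12 listA.txt listB.txt
# prints:
# geneB
# geneC
```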
118 |
119 |  Q5 - How many unique genes are only in the O157H7 genome? Only in the K12 genome? In both?
120 |
121 |
122 | ## Going further: inspecting your duplicated and unique genes within each organism
123 |
124 | Now, let's examine the output of the lists above.
125 |
126 | Examine `geneListComparison.txt` to find the gene expansions specific to enterohemorrhagic E. coli O157:H7. The following code will sort the O157:H7-specific genes by their copy number. This will identify those that have undergone the most pathogen-specific duplication.
127 |
128 | ```
129 | cat geneListComparison.txt | awk -F '\t' '{print $1}' | sort | uniq -c | sort -n -r | head -20
130 | ```
131 |
132 | Examine your result carefully. Column 1 states the copy number and column 2 states the gene name.
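The `sort | uniq -c | sort -n -r` idiom is a general-purpose frequency counter worth remembering. On a toy list (made-up gene names) it ranks items by how often they occur:

```shell
# count occurrences of each line, then rank counts from highest to lowest
printf 'geneA\ngeneB\ngeneA\ngeneA\ngeneB\ngeneC\n' | sort | uniq -c | sort -n -r
# geneA (3 copies) ranks first, then geneB (2), then geneC (1)
```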
133 |
134 |  Q6 - Which gene in O157H7 occurs most frequently? Which genes in K12 occur most frequently?
135 |
136 |
137 |
138 | ---
139 |
140 | # ASSIGNMENT QUESTIONS
141 |
142 | The questions for this task are indicated by the lines starting with Q1, Q2, etc. above. Please submit your answers using the quiz on LEARN.
143 |
--------------------------------------------------------------------------------
/task6/README.md:
--------------------------------------------------------------------------------
1 | # Task6 - Resequencing: variant calling from NGS data
2 |
3 | In this lab (based on this [tutorial](https://angus.readthedocs.io/en/2014/variant.html)) you will be exploring some resequencing data from Richard Lenski's famous E. coli long term evolution experiment.
4 | More on Lenski's long term evolution experiment can be found in this [article](http://www.nature.com/nature/journal/v489/n7417/full/nature11514.html), with a summary here: https://en.wikipedia.org/wiki/Richard_Lenski.
5 |
6 | You will be mapping the reads from a single population of E. coli at 38,000 generations that has evolved citrate-utilization capacity (Cit+). You will map the reads to the reference genome in order to identify SNPs that have occurred in this lineage (or its ancestors). You will focus on one SNP in particular that has created a "mutator" strain by disrupting the mutS DNA-repair gene.
7 |
8 |
9 | ### Requirements
10 |
11 | #### Command-line tools
12 | * Access to a linux-based OS running BASH
13 | * [bwa](http://bio-bwa.sourceforge.net/)
14 | * [samtools](http://samtools.sourceforge.net/)
15 | * [bcftools](https://samtools.github.io/bcftools/bcftools.html)
16 |
17 | #### Graphical tools
18 |
19 | * You will also need to download and install either Tablet or IGV on your own machine.
20 | * [tablet](https://ics.hutton.ac.uk/tablet/)
21 | * [igv](http://software.broadinstitute.org/software/igv/)
22 |
23 |
24 | ## Getting Started
25 |
26 | * Login to your linux environment and create a new folder for your task6.
27 |
28 | ```
29 | mkdir task6 #creates folder
30 | cd task6 #enters into folder
31 | ```
32 |
33 | ## Retrieving the raw data
34 |
35 | First, download the E. coli reference genome:
36 |
37 | ```
38 | wget https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task6/ecoli-rel606.fa.gz
39 | gunzip ecoli-rel606.fa.gz
40 | ```
41 |
42 | The resequencing data is located at `/data/SRR098038.fastq.gz`. This is a 229 MB file, so it was already downloaded for you using the following command:
43 |
44 | ```
45 | #wget http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098038/SRR098038.fastq.gz
46 | ```
47 |
48 |
49 | ## Mapping your reads to the reference genome
50 |
51 |
52 | Before we can map reads with `bwa` (note: `bowtie` is another popular option), we need to index the reference genome. This can be done with `bwa index`.
53 |
54 | ```
55 | bwa index ecoli-rel606.fa
56 | ```
57 |
58 | Now, let's map the reads to the reference genome. This is also a fairly intensive step that may take a few minutes.
59 |
60 | ```
61 | bwa aln ecoli-rel606.fa /data/SRR098038.fastq.gz > SRR098038.sai
62 | ```
63 |
64 | Make a .SAM file which contains all information about where each read maps onto the reference genome.
65 |
66 | ```
67 | bwa samse ecoli-rel606.fa SRR098038.sai /data/SRR098038.fastq.gz > SRR098038.sam
68 | ```
69 |
70 | Index the reference genome (again) so that `samtools` can work with it.
71 |
72 | ```
73 | samtools faidx ecoli-rel606.fa
74 | ```
75 |
76 | Convert the .SAM file to a .BAM file.
77 |
78 | ```
79 | samtools view -S -b SRR098038.sam > SRR098038.bam
80 | rm SRR098038.sam # remove this large file
81 | ```
82 |
83 | Sort the BAM file and index it.
84 |
85 | ```
86 | samtools sort SRR098038.bam > SRR098038.sorted.bam
87 | samtools index SRR098038.sorted.bam
88 | ```
89 |
90 | ## Viewing your BAM file
91 |
92 | BAM files can be viewed with `igv` or with `tablet`. But let's take a quick look in the terminal.
93 |
94 | ```
95 | samtools tview SRR098038.sorted.bam
96 | ```
97 |
98 | Type `q` to exit when finished.
99 |
100 | ## Variant calling
101 |
102 | Instead of identifying SNPs by eye, use `bcftools` to perform automated variant calling.
103 |
104 | ```
105 | bcftools mpileup -f ecoli-rel606.fa SRR098038.sorted.bam | bcftools call -mv -Ob --ploidy 1 -o calls.bcf
106 |
107 | #convert to vcf (human-readable variant call format). This file should contain all identified SNPs and other variants.
108 | bcftools view calls.bcf > calls.vcf
109 |
110 | ```
111 |
112 |  Q1 - How many total variants are present? Hint: `grep` for a pattern found only in your variant lines.
113 |
114 |
115 | ## Locating a key SNP in Lenski's E. coli evolution experiment
116 |
117 | This lineage of E. coli has a mutation in the mutS gene (protein sequence can be found [here](https://www.uniprot.org/uniprot/P23909.fasta)). This mutation creates a premature stop codon. Your task is to find this mutation within your sequencing data!
118 |
119 | Find the region in the reference genome that encodes the mutS gene using `blast`. You may need to refer to earlier tasks to help you with this.
120 |
121 | Now, extract the mapped area for this region from your .bam file. The command will be something like this:
122 |
123 | ```
124 | samtools view SRR098038.sorted.bam "ecoli:START-END" > region.sam # where START and END are position numbers
125 | ```
126 |
127 |
128 | Download the following two files and open these files in `tablet` on your home machine.
129 |
130 | - the region.sam file you created above
131 | - the reference genome ([ecoli-rel606.fa](https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task6/ecoli-rel606.fa.gz))
132 |
133 | Now, locate the region containing the mutS gene within `tablet`, and search for the premature stop codon variant.
134 |
135 | Here is an [example read-pileup](https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task6/example-pileup.png) in tablet that highlights a variant position.
136 |
137 |
138 |  Q2 - Paste a screenshot highlighting this mutation (you will need to zoom in) and show the amino acid translation.
139 |
140 |
141 |  Q3 - The premature stop codon mutation is from the codon “_ _ _ ” to “_ _ _ ”.
142 |
143 |  Q4 - The amino acid encoded by this codon before this mutation is _____?
144 |
145 |
146 | Once you are finished, please delete the files in your task6 folder with the `rm` command:
147 |
148 | > :warning: **Caution:** Be careful as `rm` permanently deletes files!
149 |
150 | ```
151 | cd task6
152 | rm *
153 | ```
154 |
155 |
156 |
157 | ---
158 |
159 | # ASSIGNMENT QUESTIONS
160 |
161 | The questions for this task are indicated by the lines starting with Q1, Q2, etc. above.
162 | Submit your answers to questions 1-4 to the QUIZ on LEARN.
163 |
164 |
165 |
166 |
--------------------------------------------------------------------------------
/task6/ecoli-rel606.fa.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task6/ecoli-rel606.fa.gz
--------------------------------------------------------------------------------
/task6/example-pileup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task6/example-pileup.png
--------------------------------------------------------------------------------
/task7/README.md:
--------------------------------------------------------------------------------
1 | # Transcriptomics and detection of differentially expressed genes (DEGs)
2 |
3 | In this lab, you will be analyzing RNA-seq data from a study by Aguiar et al. (2020) [here](https://pubmed.ncbi.nlm.nih.gov/31646766/).
4 | This study exposed a human lung cell line (Calu-3 cells) to tobacco smoke, cannabis smoke, and a common drug intervention (LABA/GCS).
5 |
6 | You will be measuring and comparing the transcript expression levels between normal untreated cells (controls) and cells exposed to tobacco smoke extract (TSE).
7 | There are 4 TSE samples vs 4 control samples as labeled below.
8 |
9 | | Sample ID | Status |
10 | | --------------- | --------------- |
11 | | SRR8451881 | Control |
12 | | SRR8451882 | Control |
13 | | SRR8451883 | Control |
14 | | SRR8451884 | Control |
15 | | SRR8451885 | TSE |
16 | | SRR8451886 | TSE |
17 | | SRR8451887 | TSE |
18 | | SRR8451888 | TSE |
19 |
20 | The goal is to identify which genes are up-regulated and down-regulated following tobacco smoke exposure.
21 |
22 | ### Requirements
23 |
24 | #### Command-line tools
25 | * Access to a linux-based OS running BASH
26 | * [Salmon](https://combine-lab.github.io/salmon/)
27 |
28 | #### Graphical tools
29 |
30 | You will also need to download and install R on your own machine with the following packages
31 |
32 | * tximport
33 | * DESeq2 or edgeR
34 |
35 |
36 | ## Getting Started
37 |
38 | * Login to your linux environment and create a new folder for this task
39 |
40 | ```
41 | mkdir transcriptomics-task #creates folder
42 | cd transcriptomics-task #enters into folder
43 | ```
44 |
45 | ## Retrieving the raw data and reference transcriptome
46 |
47 | **NOTE: This has been done for you already and the files are located at: `/fsys1/data/task4`**
48 | If you are curious about how this was done, see below; otherwise you can skip ahead to the next section, "Transcript quantification with Salmon".
49 |
50 | ### Download human reference transcriptome and create a Salmon index
51 |
52 | ```
53 | #download a pre-made reference transcriptome from Gencode
54 | wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz
55 | gunzip gencode.v29.transcripts.fa.gz
56 |
57 | #index your reference transcriptome so that it can be analyzed with `Salmon`
58 | salmon index -t gencode.v29.transcripts.fa -i gencode_v29_idx
59 |
60 | ```
61 |
62 | ### Download the RNA-seq dataset
63 |
64 | Next, we need the RNA-seq data (8 samples, each with forward and reverse reads, so 16 files in total) from the public EBI FTP site. This has also been downloaded for you, using the following code:
65 |
66 | ```
67 | #download the list of urls first
68 | wget https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task7/ftp-list.txt
69 |
70 | #download all of the .fastq files - NOTE: THIS STEP CAN TAKE A LONG TIME (~1 hr)
71 | wget -i ftp-list.txt
72 |
73 | ```
74 |
75 |
76 | ## Transcript quantification with Salmon
77 |
78 | Now, you are going to measure transcript abundance using `Salmon`. For a single sample with paired-end reads (e.g., `forward_reads.fastq.gz` and `reverse_reads.fastq.gz`), this could be done using the following line:
79 |
80 | ```
81 | #result will be output to "quants" folder
82 | # -p 6 means that six CPU threads will be used
83 | salmon quant -i gencode_v29_idx -l A -1 forward_reads.fastq.gz -2 reverse_reads.fastq.gz -p 6 -o quants
84 | ```
85 |
86 | But the above line is just an example for a single sample. Here is a bash script that will run `Salmon` on all 8 samples we have just downloaded. Run this bash script in your `transcriptomics-task` folder.
87 | ```
88 | #download bash script
89 | wget https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task7/runSalmon.bash
90 |
91 | #run bash script. This may take a few hours...
92 | bash runSalmon.bash
93 |
94 | ```
95 |
96 | ## Exploring the transcript counts
97 |
98 | Now, the transcript expression levels have been quantified for each of your 8 samples. Look within the `quants` folder and examine the `quant.sf` files that you have produced for each sample.
99 |
100 | * Take note of which column contains the transcript id. You can do this by using `head -1` to look at the header of the `quant.sf` file.
101 | * Also take note of which column contains the TPM (transcripts per million) expression level.
102 |
103 | Suppose you are interested in the transcript "ENST00000379727.7".
104 |
105 | ```
106 | #go to your quants folder
107 | cd quants
108 |
109 | #inspect the expression levels for this transcript
110 | grep "ENST00000379727.7" */quant.sf
111 | ```
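The `grep` above prints whole rows; to see just the id and its TPM you can pipe it through `awk`. A minimal sketch on a mock `quant.sf` (the header shown is the usual Salmon column layout; the values are made up):

```shell
# Build a tiny mock quant.sf (real files have one row per transcript)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > quant.sf
printf 'ENST00000379727.7\t2021\t1854.0\t12.5\t230\n' >> quant.sf

# Column 1 is the transcript id; TPM is column 4
head -1 quant.sf

# Print only the id and TPM for the transcript of interest
grep "ENST00000379727.7" quant.sf | awk -F'\t' '{print $1, $4}'
```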
112 |
113 |  Q1 - Has this transcript's abundance increased, decreased, or stayed the same following smoke exposure?
114 |
115 | Support your answer using statistics. Perform a t-test comparing the expression level of this transcript between the 4 smoke-treated samples versus 4 control samples. Use any program of your choice to do so (R, excel, Google Sheets, etc.).
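If you prefer the command line, the t-statistic itself can even be sketched with `awk`. The eight numbers below are made-up TPM values (4 controls, then 4 TSE samples), not real data:

```shell
# Welch t-statistic for two groups of 4 (made-up TPM values)
echo "10 12 11 13 20 22 21 23" | awk '{
  n = 4
  for (i = 1; i <= n; i++) { s1 += $i; s2 += $(i+n) }
  m1 = s1/n; m2 = s2/n                       # group means
  for (i = 1; i <= n; i++) { v1 += ($i-m1)^2; v2 += ($(i+n)-m2)^2 }
  v1 /= (n-1); v2 /= (n-1)                   # sample variances
  printf "t = %.2f\n", (m1-m2)/sqrt(v1/n + v2/n)
}'
```

You would still look up the corresponding p-value, or simply run `t.test()` in R on the same numbers.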
116 |
117 |  Q2 - Is the difference statistically significant (p < 0.05)?
118 |
119 |
120 | ## Detecting differentially expressed genes (DEGs) in R
121 |
122 | Now that you have measured transcript abundance for all samples using `Salmon`, you can perform a differential expression analysis using a tool such as DESeq2 or edgeR.
123 |
124 | On your local machine, install the following R packages:
125 |
126 | ```
127 | tximport
128 | DESeq2
129 | EnhancedVolcano
130 | ```
131 |
132 | Now, on your local machine, open your terminal and download the quant files produced by Salmon:
133 |
134 | ```
135 | scp -r userid@genomics1.private.uwaterloo.ca:~/transcriptomics-task/quants/ .
136 | ```
137 |
138 | Now, open R and load packages and set working directory:
139 |
140 | ```
141 | #load required packages
142 | library(tximport)
143 | library(DESeq2)
144 | library(EnhancedVolcano)
145 |
146 | #go to your folder containing your quant files you just downloaded
147 | setwd("/path/to/quants")
148 |
149 | ```
150 |
151 | Download the Gencode transcript-to-gene-symbol (HGNC) mapping, which `tximport` will use as its `tx2gene` table
152 |
153 | ```
154 | system("wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.metadata.HGNC.gz")
155 | system("gunzip gencode.v29.metadata.HGNC.gz")
156 | genesymbols = read.delim("gencode.v29.metadata.HGNC")
157 | ```
158 |
159 | Read the quant files into R
160 |
161 | ```
162 | files = paste(list.dirs('.', recursive=FALSE),"/","quant.sf",sep='')
163 | #make sure to check your list of files to ensure that this step worked
164 |
165 | txi.salmon <- tximport(files, type = "salmon", tx2gene = genesymbols, ignoreAfterBar =T)
166 | ```
167 |
168 | Run DESeq2 to detect differentially expressed genes between the two categories
169 |
170 | ```
171 | meta = data.matrix(cbind(files,as.numeric(c(0,0,0,0,1,1,1,1))))
172 |
173 | #check that the first four samples are control samples (0) and the last four are TSE (1)
174 | meta
175 |
176 | colnames(meta) = c("filenames","category")
177 |
178 | dds <- DESeqDataSetFromTximport(txi.salmon, meta, ~as.factor(category)) # this is detecting DEGs between the "0" and "1" samples
179 | dds <- DESeq(dds)
180 |
181 | res <- results(dds, lfcThreshold=0.5,alpha=0.01)
182 | ```
183 |
184 | Examine the results
185 |
186 | ```
187 | summary(res)
188 |
189 | ```
190 |
191 |  Q3 - How many significant up- and down-expressed genes were detected?
192 |
193 |  Q4 - Produce a table of the top 10 differentially expressed genes along with their fold-changes and adjusted p-values. Also include the code you used to do so.
194 |
195 |
196 | Next, generate a volcano plot using the following code.
197 |
198 | ```
199 | EnhancedVolcano(res,
200 | lab = rownames(res),
201 | x = 'log2FoldChange',
202 | y = 'pvalue')
203 | ```
204 |
205 |  Q5 - Paste an image of a volcano plot.
206 |
207 |
208 |
209 | ---
210 |
211 | # ASSIGNMENT QUESTIONS
212 |
213 | The questions for this task are indicated by the lines marked Q1-Q5 above.
214 | Please submit your answers in the dropbox on LEARN.
215 |
216 |
217 |
218 |
--------------------------------------------------------------------------------
/task7/ftp-list.txt:
--------------------------------------------------------------------------------
1 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/008/SRR8451888/SRR8451888_1.fastq.gz
2 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/008/SRR8451888/SRR8451888_2.fastq.gz
3 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/007/SRR8451887/SRR8451887_1.fastq.gz
4 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/007/SRR8451887/SRR8451887_2.fastq.gz
5 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/006/SRR8451886/SRR8451886_1.fastq.gz
6 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/006/SRR8451886/SRR8451886_2.fastq.gz
7 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/005/SRR8451885/SRR8451885_1.fastq.gz
8 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/005/SRR8451885/SRR8451885_2.fastq.gz
9 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/004/SRR8451884/SRR8451884_1.fastq.gz
10 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/004/SRR8451884/SRR8451884_2.fastq.gz
11 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/003/SRR8451883/SRR8451883_1.fastq.gz
12 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/003/SRR8451883/SRR8451883_2.fastq.gz
13 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/002/SRR8451882/SRR8451882_1.fastq.gz
14 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/002/SRR8451882/SRR8451882_2.fastq.gz
15 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/001/SRR8451881/SRR8451881_1.fastq.gz
16 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/001/SRR8451881/SRR8451881_2.fastq.gz
17 |
--------------------------------------------------------------------------------
/task7/runSalmon.bash:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | for samp in SRR8451881 SRR8451882 SRR8451883 SRR8451884 SRR8451885 SRR8451886 SRR8451887 SRR8451888;
4 | do
5 | echo "Processing sample $samp"
6 | salmon quant -i /fsys1/data/task4/gencode_v29_idx -l A \
7 | -1 /fsys1/data/task4/${samp}_1.fastq.gz \
8 | -2 /fsys1/data/task4/${samp}_2.fastq.gz \
9 | -p 6 -o quants/${samp}_quant
10 | done
11 |
--------------------------------------------------------------------------------
/task8/README.md:
--------------------------------------------------------------------------------
1 | # Task8 - Analysis of 16S amplicon sequencing data using Kraken2/Bracken
2 |
3 | In this lab, you will be analyzing 16S and metagenomic data from a study by Lobb et al. (2020) [here](https://pubmed.ncbi.nlm.nih.gov/32345738/).
4 | This study examined the microbial communities of decomposing fish in local rivers near Waterloo, ON, Canada.
5 |
6 | There are 48 16S rRNA samples with the following metadata.
7 |
8 | | Sample ID | Name |
9 | | --------------- | --------------- |
10 | | SRS6112281 | WW1.1 |
11 | | SRS6112282 | WW1.2 |
12 | | SRS6112283 | WW1.3 |
13 | | SRS6112267 | EE1.1 |
14 | | SRS6112268 | EE1.2 |
15 | | SRS6112279 | EE1.3 |
16 | | SRS6112285 | WW4.1 |
17 | | SRS6112284 | WW4.2 |
18 | | SRS6112286 | WW4.3 |
19 | | SRS6112293 | WE4.1 |
20 | | SRS6112295 | WE4.2 |
21 | | SRS6112296 | WE4.3 |
22 | | SRS6112290 | EE4.1 |
23 | | SRS6112302 | EE4.2 |
24 | | SRS6112304 | EE4.3 |
25 | | SRS6112271 | EW4.1 |
26 | | SRS6112272 | EW4.2 |
27 | | SRS6112273 | EW4.3 |
28 | | SRS6112287 | WW8.1 |
29 | | SRS6112288 | WW8.2 |
30 | | SRS6112289 | WW8.3 |
31 | | SRS6112297 | WE8.1 |
32 | | SRS6112298 | WE8.2 |
33 | | SRS6112299 | WE8.3 |
34 | | SRS6112305 | EE8.1 |
35 | | SRS6112306 | EE8.2 |
36 | | SRS6112307 | EE8.3 |
37 | | SRS6112274 | EW8.1 |
38 | | SRS6112275 | EW8.2 |
39 | | SRS6112276 | EW8.3 |
40 | | SRS6112291 | WW10.1 |
41 | | SRS6112292 | WW10.2 |
42 | | SRS6112294 | WW10.3 |
43 | | SRS6112300 | WE10.1 |
44 | | SRS6112301 | WE10.2 |
45 | | SRS6112303 | WE10.3 |
46 | | SRS6112308 | EE10.1 |
47 | | SRS6112269 | EE10.2 |
48 | | SRS6112270 | EE10.3 |
49 | | SRS6112277 | EW10.1 |
50 | | SRS6112278 | EW10.2 |
51 | | SRS6112280 | EW10.3 |
52 |
53 | Our goal will be to perform taxonomic profiling of these 16S rRNA datasets using Kraken2/Bracken.
54 |
55 | ### Requirements
56 |
57 | #### Command-line tools
58 | * Access to a linux-based OS running BASH
59 | * kraken2 and Bracken
60 | * metaspades
61 |
62 | #### Graphical tools
63 |
64 | You will also need to download and install [R](https://www.r-project.org) on your own machine with the following packages
65 |
66 | * ggplot2
67 | * pheatmap
68 | * reshape2
69 | * viridisLite
70 |
71 |
72 | ## Getting Started
73 |
74 | * Login to your linux environment and create a new folder for your task8
75 |
76 | ```
77 | mkdir task8 #creates folder
78 | cd task8 #enters into folder
79 | ```
80 |
81 | ## Retrieving the raw data
82 |
83 | The data has already been downloaded for you, and is located in the `/fsys1/data/lobb-et-al/` folder
84 |
85 | If you're curious, the original data was downloaded from the NCBI SRA using this command:
86 | ```
87 | fastq-dump --split-files SRS6112303 SRS6112301 SRS6112300 SRS6112299 SRS6112298 SRS6112297 SRS6112296 SRS6112295 SRS6112293 SRS6112294 SRS6112292 SRS6112291 SRS6112289 SRS6112288 SRS6112287 SRS6112286 SRS6112284 SRS6112285 SRS6112283 SRS6112282 SRS6112281 SRS6112280 SRS6112278 SRS6112277 SRS6112276 SRS6112275 SRS6112274 SRS6112273 SRS6112272 SRS6112271 SRS6112270 SRS6112269 SRS6112308 SRS6112307 SRS6112306 SRS6112305 SRS6112304 SRS6112302 SRS6112290 SRS6112279 SRS6112268 SRS6112267 SRS6098991 SRS6098990 SRS6098989 SRS6098988 SRS6098999 SRS6098998 SRS6098997 SRS6098996 SRS6098995 SRS6098994 SRS6098993 SRS6098992 SRS6098987 SRS6098986
88 | ```
89 |
90 |
91 | ## Quality filtering
92 |
93 | We have previously covered the use of tools such as `fastqc` and `trimmomatic` to quality filter our dataset. QC is a required step for any high-throughput sequencing pipeline, but for simplicity we will skip it for the purposes of this tutorial.
94 |
95 | ## Taxonomic classification of 16S reads using Kraken2
96 |
97 | ### Analyzing single samples
98 |
99 | Tools such as `QIIME2` and `Mothur` are commonly used for analyzing 16S rRNA sequences. For this tutorial, we will be using a different tool called `Kraken2`.
100 |
101 | Suppose we wanted to analyze a single sample (e.g., SRS6112303). We can do so with the following Kraken2 command:
102 |
103 | ```
104 | CLASSIFICATION_LVL=G # this will set an environmental variable for the taxonomic level of classification desired (G = "Genus", S = "Species", etc.)
105 | krakenDB=/data/krakendb/16S_Greengenes_k2db/ #this is the location of the kraken2 database you want to use for classification
106 | fastq1=/fsys1/data/lobb-et-al/SRS6112303_1.fastq
107 | fastq2=/fsys1/data/lobb-et-al/SRS6112303_2.fastq
108 |
109 | kraken2 --db $krakenDB --paired --report report.txt --output kraken.out $fastq1 $fastq2
110 |
111 | bracken -d $krakenDB -l $CLASSIFICATION_LVL -i report.txt -o bracken.out
112 | ```
113 |
114 | `bracken.out` will look like this:
115 |
116 | ```
117 | name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
118 | Parabacteroides distasonis 2601 S 8 53 61 0.00554
119 | Parabacteroides gordonii 2602 S 1 37 38 0.00347
120 | Bacteroides fragilis 2596 S 11 2044 2055 0.18386
121 | Bacteroides uniformis 2599 S 1 9 10 0.00094
122 | Clostridium pasteurianum 2769 S 278 346 624 0.05586
123 | Clostridium perfringens 2770 S 146 986 1132 0.10128
124 | Clostridium subterminale 2772 S 78 2011 2089 0.18691
125 | Clostridium bowmanii 2764 S 13 64 77 0.00692
126 | Clostridium butyricum 2765 S 10 102 112 0.01009
127 | Clostridium neonatale 2768 S 3 33 36 0.00328
128 | Alkaliphilus transvaalensis 2762 S 3 1 4 0.00037
129 | Sporomusa polytropa 2800 S 4 98 102 0.00913
130 | Veillonella dispar 2801 S 22 255 277 0.02478
131 | ...
132 | ...
133 |
134 | ```
135 |
136 | ### Analyzing many samples
137 |
138 | But these commands run Kraken2/Bracken on only a single sample. What do we do if we want to run them on all samples?
139 |
140 | First, you need to have a list of the samples you want to analyze. This has been done for you with the file at `/fsys1/data/lobb-et-al/files.txt`.
141 |
142 | Then, we will create a bash script called `runAll.bash` with the following contents.
143 |
144 | ```
145 | #!/bin/bash
146 |
147 | # $1 is the file containing the list of samples
148 | # $2 is the classification level
149 |
150 | CLASSIFICATION_LVL=$2
151 |
152 | while IFS=$'\t' read sample
153 | do
154 | echo "processing sample $sample"
155 |
156 | kraken2 --db /data/krakendb/16S_Greengenes_k2db/ --paired --report $sample.$CLASSIFICATION_LVL.kraken --output $sample.$CLASSIFICATION_LVL.kraken.out /fsys1/data/lobb-et-al/${sample}_1.fastq /fsys1/data/lobb-et-al/${sample}_2.fastq
157 |
158 | bracken -d /data/krakendb/16S_Greengenes_k2db -l $CLASSIFICATION_LVL -i $sample.$CLASSIFICATION_LVL.kraken -o $sample.$CLASSIFICATION_LVL.bracken.out
159 |
160 | done < $1
161 | ```
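The `while ... read ... done < file` idiom is the key piece of the script above. Here is a minimal, runnable sketch with `echo` standing in for the kraken2/bracken calls and a made-up sample list:

```shell
# Make a small sample list (stand-in for files.txt)
printf 'SRS6112281\nSRS6112282\n' > samples.txt

# Read one sample id per line, exactly as runAll.bash does
while IFS=$'\t' read -r sample
do
  echo "processing sample $sample"
done < samples.txt
```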
162 |
163 | Let's now run `runAll.bash` to apply Kraken2 and Bracken to all of our 16S samples.
164 |
165 | ```
166 | #first let's create a new folder
167 | mkdir order_classification
168 |
169 | cd order_classification
170 |
171 | # this will perform taxonomic classification at the Order level
172 | bash ../runAll.bash /fsys1/data/lobb-et-al/files.txt O # assumes runAll.bash is in the parent folder; now wait a while....
173 |
174 | ```
175 |
176 | Once completed, you will see that your folder is full of `.bracken.out` output files.
177 | To merge these together into a single file containing bracken output for all your samples, do the following:
178 |
179 |
180 | ```
181 | python2.7 /usr/local/bin/Bracken-2.5/analysis_scripts/combine_bracken_outputs.py --files $(ls *.bracken.out) -o combined.order.out
182 | ```
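Conceptually, this merge joins the per-sample tables on the taxon name. A toy sketch with `join` on two hypothetical two-column count files (the real script also carries the `_num`/`_frac` columns and sample labels):

```shell
# Two hypothetical per-sample count tables, already sorted on the taxon column
printf 'Bacteroidales\t120\nClostridiales\t300\n' > sampleA.counts
printf 'Bacteroidales\t80\nClostridiales\t410\n' > sampleB.counts

# Join on column 1: one row per taxon, with both samples' counts
join -t $'\t' sampleA.counts sampleB.counts
```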
183 |
184 |
185 | ## Plotting results in R
186 |
187 | Now, download your `combined.order.out` file to your local computer and load [R](https://www.r-project.org/) for further analysis and plotting.
188 |
189 | First load these libraries. Install these first if they are not already installed.
190 |
191 | ```
192 | library(ggplot2)
193 | library(reshape2)
194 | library(viridisLite)
195 | ```
196 |
197 | Next, load your data
198 |
199 | ```
200 | tb = read.delim("combined.order.out",header=T,row.names=1)
201 |
202 | #Bracken output has _frac and _num columns. We will just analyze the _num columns.
203 | tbp = tb[,grep("_num",colnames(tb))]
204 |
205 | #Transpose the table
206 | tbp <- t(tbp)
207 |
208 | #Convert to proportions
209 | tb_prop<-as.data.frame(round(prop.table(as.matrix(tbp), 1) * 100,1))
210 |
211 | #Choose a selection of taxa with a % > 3 (Note: might have to play around with this until you get a reasonable number of taxa to display)
212 | tb_sub <- tb_prop[,apply(tb_prop, 2, function(x) max(x, na.rm = TRUE))>3]
213 |
214 | ```
215 |
216 | For plotting, we have to do a few more modifications to the data matrix
217 | ```
218 | #Melt the dataframe for plotting
219 | tbm <- as.data.frame(melt(as.matrix(tb_sub)))
220 |
221 | #fix labels
222 | tbm[,1] = within(tbm, Var1<-data.frame(do.call('rbind', strsplit(as.character(Var1), '.', fixed=TRUE))))[,1][,1]
223 |
224 | #Turn 0s into NAs
225 | tbm[tbm == 0] <- NA
226 |
227 | #Set the order of the taxa on the plot (Note: optional)
228 | tbm$Var2 <- factor(tbm$Var2, levels = row.names(as.table(sort(colMeans(tb_sub)))))
229 |
230 | ```
231 |
232 | Now, we can plot using `ggplot2`. Note: the following ggplot command sets many cosmetic parameters; a much simpler call would also work.
233 |
234 | ```
235 | ggplot(tbm, aes(Var1, Var2, size = value, fill = value)) + geom_point(shape = 21, alpha = 0.4) + ggtitle("") + xlab("") + ylab("") + theme(axis.text = element_text(colour = "black", size = 12), text = element_text(size = 15), axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + scale_size_area(max_size = 15, guide = "none") + labs(fill = "Relative\nfrequency (%)") + scale_fill_viridis_c()
236 | ```
237 | This should produce the following plot:
238 |
239 | 
240 |
241 | We can also create a barplot by doing the following:
242 |
243 | ```
244 | ggplot(tbm, aes(fill=Var2, y=value, x=Var1)) +
245 | geom_bar(position="fill", stat="identity", col="grey50") +
246 | scale_y_continuous(labels=scales::percent) +
247 | xlab("") + ylab("Relative frequency") + labs(fill="Order") +
248 | theme(axis.text.x= element_text(angle = 90, hjust = 1))
249 |
250 | ```
251 |
252 | ... which should produce:
253 |
254 | 
255 |
256 |
257 | ## Adding in metadata annotations
258 |
259 | The last plot contains sample names, but let's replace these names with annotations from the metadata that are more informative.
260 |
261 | First, make sure you have a `metadata.txt` text file that contains the 48 samples (column 1) and names (column 2) listed at the top of the page.
262 | It should look like this:
263 |
264 | ```
265 | SRS6112281 WW1.1
266 | SRS6112282 WW1.2
267 | SRS6112283 WW1.3
268 | SRS6112267 EE1.1
269 | SRS6112268 EE1.2
270 | SRS6112279 EE1.3
271 | ...
272 | ```
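One quick way to generate this file from the shell (only the first six rows are shown; extend the argument list to cover all 48 samples from the table at the top of the page):

```shell
# Tab-separated sample-id / name pairs, written two fields per row
printf '%s\t%s\n' \
  SRS6112281 WW1.1 \
  SRS6112282 WW1.2 \
  SRS6112283 WW1.3 \
  SRS6112267 EE1.1 \
  SRS6112268 EE1.2 \
  SRS6112279 EE1.3 > metadata.txt

# Sanity check: six rows so far
wc -l metadata.txt
```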
273 |
274 | Now, load in your metadata
275 | ```
276 | metadata = read.table("metadata.txt")
277 | ```
278 |
279 | Now, let's subset our data matrix to include only the metadata samples, and let's also re-order the variables so that they plot in the desired order
280 |
281 | ```
282 | #subset tbm to only include those samples in metadata
283 | tbm = tbm[which(tbm[,1] %in% metadata[,1]),]
284 | tbm[,1] = metadata[match(tbm[,1],metadata[,1]),2]
285 | tbm$Var1 = factor(tbm$Var1,levels= metadata[,2])
286 |
287 | ggplot(tbm, aes(fill=Var2, y=value, x=Var1)) +
288 | geom_bar(position="fill", stat="identity", col="grey50") +
289 | scale_y_continuous(labels=scales::percent) +
290 | xlab("") + ylab("Relative frequency") + labs(fill="Order") +
291 | theme(axis.text.x= element_text(angle = 90, hjust = 1))
292 | ```
293 |
294 | This should produce:
295 |
296 | 
297 |
298 |
299 | How does this result compare to the result from Lobb et al. (2020) [here](https://pubmed.ncbi.nlm.nih.gov/32345738/) ?
300 |
301 |
302 | Lastly, let's create a heatmap and add in an annotation category
303 |
304 | ```
305 | library(pheatmap)
306 |
307 | # convert the tbm table back to a 2D matrix
308 | tb = acast(tbm, Var1 ~ Var2,value.var='value',fill=0)
309 | tb = t(tb) #transpose
310 |
311 | # let's split the names (EE, WW, etc.) into a matrix that we can use as annotations
312 | annot = data.frame(do.call("rbind", strsplit(as.character(metadata[,2]), "", fixed = TRUE)))[,c(1,2)]
313 | rownames(annot) = metadata[,2]
314 | colnames(annot) = c("Fish_Origin","Water_Origin")
315 |
316 | # specify the colors
317 | ann_colors = list(
318 | Fish_Origin = c(W = "#EBEBEB", E = "#424242"),
319 | Water_Origin = c(W = "#EBEBEB", E = "#424242")
320 | )
321 |
322 | # plot
323 | pheatmap(tb,annotation_col=annot,cluster_cols=F,annotation_colors=ann_colors,color = viridis(1000))
324 | ```
325 |
326 | This should produce:
327 |
328 | 
329 |
330 |
331 |
332 |
333 |
334 |
--------------------------------------------------------------------------------
/task8/barplot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/barplot1.png
--------------------------------------------------------------------------------
/task8/barplot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/barplot2.png
--------------------------------------------------------------------------------
/task8/bubbleplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/bubbleplot.png
--------------------------------------------------------------------------------
/task8/bubbleplot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/bubbleplot1.png
--------------------------------------------------------------------------------
/task8/pheatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/pheatmap.png
--------------------------------------------------------------------------------
/task9/README.md:
--------------------------------------------------------------------------------
1 | # Task9 - Analysis of human SNP variation from 1000 genomes data
2 |
3 | In this lab, you will be analyzing available data from the 1000 genomes project - https://en.wikipedia.org/wiki/1000_Genomes_Project
4 |
5 | You will extract variants from a region of a chromosome of interest and visualize SNP variation across human populations in R.
6 |
7 | ### Requirements
8 |
9 | #### Command-line tools
10 | * Access to a linux-based OS running BASH
11 | * Tabix
12 | * vcftools
13 | * R
14 |
15 |
16 | # Linux command-line
17 |
18 | ## Getting Started
19 |
20 | * Login to your linux environment and create a new folder for your task9
21 |
22 | ```
23 | mkdir task9 #creates folder
24 | cd task9 #enters into folder
25 | ```
26 |
27 | ## Download the VCF file for your chromosome of interest
28 |
29 | e.g., below we will download chromosome 12
30 |
31 | ```
32 | wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
33 |
34 | wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi
35 | ```
36 |
37 | ## Download the reference genome (optional, if needed)
38 | ```
39 | wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
40 |
41 | gunzip human_g1k_v37.fasta.gz
42 |
43 | wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
44 | ```
45 |
46 |
47 | ## Use `tabix` to extract the region of interest from the chromosome
48 |
49 | e.g., suppose we are interested in the variants found across the 1000-bp region 49687909-49688909 of chromosome 12
50 |
51 | ```
52 | tabix -fh ALL.chr12.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz 12:49687909-49688909 > region.vcf
53 | ```
54 |
55 | ## Convert the VCF file to a tab-separated file
56 | ```
57 | cat region.vcf | vcf-to-tab > region.tab
58 | ```
59 |
60 | * How many SNPs were detected?
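One way to count them is to count the non-header lines, since each data line in a VCF is one variant record. A sketch on a mock file named `mock.vcf` (made-up coordinates; run the same `grep` on your real `region.vcf`):

```shell
# Mock VCF: header lines start with "#", each remaining line is one variant
printf '##fileformat=VCFv4.1\n' > mock.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> mock.vcf
printf '12\t49687950\trs1\tA\tG\n' >> mock.vcf
printf '12\t49688100\trs2\tC\tT\n' >> mock.vcf
printf '12\t49688500\trs3\tG\tA\n' >> mock.vcf

# -v inverts the match, -c counts: number of variant records
grep -vc '^#' mock.vcf
```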
61 |
62 | ## More links
63 |
64 | * https://vcftools.github.io/perl_examples.html
65 |
66 |
67 | # Data analysis in R
68 |
69 | ## Loading the region.tab data and visualizing it as a heatmap
70 |
71 | ```
72 | # Load required library
73 | if (!require("pheatmap")) {
74 | install.packages("pheatmap")
75 | library(pheatmap)
76 | }
77 |
78 | # Read the header line separately to get the sample names without modification
79 | header_line <- readLines("region.tab", n = 1)
80 | header_parts <- unlist(strsplit(header_line, "\t"))
81 |
82 | # Extract the sample names (from the 4th column onward)
83 | sample_names <- header_parts[4:length(header_parts)]
84 |
85 | # Load the rest of the data
86 | data <- read.table("region.tab", header = TRUE, sep = "\t", stringsAsFactors = FALSE, check.names = FALSE, comment.char = "#")
87 |
88 |
89 | #Load the metadata
90 |
91 | metadata <- read.delim("igsr_samples.tsv",sep='\t',header=T)
92 |
93 | # Define the most common allele for each position and use it as the reference
94 | presence_absence_matrix <- apply(data[, 4:ncol(data)], 1, function(genotypes) {
95 | # Calculate the most common allele
96 | most_common_allele <- names(sort(table(genotypes), decreasing = TRUE))[1]
97 | # Mark as 1 if different from the most common allele, else 0
98 | ifelse(genotypes == most_common_allele, 0, 1)
99 | })
100 |
101 | # Transpose the matrix to have samples as columns
102 | presence_absence_matrix <- t(presence_absence_matrix)
103 |
104 | # Set row and column names for clarity
105 | rownames(presence_absence_matrix) <- paste(data$CHROM, data$POS, sep = ":")
106 | colnames(presence_absence_matrix) <- sample_names
107 |
108 | # Filter out rows with no variation (all 0s) and rows with all 1s
109 | presence_absence_matrix_filtered <- presence_absence_matrix[rowSums(presence_absence_matrix) > 0 & rowSums(presence_absence_matrix) < ncol(presence_absence_matrix), ]
110 |
111 | pop <- metadata[match(sample_names,metadata[,1]),6]
112 |
113 |
114 | # Create a data frame for annotation
115 | annotation_df <- data.frame(Population = pop)
116 | rownames(annotation_df) <- sample_names
117 |
118 |
119 | # Plot heatmap with annotation
120 | pheatmap(
121 | t(presence_absence_matrix_filtered),
122 | color = colorRampPalette(c("white", "blue"))(100), # Use a gradient
123 | main = "Presence-Absence Heatmap for Variant Sites",
124 | cluster_rows = TRUE,
125 | cluster_cols = FALSE,
126 | display_numbers = FALSE,
127 | fontsize_row = 1,
128 | fontsize_col = 6,
129 | annotation_row = annotation_df
130 | )
131 |
132 |
133 | ```
134 | 
135 |
136 |
137 |
138 | ## Plotting the frequencies of specific SNPs per population
139 |
140 | ```
141 | # Define the site of interest
142 | site <- 6
143 |
144 | # Get indices for samples with reference and variant alleles
145 | withRefBase <- which(presence_absence_matrix_filtered[site, ] == 0)
146 | withVariant <- which(presence_absence_matrix_filtered[site, ] == 1)
147 |
148 | # Get population labels for each sample
149 | populations <- pop
150 |
151 | # Count the number of reference and variant alleles in each population
152 | ref_counts <- table(populations[withRefBase])
153 | variant_counts <- table(populations[withVariant])
154 |
155 | # Combine the counts into a data frame for plotting
156 | allele_counts <- data.frame(
157 | Population = unique(populations),
158 | Reference = sapply(unique(populations), function(p) ref_counts[p]),
159 | Variant = sapply(unique(populations), function(p) variant_counts[p])
160 | )
161 |
162 | # Replace NA with 0 where counts are missing
163 | allele_counts[is.na(allele_counts)] <- 0
164 |
165 | # Reshape the data for plotting with ggplot2
166 | library(reshape2)
167 | allele_counts_long <- melt(allele_counts, id.vars = "Population", variable.name = "Allele", value.name = "Count")
168 |
169 | # Plot the frequencies
170 | library(ggplot2)
171 | ggplot(allele_counts_long, aes(x = Population, y = Count, fill = Allele)) +
172 | geom_bar(stat = "identity", position = "dodge") +
173 | labs(title = paste("Allele Frequency at Site", site),
174 | x = "Population",
175 | y = "Frequency",
176 | fill = "Allele Type") +
177 | theme_minimal() +
178 | theme(axis.text.x = element_text(angle = 45, hjust = 1))
179 | ```
180 |
181 |
182 | 
183 |
--------------------------------------------------------------------------------
/task9/SNP-heatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task9/SNP-heatmap.png
--------------------------------------------------------------------------------
/task9/alleleFreq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task9/alleleFreq.png
--------------------------------------------------------------------------------