├── LICENSE.md
├── README.md
├── VMbuild.README
├── _config.yml
├── questionbox.png
├── task1
│   ├── README.md
│   ├── e-coli-genome.fasta.gz
│   ├── e-coli-h20-genome.fasta.gz
│   ├── e-coli-k12-genome.fasta.gz
│   └── gcloud-download.png
├── task1b
│   └── README.md
├── task2
│   ├── README.md
│   ├── mt_barcodes.txt
│   ├── mt_reads.fastq.gz
│   └── tablet-coverage-plot.png
├── task3
│   ├── 16Ssearch.png
│   ├── README.md
│   ├── installingArtemis.md
│   ├── mysteryGenome.fna.gz
│   ├── ntsearch.png
│   └── uniprot2go.py
├── task4
│   ├── README.md
│   ├── act.png
│   ├── installingArtemis.md
│   └── l-terrestris.genome.fa
├── task5
│   └── README.md
├── task6
│   ├── README.md
│   ├── ecoli-rel606.fa.gz
│   └── example-pileup.png
├── task7
│   ├── README.md
│   ├── ftp-list.txt
│   └── runSalmon.bash
├── task8
│   ├── README.md
│   ├── barplot1.png
│   ├── barplot2.png
│   ├── bubbleplot.png
│   ├── bubbleplot1.png
│   └── pheatmap.png
└── task9
    ├── README.md
    ├── SNP-heatmap.png
    ├── alleleFreq.png
    └── igsr_samples.tsv
/LICENSE.md:
--------------------------------------------------------------------------------
1 | 
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # learn-genomics-in-linux
2 |
3 | This repository contains a tutorial on how to use command-line genomics/bioinformatics tools in Linux.
4 | It has been designed for BIOL 469 (Genomics) and BIOL 614 (Bioinformatics Tools and Techniques) @ the University of Waterloo.
5 |
6 | It is divided into a series of tasks:
7 |
8 | * [Task 1](task1/) - Learning the linux command line
9 | * [Task 1b](task1b/) - Command-line BLAST
10 | * [Task 2](task2/) - Genome Assembly
11 | * [Task 3](task3/) - Genome Annotation
12 | * [Task 4](task4/) - Comparative genomics: synteny comparison between two genomes
13 | * [Task 5](task5/) - Comparative genomics: gene set comparison between two genomes
14 | * [Task 6](task6/) - Resequencing: variant calling from NGS data
15 | * [Task 7](task7/) - Transcriptomics and detection of differentially expressed genes
16 | * [Task 8](task8/) - 16S rRNA amplicon sequencing analysis using Kraken2+Bracken
17 | * TBD - ChIP-seq data analysis
18 | * TBD - GWAS
19 | * TBD - Metagenomics: taxonomic and functional profiling
20 |
21 |
22 | # Requirements & Software Installation
23 |
24 | The main requirement is that you have access to a Linux-based OS such as Ubuntu. If you have access through a remote server that has been supplied to you, then you are set. If not, one option is to install Linux as a virtual machine on your local system, or to use the Google Compute Engine (e.g., with [free credits](https://cloud.google.com/free/)).
25 |
26 |
27 | Most of the programs we will use are text-based and can be run directly in the shell; however, some graphical programs (e.g., Tablet, Artemis, fastqc) will be used as well. You can install these on your own local machine.
28 |
29 | Detailed software requirements will be listed at the beginning of each Task. Alternatively, if you wish to install all the programs for the course beforehand, please install the software listed below:
30 |
31 | * [tablet](https://ics.hutton.ac.uk/tablet/)
32 | * [artemis and act](http://sanger-pathogens.github.io/Artemis/Artemis/)
33 | * [bandage](http://rrwick.github.io/Bandage/)
34 | * [R](https://www.r-project.org/)
35 |
36 | We have also documented installation instructions for building a bioinformatics system (Ubuntu 18.04) [here](https://github.com/doxeylab/learn-genomics-in-unix/blob/master/VMbuild.README).
37 |
38 |
39 |
40 | # Contact
41 |
42 | If you have any questions, please contact acdoxey at uwaterloo dot ca.
43 |
44 | Enjoy
45 |
--------------------------------------------------------------------------------
/VMbuild.README:
--------------------------------------------------------------------------------
1 | ### BUILD INSTRUCTIONS FOR GOOGLE COMPUTE ENGINE GENOMICS SERVER
2 |
3 | # Set up a 16-core Ubuntu 18.04 VM with 500 Gb of SSD
4 |
5 | # Logged in via terminal (supplied my ssh public key to the gcloud meta ssh keys page in the console)
6 |
7 | # update package list
8 | sudo apt update
9 |
10 | # install bioinformatics software
11 | sudo apt install fastqc velvet abyss fasttree prodigal barrnap bcftools
12 |
13 | # comment out following line (/etc/java-X-openjdk/accessibility.properties) to prevent Java issues
14 | # assistive_technologies=org.GNOME.Accessibility.AtkWrapper
15 |
16 | # install fastx toolkit
17 | wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
18 | tar xvjf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
19 | cd bin
20 | sudo mv * /usr/bin/.
21 |
22 | # install artemis (optional)
23 | wget ftp://ftp.sanger.ac.uk/pub/resources/software/artemis/artemis.tar.gz
24 | tar zxf artemis.tar.gz
25 |
26 | # install prokka
27 | sudo apt-get install libdatetime-perl libxml-simple-perl libdigest-md5-perl git default-jre bioperl
28 | sudo apt install ncbi-tools-bin
29 | sudo cpan Bio::Perl
30 | sudo cpan Bio::SearchIO::hmmer3
31 | git clone https://github.com/tseemann/prokka.git $HOME/prokka
32 | $HOME/prokka/bin/prokka --setupdb
33 |
34 | # then copy the prokka folder to /usr/bin
35 |
36 | # install tablet (optional)
37 | wget https://ics.hutton.ac.uk/resources/tablet/installers/tablet_linux_x64_1_17_08_17.sh
38 | sh tablet_linux_x64_1_17_08_17.sh
39 |
40 | # install pip (for python packages)
41 | # note: this setup required Python 2.7
42 | # sudo apt install python-pip # OLD
43 | # pip install getopt pysqlw # OLD
44 | # on newer systems, install pip for Python 2.7 manually:
45 | # sudo apt install python2.7
46 | # wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
47 | # sudo python2.7 get-pip.py
48 | # python2.7 -m pip install pysqlw
49 |
50 |
51 | # installing uniprot2go.py (Doxey Lab script)
52 | wget https://github.com/doxeylab/learn-genomics-in-unix/raw/master/task3/uniprot2go.py
53 | chmod +x uniprot2go.py
54 | sudo mv uniprot2go.py /usr/bin
55 |
56 | # building the uniprot-to-go SQL database
57 | sudo apt install sqlite3
58 | #ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ - all GO data is here
59 | wget ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gpa.gz
60 | zcat goa_uniprot_all.gpa.gz | awk -F'\t' '{print $2","$4}' >uniprot-vs-go.csv
61 | sqlite3 uniprot-vs-go-db.sl3
62 | sqlite> create table unitogo (uniprotID text, goTerm text);
63 | sqlite> .mode csv
64 | sqlite> .import uniprot-vs-go.csv unitogo
65 | sqlite> CREATE INDEX unigoindex ON unitogo(uniprotID);
66 | sqlite> VACUUM;
67 | sqlite> .quit
68 | # now move uniprot-vs-go-db.sl3 to /data/uniprot2go/uniprot-vs-go-db.sl3
69 |
70 | # add paths to system-wide bashrc files
71 | # add lines to /etc/bash.bashrc
72 | export PATH="$PATH:/usr/bin/prokka/bin/"
73 | # export PATH="$PATH:/usr/bin/Tablet/"
74 | # export PATH="$PATH:/usr/bin/artemis/"
75 |
76 |
77 | # hiding users and home directories
78 |
79 | sudo chmod 751 /home
80 | sudo rm /usr/bin/who
81 | sudo rm /usr/bin/w
82 | sudo rm /usr/bin/users
83 |
84 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-hacker
--------------------------------------------------------------------------------
/questionbox.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/questionbox.png
--------------------------------------------------------------------------------
/task1/README.md:
--------------------------------------------------------------------------------
1 | # Task1 - Learning the Linux shell
2 |
3 | This task will introduce students to the Linux command-line (shell environment).
4 |
5 | ### Requirements
6 |
7 | * Access to a Linux-based OS running BASH
8 |
9 | ---
10 |
11 |
12 | ## What is the Linux Shell?
13 |
14 | The shell is a command-line interface and programming language for interacting with a [UNIX](https://en.wikipedia.org/wiki/Unix_shell)-like operating system.
15 |
16 | There are several different shell languages. In this course we will use a popular one called [BASH](https://en.wikipedia.org/wiki/Bash_(Unix_shell)).
17 |
18 | We will be learning basic commands, but BASH is actually a language that can perform complex programming tasks.
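
For a taste of BASH as a programming language, here is a minimal sketch (a loop plus a conditional) that you can paste directly into the shell:

```
# loop over three numbers and print a message for each one greater than 1
for n in 1 2 3; do
  if [ "$n" -gt 1 ]; then
    echo "$n is greater than 1"
  fi
done
```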
19 |
20 |
21 | ## Getting Started
22 |
23 | Before you can begin with the coding exercises, you must have access to a linux machine.
24 | You can either use your own local system or a remote server that has been set up for you.
25 |
26 | ### Accessing the remote server
27 |
28 |
29 | You can access the remote server through a terminal (full instructions on LEARN) like this:
30 |
31 | ```
32 | ssh yourUserName@genomics1.private.uwaterloo.ca
33 | ```
34 |
35 |
36 | When you are done, you can leave your session by typing...
37 |
38 | ```
39 | exit
40 | ```
41 |
42 | ## Once you are logged in: Learning the command-line
43 |
44 | If you have logged in correctly, you should see a welcome screen.
45 |
46 | You are now in a unix command-line environment. For more information and instructions:
47 |
48 | * [CodeAcademy](https://www.codecademy.com/learn/learn-the-command-line) - Learn the command line
49 | * [TeachingUnix](https://info-ee.surrey.ac.uk/Teaching/Unix/) - Another Unix Tutorial
50 | * [BasicCommands](http://mally.stanford.edu/~sr/computing/basic-unix.html) - List of common commands
51 |
52 | ## A Linux Shell primer
53 |
54 | ### Navigating files/folders
55 |
56 | When you log in, by default you start in your home directory.
57 |
58 | Type...
59 |
60 | ```
61 | pwd
62 | ```
63 |
64 | And this will print to the screen your current location (e.g., /home/username)
65 |
66 | You can always get back to your home folder by typing
67 |
68 | ```
69 | cd
70 | ```
71 |
72 | To view the contents of your current folder type
73 |
74 | ```
75 | ls
76 | ```
77 |
78 | To make the folder "task1", type
79 |
80 | ```
81 | mkdir task1
82 | ```
83 |
84 | Change directory into 'task1' folder
85 |
86 | ```
87 | cd task1
88 | ```
89 |
90 | And now create a file called file.txt
91 |
92 | ```
93 | >file.txt
94 | ```
95 |
96 | Open the file with `nano`. This is one built-in text editor. There are others.
97 |
98 | ```
99 | nano file.txt
100 | ```
101 |
102 | Now enter a few lines of text, then press Ctrl-O (followed by Enter) to save, and Ctrl-X to exit
103 |
104 | To print to the screen the contents of your file
105 |
106 | ```
107 | cat file.txt
108 | ```
109 |
110 | Other ways of viewing your file
111 |
112 | ```
113 | less file.txt #type q to exit
114 | more file.txt
115 | head file.txt
116 | head -n 10 file.txt # first 10 lines of your file
117 | tail file.txt
118 | tail -n 10 file.txt # last 10 lines of your file
119 | ```
120 |
121 | Size of your file
122 |
123 | ```
124 | du file.txt
125 | du -h file.txt # in human-readable units (K, M, G, etc.)
126 | ```
127 |
128 | To count the number of words and lines in your file
129 |
130 | ```
131 | wc file.txt # lines, words, and bytes
132 | wc -l file.txt # lines
133 | wc -m file.txt # characters
134 | ```
135 |
136 | To copy the file to a new file
137 |
138 | ```
139 | cp file.txt newfile.txt
140 | ```
141 |
142 | You can also 'move' a file to a new location or rename it using `mv`.
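
A small sketch of `mv` in action; the file and folder names below are hypothetical:

```
cp file.txt extra.txt    # make a spare copy to experiment with
mv extra.txt renamed.txt # rename it
mkdir backup
mv renamed.txt backup/   # move it into the 'backup' folder
```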
143 |
144 | To combine both files together into a third file
145 |
146 | ```
147 | cat file.txt newfile.txt > thirdfile.txt
148 | ```
149 |
150 | The '>' redirects the output of the command on its left to the file named on its right.
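
Relatedly, `>>` appends to a file instead of overwriting it; a quick sketch (notes.txt is a hypothetical file):

```
echo "first line" > notes.txt   # '>' creates or overwrites notes.txt
echo "second line" >> notes.txt # '>>' appends to it
cat notes.txt                   # prints both lines
```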
151 |
152 | Delete the file (note: be careful since there is no Trash Bin)
153 |
154 | ```
155 | rm file.txt
156 | ```
157 |
158 | Print contents of all .txt files in current folder. * acts as a wildcard
159 |
160 | ```
161 | cat *.txt
162 | ```
163 |
164 | Delete all .txt files in the current folder
165 |
166 | ```
167 | rm *.txt
168 | ```
169 |
170 | Delete all files in the current folder
171 |
172 | ```
173 | rm *
174 | ```
175 |
176 | Move back to the previous folder
177 |
178 | ```
179 | cd ..
180 | ```
181 |
182 | And delete the folder 'task1'
183 |
184 | ```
185 | rmdir task1
186 | ```
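
Note that `rmdir` only removes empty folders. If 'task1' had still contained files, you would need `rm -r` (treat it with the same caution as `rm`):

```
rm -r task1 # recursively deletes the folder and everything inside it
```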
187 |
188 | ### Getting help on linux commands and program usage
189 |
190 | For most commands, you can get more information on their usage by typing `man <command>`.
191 |
192 | e.g., try
193 |
194 | ```
195 | man ls
196 | #type 'q' to quit
197 | #Note: this line and the line above are interpreted as comments since they start with the "#" character. They will not be executed as a command.
198 | ```
199 |
200 | `man` works for some bioinformatics tools, but not all.
201 |
202 | e.g., for help on the `blastp` tool, type
203 |
204 | ```
205 | blastp -h # or
206 | blastp --help
207 | ```
208 |
209 | ### Additional operations
210 |
211 | #### Pattern finding with grep
212 |
213 | ```
214 | grep "word" file.txt # prints lines in file.txt containing "word"
215 |
216 | grep -o "word" file.txt # prints each occurrence of "word" on its own line
217 |
218 | grep -c "word" file.txt # counts the number of lines containing "word" in file.txt
219 |
220 | ```
221 |
222 | Note: Be careful when using `grep` to analyze files containing nucleic acid or protein sequence data.
223 |
224 | e.g., your FASTA file may be separated into multiple lines like this:
225 |
226 | ```
227 | >myFastaSequence
228 | ATCGACGTTATCGACTAGCTAT
229 | TCGGCGCGGTATTAGCGATTCG
230 | TAATATCGGCGCGATATATCGA
231 | ```
232 |
233 | instead of this:
234 |
235 | ```
236 | >myFastaSequence
237 | ATCGACGTTATCGACTAGCTATTCGGCGCGGTATTAGCGATTCGTAATATCGGCGCGATATATCGA
238 | ```
239 |
240 | Therefore, `grep` may miss some words that span multiple lines.
241 |
242 | Fortunately, there is a useful tool called `compseq` that can be used to examine the [k-mer](https://en.wikipedia.org/wiki/K-mer) composition of a FASTA file.
243 | It can be run like this:
244 |
245 | ```
246 | compseq file.fasta
247 |
248 | #or
249 |
250 | compseq -reverse file.fasta #also counts occurrences on the reverse complement of the sequence
251 |
252 | ```
253 |
254 |
255 |
256 | #### Piping commands
257 |
258 | We can also chain together multiple commands like this using the `|` (pipe) operator.
259 |
260 | ```
261 | grep "word" file.txt | wc -l # will count the number of lines containing the word "word"
262 |
263 | # or alternatively
264 |
265 | cat file.txt | grep "word" | wc -l # does the same thing as above
266 |
267 | # if we want to count ALL the occurrences of "word" in the file (allowing multiple per line), we can do
268 |
269 | grep -o "word" file.txt | wc -l
270 |
271 | ```
272 |
273 | #### Copying a file to and from a remote server
274 |
275 | To copy a file from your local machine to the server (run this on your local machine):
276 | ```
277 | scp /path/to/file.txt yourUserName@genomics1.private.uwaterloo.ca:/path/to/destination/
278 | ```
279 |
280 |
281 | To copy a file from the server to your local machine (run this on your local machine):
282 | ```
283 | scp yourUserName@genomics1.private.uwaterloo.ca:/path/to/file.txt /path/to/location/
284 | ```
285 |
286 |
287 |
291 | Remember, your current path can be found using `pwd`. A useful command for printing out the path to your file is:
292 |
293 | ```
294 | realpath file.txt
295 | ```
296 |
297 |
298 | #### Downloading a file off the internet
299 |
300 | ```
301 | wget <URL>   # e.g., wget https://example.com/file.txt
302 | ```
303 |
304 | #### File compression/uncompression
305 |
306 | This is done using programs such as `tar`, `gzip`, and `gunzip`.
307 | e.g.,
308 | ```
309 | gzip file.txt # to compress it
310 | gunzip file.txt.gz # to uncompress it
311 | ```
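
For whole folders, `tar` can bundle and gzip-compress in one step; a minimal sketch with hypothetical names:

```
tar -czf mydata.tar.gz mydata/ # create (-c) a gzipped (-z) archive file (-f) from the folder 'mydata'
tar -tzf mydata.tar.gz         # list the archive contents without extracting
tar -xzf mydata.tar.gz         # extract (-x) the archive again
```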
312 |
313 | ### More tips
314 |
315 | #### Use tab to autocomplete
316 |
317 | Use tab for autocompletion! This will speed up your command-line work dramatically.
318 | More here: [tab-autocomplete](https://www.howtogeek.com/195207/use-tab-completion-to-type-commands-faster-on-any-operating-system/)
319 |
320 | #### Use Ctrl-C to interrupt or end a process
321 |
322 | If you need to interrupt a command or process that you have started, press Ctrl-C.
323 |
324 | #### Other tips for becoming a linux power user
325 |
326 | [Linux Tips](https://www.howtogeek.com/110150/become-a-linux-terminal-power-user-with-these-8-tricks/)
327 |
328 |
329 | ---
330 |
331 |
332 | # ASSIGNMENT QUESTIONS
333 |
334 | PLEASE COMPLETE ASSIGNMENT 1 ON LEARN.
335 | This can be found under Quizzes.
336 |
337 | You will be asked to answer the following questions.
338 |
339 | Hint: remember to use `man` if you want to explore added functionality of commands. Also, the program `compseq` may be useful to answer some of these questions.
340 |
341 | Download and uncompress this file containing the genome sequence of E. coli H20: https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task1/e-coli-genome.fasta.gz
342 |
343 |  Q1 - What is the size of the uncompressed file in megabytes (round to one decimal place)?
344 |
345 |  Q2 - What is the header (first) line of the file?
346 |
347 |  Q3 - How many characters are in this file (header plus genome)?
348 |
349 |  Q4 - What are the last five bases in the genome?
350 |
351 |  Q5 - What is the length of the genome (# bases)?
352 |
353 |  Q6 - What base (A, C, G, or T) is most common in the file?
354 |
355 |  Q7 - What is the GC content of the genome?
356 |
357 |  Q8 - What is the most common trinucleotide in the file?
358 |
359 |  Q9 - How many times does the word "AATGAGAGG" occur in the genome sequence? Do not use compseq to answer this one.
360 |
361 |  Q10 - What is the answer to the above question if you also include matches on the reverse complement of the genome sequence? Again, do not use compseq to answer this one.
362 |
363 |
364 | #### Congratulations. You are now finished Task 1.
365 |
366 |
--------------------------------------------------------------------------------
/task1/e-coli-genome.fasta.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/e-coli-genome.fasta.gz
--------------------------------------------------------------------------------
/task1/e-coli-h20-genome.fasta.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/e-coli-h20-genome.fasta.gz
--------------------------------------------------------------------------------
/task1/e-coli-k12-genome.fasta.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/e-coli-k12-genome.fasta.gz
--------------------------------------------------------------------------------
/task1/gcloud-download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task1/gcloud-download.png
--------------------------------------------------------------------------------
/task1b/README.md:
--------------------------------------------------------------------------------
1 | # Task1b - Command-line BLAST
2 |
3 | In this task, you will learn how to run BLAST in the command-line. You will download a genome and proteome and search them for a gene and protein of interest, respectively.
4 |
5 | ### Requirements
6 |
7 | * Access to a Linux-based OS running BASH
8 | * [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
9 |
10 | ---
11 |
12 | ## Getting Started
13 |
14 | Log in to your Linux environment as you did in task1, either through the browser or via `ssh`.
15 |
16 | Create a new project folder for this task
17 |
18 | ```
19 | mkdir blastTask #creates folder
20 | cd blastTask #enters into folder
21 | ```
22 |
23 | ## Retrieving genomic data from the NCBI
24 |
25 | There are several ways to download data. Two common tools are `curl` and `wget`.
26 | You can also simply copy and paste sequence data into a file using `nano`, `pico`, or another command-line text editor; more advanced ones are `vim` and `emacs`.
27 |
28 | The following exercise will download a genome (DNA sequence data) and proteome (translated AA sequence data) from the NCBI.
29 | The NCBI houses its genomic data within an FTP directory - [here](https://tinyurl.com/cvh8n5ce)
30 |
31 | We will be working with the genome of [Prochlorococcus marinus](https://en.wikipedia.org/wiki/Prochlorococcus), which is an abundant marine microbe and possibly the most abundant bacterial genus on earth. First, explore its FTP directory [here](https://tinyurl.com/yc6xv4mh)
32 |
33 | Within this folder, there are a number of files. It is important that you familiarize yourself with these files and their contents.
34 | Files within a GenBank genomic FTP directory include:
35 |
36 | * ...genomic.fna.gz file -> this will uncompress into a .fna (fasta nucleic acid) file
37 | * ...protein.faa.gz -> .faa (fasta amino acid) file
38 |
39 | It should be clear what these two files contain.
40 |
41 | There is also another file called:
42 |
43 | * ...genomic.gff.gz -> .gff file (generic feature format)
44 |
45 |
46 | Download these files, uncompress them, and explore them (with `less` for example).
47 |
48 | ```
49 | wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/925/GCA_000007925.1_ASM792v1/GCA_000007925.1_ASM792v1_genomic.fna.gz
50 |
51 | wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/925/GCA_000007925.1_ASM792v1/GCA_000007925.1_ASM792v1_genomic.gff.gz
52 |
53 | wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/925/GCA_000007925.1_ASM792v1/GCA_000007925.1_ASM792v1_protein.faa.gz
54 |
55 | ```
56 |
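As in Task 1, the downloaded files can then be uncompressed with `gunzip`:

```
gunzip GCA_000007925.1_ASM792v1_genomic.fna.gz
gunzip GCA_000007925.1_ASM792v1_genomic.gff.gz
gunzip GCA_000007925.1_ASM792v1_protein.faa.gz

# or all at once:
# gunzip GCA_000007925.1_ASM792v1_*.gz
```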
57 |  Q1 - Examine the contents of the three files you have just downloaded. What information does each contain?
58 |
59 |
60 | ## Command-line BLAST
61 |
62 | ### Setting up your query sequence(s)
63 |
64 | Next, you are going to do a BLAST search against the genome (.fna) and proteome (.faa) that you have downloaded.
65 |
66 | In this case, the genome and proteome will be set up as BLAST DATABASES that you are searching against.
67 | The BLAST QUERY sequences (a gene and a protein) can be anything you like (examples below).
68 |
69 | * e.g. a query gene sequence: E. coli 16S ribosomal RNA - copy and paste this into a new text file
70 |
71 | ```
72 | >J01859.1 Escherichia coli 16S ribosomal RNA, complete sequence
73 | AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
74 | ```
75 |
76 | * e.g. query protein sequence:
77 |
78 | ```
79 | >sp|B7LA79|RL7_ECO55 50S ribosomal protein L7/L12 OS=Escherichia coli (strain 55989 / EAEC) OX=585055 GN=rplL PE=3 SV=1
80 | MSITKDQIIEAVAAMSVMDVVELISAMEEKFGVSAAAAVAVAAGPVEAAEEKTEFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
81 | ```
82 |
83 | Note: here is a quick way to download the above query protein sequence from Uniprot and rename it in one command
84 |
85 | ```
86 | curl https://rest.uniprot.org/uniprotkb/B7LA79.fasta > e.coli.l7.faa
87 | ```
88 |
89 |
90 | ### Formatting the genome and proteome for BLAST
91 |
92 | When doing a BLAST search, the query can be in FASTA format, but the database must first be formatted for BLAST. This is done with BLAST's `makeblastdb` command, which builds an 'indexed' database: sequences are chopped into their constituent k-mer fragments and stored in a form that facilitates rapid database searching.
93 |
94 | Let's set up a BLAST database for the proteome
95 |
96 | ```
97 | makeblastdb -in GCA_000007925.1_ASM792v1_protein.faa -dbtype 'prot'
98 | ```
99 |
100 | ... and now the genome as well
101 |
102 | ```
103 | makeblastdb -in GCA_000007925.1_ASM792v1_genomic.fna -dbtype 'nucl'
104 | ```
105 |
106 | You can see that the `-dbtype` parameter defines whether the input FASTA file is for protein or nucleotide sequences.
107 |
108 | `makeblastdb` and other BLAST tools have many additional parameters that can be customized to a user's needs.
109 | Let's explore some more.
110 |
111 | ```
112 | # to look at command usage and parameter options
113 | makeblastdb -help
114 | ```
115 |
116 | #### Advanced: retrieving specific entries and regions from your BLAST database
117 |
118 | One of the useful parameters here is the `-parse_seqids` flag. If this option is set, it becomes very easy to retrieve specific sequences from the database using their name or id. e.g.,
119 |
120 | ```
121 | makeblastdb -in GCA_000007925.1_ASM792v1_genomic.fna -dbtype 'nucl' -parse_seqids
122 | makeblastdb -in GCA_000007925.1_ASM792v1_protein.faa -dbtype 'prot' -parse_seqids
123 | ```
124 |
125 | And now if you want to print out to the screen the sequence of protein `AAP99047.1`, you can use the `blastdbcmd` program like this:
126 |
127 | ```
128 | blastdbcmd -entry AAP99047.1 -db GCA_000007925.1_ASM792v1_protein.faa
129 | ```
130 | This will output:
131 | >\>AAP99047.1 DNA polymerase III beta subunit [Prochlorococcus marinus subsp. marinus str. CCMP1375]
132 | MKLVCSQIELNTALQLVSRAVATRPSHPVLANVLLTADAGTGKLSLTGFDLNLGIQTSLSASIESSGAITVPSKLFGEII
133 | SKLSSESSITLSTDDSSEQVNLKSKSGNYQVRAMSADDFPDLPMVENGAFLKVNANSFAVSLKSTLFASSTDEAKQILTG
134 | VNLCFEGNSLKSAATDGHRLAVLDLQNVIASETNPEINNLSEKLEVTLPSRSLRELERFLSGCKSDSEISCFYDQGQFVF
135 | ISSGQIITTRTLDGNYPNYNQLIPDQFSNQLVLDKKYFIAALERIAVLAEQHNNVVKISTNKELQILNISADAQDLGSGS
136 | ESIPIKYDSEDIQIAFNSRYLLEGLKIIETNTILLKFNAPTTPAIFTPNDETNFVYLVMPVQIRS
137 |
138 | If you want a specific region (e.g., the first 10 amino acids) from this entry, you can use the `-range` parameter
139 |
140 | ```
141 | blastdbcmd -entry AAP99047.1 -db GCA_000007925.1_ASM792v1_protein.faa -range 1-10
142 | ```
143 | This will output:
144 | >\>AAP99047.1 DNA polymerase III beta subunit [Prochlorococcus marinus subsp. marinus str. CCMP1375]
145 | MKLVCSQIEL
146 |
147 |
148 | ### Performing a BLAST search
149 |
150 | There are several different flavors of BLAST. Each is run as a separate command:
151 |
152 | * `blastp` - protein query vs protein database
153 | * `blastn` - nucleotide query vs nucleotide database
154 | * `blastx` - nucleotide query (translated) vs protein database
155 | * `tblastn` - protein query vs nucleotide (translated) database
156 | * `tblastx` - nucleotide query (translated) vs nucleotide database (translated)
157 |
158 | To run a `blastp` search using the protein query (defined by `-query` parameter) and protein database (defined by `-db` parameter) you have set up, do the following:
159 |
160 | ```
161 | blastp -query e.coli.l7.faa -db GCA_000007925.1_ASM792v1_protein.faa
162 | ```
163 |
164 |  Q2 - How many significant (E < 0.001) hits did you get?
165 |
166 |  Q3 - What is the sequence identity (percentage) of your top BLAST hit?
167 |
168 | * Repeat the same BLAST search you did for Q2 but using the genomic sequence as the database.
169 |
170 |  Q4 - Compare your result to the previous search. Which of the following statements is most correct:
171 |
172 | ```
173 | * There were no significant BLAST matches.
174 | * BLAST detected the same protein as the top hit. However, the alignment was shorter.
175 | * BLAST detected the same protein as the top hit. However, the alignment was not significant.
176 | * BLAST detected a different protein as the top hit.
177 | ```
178 |
179 |
180 | * Suppose you have sequenced the following fragment of DNA:
181 |
182 | ```
183 | ACTGGCATTGATAGAACAACCATTTATTCGAGATAGTTCAATTACTGTAGAGCAAGTTGTAAAACA
184 | ```
185 |
186 |  Q5 - Search for this fragment of DNA in the genome of Prochlorococcus marinus subsp. marinus str. CCMP1375. What did you find?
187 |
188 | ```
189 | * I found an exact match to the sequence.
190 | * I did not find a good match to the sequence.
191 | * I found a good match to the sequence with 1 mutation.
192 | * I found a good match to the sequence with 2 mutations.
193 |
194 | ```
195 |
196 |  Q6 - What is the likely function of this fragment of DNA? (use any method that you like to answer this question).
197 |
198 | ```
199 | * It is impossible to say.
200 | * It is part of a gene encoding Translation elongation factor Ts.
201 | * It is a segment of a protein.
202 | * It is a non-coding sequence.
203 | ```
204 |
205 |
206 | # ASSIGNMENT QUESTIONS
207 |
208 | * Complete questions 1-6 above and submit your answers on LEARN.
209 |
210 |
211 | #### Congratulations. You are now finished Task 1b.
212 |
213 |
--------------------------------------------------------------------------------
/task2/README.md:
--------------------------------------------------------------------------------
1 | # Task 2 - Genome Assembly
2 |
3 | In this lab, you will download raw sequencing data, perform genome assembly, visualize and analyze your assemblies, and compare the assembled genome sequence to the database using BLAST.
4 |
5 | ### Requirements
6 |
7 | * Access to a linux-based OS running BASH
8 | * [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
9 | * [fastx toolkit](http://hannonlab.cshl.edu/fastx_toolkit/)
10 | * [velvet](https://github.com/dzerbino/velvet/blob/master/Manual.pdf)
11 | * [abyss](https://github.com/bcgsc/abyss)
12 | * [tablet](https://ics.hutton.ac.uk/tablet/) (download this graphical software onto your own machine)
13 | * [bandage](http://rrwick.github.io/Bandage/) (optional; download this graphical software onto your own machine)
14 |
15 |
16 | ## Installation
17 |
18 | Please install the software on your local machine. Once locally installed, you can download results off the linux server and locally visualize them on your own system.
19 |
21 | All of the software is available for Mac/Windows/Linux.
21 |
22 | ---
23 |
24 | ## Getting Started
25 |
26 | Login to your Linux environment as you did in task1, and create a new folder for your task2 work.
27 |
28 | ```
29 | mkdir task2 #creates folder
30 | cd task2 #enters into folder
31 | ```
32 |
33 | ## Retrieving the raw data
34 |
35 | Download the raw sequencing data from https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/mt_reads.fastq.gz into your folder and then uncompress it.
36 |
37 | Explore the file using `less`.
38 |
39 | Next, consider how you would return the sequence of the nth read using only `head` and `tail`, or `grep`. Note that each read consists of a fixed number of lines and that the first line of a read starts with a specific character.
40 |
41 |
42 |  Q1) What is the sequence of the third read in the file? Make sure to remove all spaces.
43 |
44 |
45 |  Q2) How many reads are in the file? Hint: the command `grep "^X"` reports all lines starting with the character `X`.
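One caveat worth knowing before you count: in FASTQ, the quality line can also begin with `@`, so a naive line count can over-count reads. A toy demonstration (file contents fabricated for illustration):

```shell
# second read's quality string happens to start with '@'
printf '@r1\nAAAA\n+\n@III\n@r2\nCCCC\n+\nIIII\n' > toy.fastq

# naive count over-counts
grep -c "^@" toy.fastq             # prints 3, but there are only 2 reads

# counting every 4th line (the header lines) is robust
awk 'NR % 4 == 1' toy.fastq | wc -l
```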
46 |
47 |
48 |
49 | ## Data preprocessing
50 |
51 | Before we can assemble a genome, we need to:
52 |
53 | 1) Assess the quality of the sequencing data
54 | 2) Demultiplex the data
55 | 3) Trim barcodes
56 | 4) Filter out low-quality reads (this is called quality filtering).
57 |
58 | ### Quality assessment
59 | For a quick quality report, you can use the program `fastqc`.
60 | The command below will analyze the mt_reads.fastq file and produce an .html results file.
61 |
62 | ```
63 | fastqc mt_reads.fastq
64 | ```
65 |
66 | Transfer the 'fastqc_report.html' file to your local machine and open it in a web browser. Tip: find the path to your file with `realpath yourFile.txt`
67 |
68 | Explore and inspect the FastQC report for mt_reads.fastq.
69 |
70 |
71 |  Q3) Which of the following statements is correct?
72 | * The reads passed all of the quality control measures
73 | * The reads failed all of the quality control measures
74 | * The reads passed some of the quality control measures but failed others
75 |
76 |
77 |  Q4) The per-base sequence quality is lowest at the \_\_\_\_\_ of the reads.
78 |
79 |  Q5) Most of the reads were assigned a quality (Phred) score of \_\_\_\_\_ .
80 |
81 |  Q6) Examine the per-base sequence content. The base composition is unusual/unexpected for position \_\_\_\_\_ to position \_\_\_\_\_ of the reads.
82 |
83 |  Q7) This unexpected composition may be due to the inclusion of \_\_\_\_\_ .
84 |
85 |
86 |
87 |
88 | ### Splitting the barcodes (demultiplexing)
89 |
90 | Sequencing data may be barcoded. In this case, the data contains two different samples, each with a unique barcode.
91 | This allows us to split the data by sample. Sometimes, sequencing data can have tens or hundreds of barcodes. See [multiplexing](https://www.illumina.com/science/technology/next-generation-sequencing/multiplex-sequencing.html).
92 |
93 | We will use a standard script from the `fastx toolkit` to split the data by its known barcodes (defined already for you in the file downloaded below).
94 |
95 | ```
96 | #first download the barcodes file
97 | wget https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/mt_barcodes.txt
98 |
99 | #now split
100 | fastx_barcode_splitter.pl
101 | ```
102 |
136 | To build the graph, `velvet` requires a value of k to define the k-mers (sequence fragments of length k) to be used in constructing the graph.
137 |
138 | Read more:
139 | [velvet](https://en.wikipedia.org/wiki/Velvet_assembler),
140 | [de bruijn graphs](https://en.wikipedia.org/wiki/De_Bruijn_graph),
141 | [de novo assemblers](https://en.wikipedia.org/wiki/De_novo_sequence_assemblers).
142 |
143 |
144 | The commands below will compute the graph. The first parameter is the folder name (you choose) and the second parameter is the value of k. So below, we are assembling the genome from the trimmed and quality-filtered reads using a k-mer value of 21.
145 |
146 | ```
147 | velveth out_21 21 -short -fastq qual_trim_mt1.fastq
148 | ```
149 |
150 | Next, to compute the actual contig sequences from the graph, run the following:
151 |
152 | ```
153 | velvetg out_21/ -scaffolding no -read_trkg yes -amos_file yes
154 | ```
155 |
156 |  Q10) How many nodes are there in the graph that was produced?
157 |
158 | Inspect the contigs.fa file that has been produced (will be in out_21 folder).
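A quick way to count the contigs in any FASTA file is to count the header lines (each starts with `>`). A toy example with made-up contigs:

```shell
printf '>c1\nACGT\n>c2\nGGCC\n' > toy_contigs.fa
grep -c "^>" toy_contigs.fa        # prints 2
```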
159 |
160 |
161 |  Q11) How many contigs do you get using k=21?
162 |
163 |  Q12) How many contigs do you get using k=31?
164 |
165 |
166 | ## Assembly visualization
167 |
168 | ### Assembly visualization with Tablet
169 |
170 | Velvet has the option of keeping track of where the reads map to the assembly using the `-read_trkg` flag. This will produce a `velvet_asm.afg` file.
171 |
172 | Transfer this file (from your k=31 assembly) to your local machine.
173 | Then open it in `tablet`. Tablet is a great program to explore how reads map to assemblies and genomes.
174 |
175 |
176 |  Q13) What is the average contig length?
177 |
178 |  Q14) What is the N50 value?
179 |
180 |  Q15) Examine the read coverage across the longest contig. Does the coverage distribution match that shown [here](https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task2/tablet-coverage-plot.png)?
181 |
182 |
183 | Explore `tablet` more on your own. We will be using it later in the course.
184 |
185 | ### Assembly visualization with Bandage (optional)
186 |
187 | Velvet and other de Bruijn assemblers produce a graph that can be visualized. `bandage` is an excellent tool for this purpose.
188 |
189 | If you are interested, locate the `LastGraph` file produced by `velvet` and transfer it to your local machine.
190 |
191 | Open this file in the `Bandage` application, and explore further.
192 |
193 |
194 | ### Generating an improved assembly with ABYSS
195 |
196 | As you can see from the results above, `velvet` (with the parameters we chose) did not yield a high-quality assembly. It is too fragmented.
197 |
198 | Often in genomics it is useful to try multiple parameters and different assemblers. Let's assemble this genome again with a different assembler: the popular [abyss](https://github.com/bcgsc/abyss) assembler, keeping k = 21. See [here](https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats#stats) for information on the statistics reported by an ABySS assembly.
199 |
200 |
201 | ```
202 | abyss-pe k=21 in='qual_trim_mt1.fastq' name=abyss-assembly
203 | ```
204 |
205 |  Q16) How long (# bases) is your assembly?
206 |
207 |  Q17) What is the N50 value (# bases)?
208 |
209 |  Q18) How many contigs did abyss generate?
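As a refresher on what N50 means, here is a pure-shell sketch over a hypothetical list of contig lengths: sort the lengths in descending order, then report the length at which the running sum first reaches half the total assembly size.

```shell
# hypothetical contig lengths, one per line (total = 300)
printf '80\n70\n50\n40\n30\n20\n10\n' | sort -rn | awk '
  { len[NR] = $1; total += $1 }
  END {
    for (i = 1; i <= NR; i++) {
      run += len[i]
      if (2 * run >= total) { print len[i]; exit }
    }
  }'
# prints 70, since 80 + 70 = 150 >= 300/2
```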
210 |
211 |
212 |
213 | ### What is the taxonomic source of your genome? Explore with BLAST
214 |
215 | You still do not know the source of this genome. Is it eukaryotic? bacterial? Is it a nuclear genome, a plasmid, or something else?
216 |
217 | To investigate this question, do a BLAST search using the online [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) tool. Use the Abyss assembly as your query.
218 |
219 |
220 |  Q19) Based on the BLAST result, what is the most likely source of this DNA sequence?
221 |
222 |
223 |
224 | # ASSIGNMENT QUESTIONS
225 |
226 |
227 | Please answer questions 1-19 above on LEARN under Quizzes.
228 |
229 |
230 | #
231 |
232 | Congratulations. You have now completed Task 2.
233 |
--------------------------------------------------------------------------------
/task2/mt_barcodes.txt:
--------------------------------------------------------------------------------
1 | #The barcode assignments in the fastq file: mt_reads.fastq
2 | mt1 ATCTACCA
3 | mt2 AACCATAA
4 |
--------------------------------------------------------------------------------
/task2/mt_reads.fastq.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task2/mt_reads.fastq.gz
--------------------------------------------------------------------------------
/task2/tablet-coverage-plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task2/tablet-coverage-plot.png
--------------------------------------------------------------------------------
/task3/16Ssearch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task3/16Ssearch.png
--------------------------------------------------------------------------------
/task3/README.md:
--------------------------------------------------------------------------------
1 | # Task 3 - Genome Annotation
2 |
3 | This task is a tutorial on genome annotation using `prokka` and other tools.
4 |
5 | You will learn how to perform basic genome annotation, and also how to extract specific regions of interest from your genome sequence.
6 |
7 | ### Requirements
8 |
9 | Graphical software you need to download onto your own machine indicated by (*)
10 |
11 | * Access to a linux-based OS running BASH
12 | * [BLAST](http://blast.ncbi.nlm.nih.gov/)
13 | * [Prokka](https://github.com/tseemann/prokka)
14 | * [Artemis](http://sanger-pathogens.github.io/Artemis/Artemis/) *
15 | Note: for Artemis, you may need to first install the JRE (Java Runtime Environment) and/or JDK (Java Development Kit) on your system. Please see [here](installingArtemis.md) for further instructions.
16 | * uniprot2go.py script located [here](https://github.com/doxeylab/learn-genomics-in-linux/blob/master/task3/uniprot2go.py)
17 | -- already installed in /usr/bin on the remote server.
18 |
19 | ## Installation
20 |
21 | Please install the graphical software on your local machine.
22 |
23 | ---
24 |
25 | ## Getting Started
26 |
27 | * Login to your linux environment as you did in task1.
28 |
29 | * Create a new folder for task3.
30 |
31 | ```
32 | mkdir task3 #creates folder
33 | cd task3 #enters into folder
34 | ```
35 |
36 | ## Retrieving the raw data
37 |
38 | * Copy the genome you assembled with `abyss` from task2.
39 |
40 | ```
41 | cp ../task2/abyss-assembly-contigs.fa .
42 | ```
43 |
44 | ## Exploring the assembly using `artemis`
45 |
46 |
47 | Now that we have generated a good quality assembly, let's explore the genome sequence itself and do some very basic annotation using `artemis`.
48 |
49 | Visualize the genome you have produced (using `abyss`) with the `artemis` application. Note: you will need to have `artemis` installed on your local machine.
50 | You will also need to download your contigs to your local machine. Open your contigs.fa file in `artemis`.
51 |
52 | - What are the black vertical lines that appear in the sequence window?
53 | - How do you produce a GC plot of the genome?
54 | - Why would a researcher create a GC plot?
55 | - How do you mark open reading frames (ORFs)?
56 |
57 |
58 |  Q1) How many ORFs are there of length >= 100 amino acids?
59 |
60 |  Q2) How many ORFs are there of length >= 50 amino acids?
61 |
62 | ## Annotating your genome from Task2 using `prokka`
63 |
64 | By marking the ORFs in your genome (given a min size threshold), you have essentially performed a simple gene finding algorithm. However, there are more advanced ways of gene-finding that take additional criteria into account.
65 |
66 | A popular genome annotation tool for prokaryotic genomes is [`prokka`](https://github.com/tseemann/prokka).
67 | `prokka` automates a series of genome annotation tools and is simple to run. It has been installed for you on the server.
68 |
69 | * Run `prokka` using the following command.
70 |
71 | ```
72 | prokka abyss-assembly-contigs.fa
73 | ```
74 | Note: This will generate a folder called PROKKA-XXXXXXXX where XXXXXXXX is the current date. It will be different for you than in the examples below.
75 |
76 | * Now, locate and download the .gbk file that was produced and view it in `artemis`.
77 |
78 |  Q3) You will notice that there are vertical black lines in the middle of predicted ORFs. What do these lines represent?
79 |
80 |  Q4) Re-start `artemis` and change your `artemis` 'Options' to better reflect the source of this genome. Which source did you choose (e.g., "Standard", "Vertebrate Mitochondrial", etc.)?
81 |
82 | When `prokka` is run without any parameters, it selects 'bacteria' as the default taxonomy.
83 |
84 | Look at the `--kingdom` options in `prokka -h` and re-run `prokka` to use the correct annotation mode. You will also need to use `--outdir` to specify a folder for your new results.
85 |
86 | * Again, open your .gbk file in `artemis`.
87 |
88 |  Q5) Has anything changed in this genome annotation? Examine the CDSs, tRNAs, and rRNAs, and their annotations.
89 |
90 |
91 | ## Annotation of an E. coli genome using `prokka`
92 |
93 | Next, let's perform genome annotation on a larger scale.
94 |
95 | * Download (or copy from your task1 folder) the E. coli H20 genome below from task1 and annotate it using `prokka`
96 |
97 | ```
98 | https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task1/e-coli-h20-genome.fasta.gz
99 | ```
100 |
101 | Next, explore the files produced by `prokka`. Start with the .txt file.
102 |
103 |  Q6) How many genes, rRNAs, and tRNAs were predicted? What is the size of the genome in Mb?
104 |
105 | `prokka` also annotates genes based on [COGs](https://www.ncbi.nlm.nih.gov/COG/) and also [E.C.](https://enzyme.expasy.org/) (enzyme commission) numbers. This information can be found in the .tbl and .tsv files.
106 |
107 | Column 6 of this .tsv file lists the COGs. To print out only column 6, you can use the `cut` command as follows (replace "yourPROKKAoutput"):
108 |
109 | ```
110 | cut -f6 yourPROKKAoutput.tsv
111 | ```
112 |
113 | Using commands such as `cut`, `sort`, `grep`, `uniq`, and `wc`, answer the following two questions (Q7 and Q8).
114 |
115 | e.g., the line below will count the number of unique entries in column 3 of file.txt
116 |
117 | ```
118 | cut -f3 file.txt | sort | uniq | wc -l
119 | ```
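One wrinkle: the COG and E.C. columns are empty for many genes, and blank lines would distort your counts. `grep -v "^$"` drops empty lines first. A toy example (fabricated column contents):

```shell
printf 'COG0001\n\nCOG0002\nCOG0001\n\n' > toy_col.txt
grep -v "^$" toy_col.txt | sort | uniq | wc -l    # prints 2 unique non-empty entries
```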
120 |
121 |
122 |  Q7) How many genes were annotated with COGs?
123 |
124 |
125 |  Q8) How many unique enzymatic activities (E.C. numbers) were assigned to the E. coli genome? Note: `1.-.-.-` and `1.1.1.17` would count as two separate E.C. numbers.
126 |
127 |
128 |
129 | ## Assigning GO terms
130 |
131 | Next, we will be assigning Gene Ontology ([GO](http://geneontology.org/)) terms to your predicted genes/proteins from the E. coli H20 genome.
132 |
133 | `prokka` identifies homologs of your proteins within the UniProtKB database. Since there are already pre-computed GO terms for all proteins in UniProtKB, we can map these GO terms over using the following commands:
134 |
135 | ```
136 | #extract the predicted proteins that have been mapped to entries in UniProt
137 | cat yourPROKKAoutput.gff | grep -o "UniProtKB.*;" | awk -F'[:;=]' '{print $4" "$2}' >uniProts.txt
138 |
139 | #assign GO annotations from a uniprot-GO database table
140 | python2.7 /usr/bin/uniprot2go.py -i uniProts.txt -d /fsys1/data/uniprot2go/uniprot-vs-go-db.sl3 >go.annotations
141 | ```
142 |
143 | This will generate a `go.annotations` file, which contains your predicted functional annotations.
144 |
145 | This one-liner will extract column 3 (GO terms), and list the top 20 according to their frequency in your proteome.
146 |
147 | ```
148 | cat go.annotations | awk '{print $3}' | tr "," "\n" | sort | uniq -c | sort -n -r | head -20
149 | ```
150 |
151 | Now, there is a lot you can explore using your predicted GO terms for your genome.
152 | e.g., Suppose you want to find all the predicted DNA binding proteins. Look [here](http://amigo.geneontology.org/amigo) to find the GO accession ID for "DNA binding".
153 |
154 |  Q9) How many proteins were annotated with the GO term for "DNA binding"?
155 |
156 | ## After annotation: Extracting genes and regions of interest
157 |
158 | Once your genome has been assembled and annotated, you may be interested in identifying and extracting specific genes or regions of interest.
159 |
160 | ### Extracting genes of interest
161 | For example, suppose you are interested in the "trpE" gene from E. coli. You can see whether this gene exists in the predictions like this:
162 |
163 | ```
164 | grep "trpE" yourPROKKAoutput.tsv
165 | ```
166 |
167 | This will output
168 | >CGLDHGDC_02664 CDS 1563 trpE 4.1.3.27 COG0147 Anthranilate synthase component 1
169 |
170 | ... which tells you that "trpE" has been assigned to the gene labeled "CGLDHGDC_02664".
171 |
172 | You can then extract this gene sequence from the gene predictions file (.ffn) like this:
173 |
174 | ```
175 | # index the .ffn file so we can extract from it
176 | makeblastdb -in yourPROKKAoutput.ffn -dbtype 'nucl' -parse_seqids
177 |
178 | blastdbcmd -entry CGLDHGDC_02664 -db yourPROKKAoutput.ffn
179 | ```
180 |
181 | This will output:
182 |
183 | >\>CGLDHGDC_02664 Anthranilate synthase component 1
184 | ATGCAAACACAAAAACCGACTCTCGAACTGCTAACCTGCGAAGGCGCTTATCGCGACAATCCCACCGCGCTTTTTCACCA
185 | GTTGTGTGGGGATCGTCCGGCAACGCTGCTGCTGGAATCCGCAGATATCGACAGCAAAGATGATTTAAAAAGCCTGCTGC
186 | TGGTAGACAGTGCGCTGCGCATTACAGCTTTAGGTGACACTGTCACAATCCAGGCACTTTCCGGCAACGGCGAAGCCCTG
187 | CTGACACTACTGGATAACGCCCTGCCTGCGGGTGTGGAAAATGAACAATTACCAAACTGCCGTGTGCTGCGCTTCCCCCC
188 | ...
189 |
190 | Note: in this example we searched the annotations with the text query "trpE". However, the best way to find your gene of interest is a BLAST search, since the gene may not be labeled correctly in your annotations.
191 |
192 | ### Extracting regions of interest
193 |
194 | Next, suppose you are interested in extracting the promoter of the "trp operon". See [here](https://en.wikipedia.org/wiki/Trp_operon) for some background information. The trp operon regulatory sequences (operator and promoter) can be found upstream of the trpE gene. These regions are not in the annotations files so you will need to locate them yourself.
195 |
196 | First, let's see where the trpE gene is located in the genome:
197 |
198 | ```
199 | grep "trpE" yourPROKKAoutput.gff
200 | ```
201 |
202 | This will output:
203 |
204 | >CP069692.1 Prodigal:002006 CDS 2777877 2779439 . + 0 ID=CGLDHGDC_02664;eC_number=4.1.3.27;Name=trpE;db_xref=COG:COG0147;gene=trpE;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P00895;locus_tag=CGLDHGDC_02664;product=Anthranilate synthase component 1
205 |
206 | This tells us that trpE is located in entry "CP069692.1" at chromosome position "2777877 to 2779439" and encoded on the plus (+) strand.
207 |
208 | To extract the sequence for these coordinates, we can use `blastdbcmd` against the genome as follows:
209 |
210 | ```
211 | # index the genome so we can extract regions from it
212 | makeblastdb -in yourPROKKAoutput.fna -dbtype 'nucl' -parse_seqids
213 |
214 | blastdbcmd -entry CP069692.1 -db yourPROKKAoutput.fna -range 2777877-2779439 -strand plus
215 | ```
216 |
217 | This should produce a FASTA sequence output of the gene identical to that in the above example.
218 |
219 | But you are not interested in the gene sequence; you actually want the upstream regulatory region. Suppose you want to identify the 30-nucleotide long region upstream (before but not including the start codon) of the trpE coding sequence. By modifying the code above, answer the following question.
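A hint on the coordinate arithmetic, using a hypothetical plus-strand gene starting at position 1000 (not the real trpE coordinates): the 30 nucleotides immediately upstream span positions start−30 through start−1.

```shell
START=1000    # hypothetical CDS start position on the plus strand
echo "$((START - 30))-$((START - 1))"    # prints 970-999, the range for blastdbcmd -range
```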
220 |
221 |  Q10) What is the 30-nucleotide long sequence immediately upstream of the TrpE coding sequence?
222 |
223 |
224 | ### Extracting the rRNAs predicted by barrnap
225 |
226 | Sometimes you may be interested in extracting multiple genes or regions at once. E.g., suppose you want to extract all of the regions corresponding to predicted 16S rRNA sequences. In `prokka`, rRNA genes are predicted for you using the `barrnap` tool.
227 |
228 | Here is a two-liner to extract the 16S rRNAs predicted by `barrnap`.
229 |
230 | ```
231 | cat yourPROKKAoutput.gff | grep "barrnap" | awk '{ if ($7 == "-") {print $1" "$4"-"$5" minus"} else {print $1" "$4"-"$5" plus"} }' > rRNAs.txt
232 | blastdbcmd -db yourPROKKAoutput.fna -entry_batch rRNAs.txt > rRNAs.fa
233 | ```
234 |
235 | Now, to predict taxonomy, we can BLAST these rRNA sequences against the NCBI nucleotide database, for example using [web-BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn). Note that there may be multiple rRNAs and some of them may be partial sequences.
236 |
237 | 
238 |
239 |
240 | ## Analyzing a mystery genome of unknown source
241 |
242 | And now for something a little more difficult.
243 |
244 | * Download this "mystery" genome of unknown source.
245 |
246 | https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task3/mysteryGenome.fna.gz
247 |
248 |  Q11) Based on 16S rRNA sequences, what is the taxonomic origin of this genome (genus and species)?
249 | e.g., "Escherichia coli"
250 |
251 |
252 | ---
253 |
254 | # ASSIGNMENT QUESTIONS
255 |
256 |
257 | Please answer questions 1-11 above on LEARN under Quizzes.
258 |
259 |
260 | #
261 |
262 | Congratulations. You have now completed Task 3.
263 |
264 |
265 |
--------------------------------------------------------------------------------
/task3/installingArtemis.md:
--------------------------------------------------------------------------------
1 | # Installing Artemis/Java
2 |
3 | Artemis Installation Instructions
4 |
5 | ## Part 1 - Install the Java Development Kit (JDK):
6 | 1. Go to the [Temurin by Adoptium website](https://adoptium.net/en-GB/temurin/releases/?version=11&os=windows&arch=x64&package=jdk) and select your operating system version (most modern systems are 64-bit/x64).
7 | 2. Select the "**JDK 11-LTS**" version from the dropdown menu.
8 | - For Windows: Download the **.msi** file and run it once downloaded.
9 | - For MacOS: Download the **.pkg** file and run it once downloaded.
10 |
11 | ## Part 2 - Install Artemis Tools:
12 | 1. Go to the [Sanger Pathogens website](https://sanger-pathogens.github.io/Artemis/) and scroll down the page until you reach the **"Software Availability"** section.
13 | 2. Under the **"Download"** heading, select your operating system version from the list and download the file.
14 | - For Windows: Download the **.zip** file.
15 | - For MacOS: Download the **.dmg** file.
16 | 3. Unzip the file to an appropriate local directory and an Artemis folder will be created containing the tools.
17 | - For Windows: Recommended to unzip the file to the **"C:\"** drive.
18 | - For MacOS: Recommended to unzip the file to the "**Applications**" folder.
19 | The first time you run one of the tools, it will ask you to "Set Working Directory". Set this to the directory containing the data files you plan to open.
20 |
21 | #### Helpful resource:
22 | [The Artemis Manual](https://sanger-pathogens.github.io/Artemis/Artemis/artemis-manual.html)
23 |
24 | ## Additional Steps For MacOS:
25 | You may run into the error message "This application requires that Java 9 or later be installed on your computer. Please download and install the latest version of Java from www.java.com and try again." when opening any of the tools. If so, please follow the instructions below to open Artemis.
26 | 1. Open the "**Artemis**" folder.
27 | 2. Right click on the "**Artemis**" icon and select "Show Package Content".
28 | 3. Double-click on the "**Contents**" folder.
29 | 4. Double-click on the "art" executable file (it has the black terminal icon).
30 | 5. A new window will open and you will be prompted to set the working directory.
31 | 6. Once confirmed it is working, you can follow the additional steps below to create a shortcut for each of the Artemis tools.
32 | - Go through steps 1-3 again until the "**Contents**" folder is open.
33 | - Right-click on the "art" executable file and from the dropdown menu, click "**Make Alias**".
34 | - A new shortcut called "**art alias**" will be made in the folder. You can drag this file to your Desktop for example for easy access.
35 | - From the Desktop (or wherever you put the file), double-click on "**art alias**" and the program should open without errors.
36 | - You can repeat these steps for ACT, BamView, and Circular-Plot.
37 |
38 |
--------------------------------------------------------------------------------
/task3/mysteryGenome.fna.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task3/mysteryGenome.fna.gz
--------------------------------------------------------------------------------
/task3/ntsearch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task3/ntsearch.png
--------------------------------------------------------------------------------
/task3/uniprot2go.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import pysqlw
4 | import sys, getopt
5 |
6 | sqlfile = ''
7 | uniprotlist = ''
8 |
9 | try:
10 |     opts, args = getopt.getopt(sys.argv[1:],"hi:d:",["uniprotlist=","sqlfile="])
11 | except getopt.GetoptError:
12 |     print 'uniprot2go.py -i <uniprotlist> -d <sqlfile>'
13 |     sys.exit(2)
14 | if len(sys.argv) <= 1:
15 |     print('uniprot2go.py -i <uniprotlist> -d <sqlfile>')
16 |     sys.exit()
17 | for opt, arg in opts:
18 |     if opt == '-h':
19 |         print 'uniprot2go.py -i <uniprotlist> -d <sqlfile>'
20 |         sys.exit()
21 |     elif opt in ("-i", "--uniprotlist"):
22 |         uniprotlist = arg
23 |     elif opt in ("-d", "--sqlfile"):
24 |         sqlfile = arg
25 |
26 |
27 | #print 'DB = ', sqlfile
28 | #print 'UNIPROTLIST = ', uniprotlist
29 |
30 |
31 | p = pysqlw.pysqlw(db_type="sqlite", db_path=sqlfile)
32 |
33 | filepath = uniprotlist
34 |
35 | with open(filepath) as fp:
36 |     line = fp.readline() #first line must be a header line
37 |     while line:
38 |         line = fp.readline()
39 |         line = line.strip()
40 |         splitline = line.split()
41 |         if splitline:
42 |             uni = splitline[1]
43 |             # print(line)
44 |             # print(uni)
45 |             rows = p.where('uniprotid',uni).get('unitogo')
46 |             goTerms = []
47 |             for b in rows:
48 |                 #print(b['goTerm'])
49 |                 goTerms.append(b['goTerm'])
50 |             goTerms = set(goTerms)
51 |             #goTerms.reverse()
52 |             print line, ','.join(goTerms)
53 |
54 | p.close();
55 |
--------------------------------------------------------------------------------
/task4/README.md:
--------------------------------------------------------------------------------
1 | # Task 4 - Synteny comparison of genomes
2 |
3 | This task is a tutorial on structural comparison of genomes using synteny mapping.
4 |
5 | ### Requirements
6 |
7 | * Access to a linux-based OS running BASH
8 | * [BLAST](http://blast.ncbi.nlm.nih.gov/)
9 | * [Artemis](http://sanger-pathogens.github.io/Artemis/Artemis/) * download this graphical software onto your own machine. Again, see [here](installingArtemis.md) for further instructions
10 | * [Mauve](http://darlinglab.org/mauve/download.html) (optional) * download this graphical software onto your own machine
11 |
12 | ## Installation
13 |
14 | Please install the graphical software on your local machine.
15 |
16 | All software used is available for Mac/Windows/Linux.
17 |
18 | ---
19 |
20 | ## Getting Started
21 |
22 | * Login to your linux environment and create a new folder for task4.
23 |
24 | ```
25 | mkdir task4 #creates folder
26 | cd task4 #enters into folder
27 | ```
28 |
29 | ## Retrieving the raw data
30 |
31 | * Copy the genome from task2 you assembled with `abyss`
32 |
33 | ```
34 | cp ../task2/abyss-assembly-contigs.fa .
35 | ```
36 |
37 | * You will be comparing this genome to another related genome from L. terrestris. Download this genome.
38 |
39 | ```
40 | wget https://github.com/doxeylab/learn-genomics-in-linux/raw/master/task4/l-terrestris.genome.fa
41 | ```
42 |
43 | * Make BLAST databases for both.
44 |
45 | ```
46 | makeblastdb -in abyss-assembly-contigs.fa -dbtype nucl
47 | makeblastdb -in l-terrestris.genome.fa -dbtype nucl
48 | ```
49 |
50 | Now BLAST one genome against the other with the following command. Note that you are using BLAST's `-outfmt 6` parameter, which outputs the BLAST result as a table (written here to `blastresults.tab`). You will be using this table to visualize the synteny between these two genomes.
51 |
52 | ```
53 | blastn -outfmt 6 -db abyss-assembly-contigs.fa -query l-terrestris.genome.fa >blastresults.tab
54 | ```
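The 12 tab-separated columns of `-outfmt 6` are: query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, and bit score. A fabricated one-line example shows how to pull a column out with `cut`:

```shell
# one made-up -outfmt 6 line (not real results)
printf 'q1\ts1\t98.50\t500\t5\t1\t1\t500\t100\t599\t1e-100\t900\n' > toy.tab
cut -f3 toy.tab    # prints the percent identity, 98.50
```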
55 |
56 | Now, download to your local machine the following files:
57 |
58 | * abyss-assembly-contigs.fa
59 | * l-terrestris.genome.fa
60 | * blastresults.tab
61 |
62 | Open the `act` program that is packaged with `artemis` and input these three files.
63 |
64 |  Q1) Paste a screenshot of your result. (3 marks)
65 |
66 |  Q2) Describe the synteny pattern that you are observing. Do you think genomic rearrangements have taken place or is there a strong pattern of shared synteny between both genomes? (2 marks) See [shared synteny](https://en.wikipedia.org/wiki/Synteny#Shared_synteny).
67 |
68 | To help you with this question, consider two genome sequences composed of four genes A-D. One genome has gene order A,B,C,D and the second genome has gene order A,C,B,D. There has clearly been a genomic rearrangement here because C and B have switched places.
69 |
70 | But now suppose the genomes are (A,B,C,D) and (C,D,A,B). If these are linear chromosomes, then a rearrangement has taken place, but what if they are circular?
71 |
72 | And lastly, suppose we compare (A,B,C,D) to its reverse complement, which will appear in the order (D,C,B,A). This may look like an inversion in `artemis`, but one of the two strands just needs to be flipped so that we are comparing the genomes in the same orientation.
73 |
74 | ---
75 |
76 | ## Working with your own dataset
77 |
78 | Next, find two related genomes (e.g., different strains of same species) from the [NCBI Genome Database](https://www.ncbi.nlm.nih.gov/datasets/genome/).
79 |
80 | * Repeat the analyses above to perform a structural genome comparison.
81 |
82 |  Q3) Paste a screenshot of your result. (3 marks)
83 |
84 |  Q4) Describe the synteny patterns that you are observing. (2 marks)
85 |
86 |
87 | ## Multiple genome alignment with Mauve -- Bonus (+1)
88 |
89 | This is for bonus marks.
90 |
91 | Want to try aligning/comparing more than two genomes?
92 |
93 | * Download/install [Mauve](http://darlinglab.org/mauve/download.html) to your local machine.
94 |
95 | * Select three or more genomes of interest.
96 |
97 | * Open the sequences in `Mauve` and align them.
98 |
99 | * Visualize the multiple alignment.
100 |
101 |  Bonus) Paste a screenshot of your result.
102 |
103 |
104 | ---
105 |
106 |
107 | # ASSIGNMENT QUESTIONS
108 |
109 | The questions for this task are indicated by the lines starting with  above.
110 | Please submit the code you used (when required) as well as the answers to the questions. Submit your assignment to a dropbox on LEARN as a .docx, .txt, or .pdf file.
111 |
112 |
--------------------------------------------------------------------------------
/task4/act.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task4/act.png
--------------------------------------------------------------------------------
/task4/installingArtemis.md:
--------------------------------------------------------------------------------
1 | # Installing Artemis/Java
2 |
3 | Artemis Installation Instructions
4 |
5 | ## Part 1 - Install the Java Development Kit (JDK):
6 | 1. Go to the [Temurin by Adoptium website](https://adoptium.net/en-GB/temurin/releases/?version=11&os=windows&arch=x64&package=jdk) and select your operating system version (most modern systems are 64-bit/x64).
7 | 2. Select the "**JDK 11-LTS**" version from the dropdown menu.
8 | - For Windows: Download the **.msi** file and run it once downloaded.
9 | - For MacOS: Download the **.pkg** file and run it once downloaded.
10 |
11 | ## Part 2 - Install Artemis Tools:
12 | 1. Go to the [Sanger Pathogens website](https://sanger-pathogens.github.io/Artemis/) and scroll down the page until you reach the **"Software Availability"** section.
13 | 2. Under the **"Download"** heading, select your operating system version from the list and download the file.
14 | - For Windows: Download the **.zip** file.
15 | - For MacOS: Download the **.dmg** file.
16 | 3. Unzip the file to an appropriate local directory and an Artemis folder will be created containing the tools.
17 | - For Windows: Recommended to unzip the file to the **"C:\"** drive.
18 | - For MacOS: Recommended to unzip the file to the "**Applications**" folder.
19 | The first time you run one of the tools, it will ask you to "Set Working Directory". Ask your instructor what to set this to; the working directory should be the one that contains the data files you will be analyzing.
20 |
21 | #### Helpful resource:
22 | [The Artemis Manual](https://sanger-pathogens.github.io/Artemis/Artemis/artemis-manual.html)
23 |
24 | ## Additional Steps For MacOS:
25 | You may run into the error message "This application requires that Java 9 or later be installed on your computer. Please download and install the latest version of Java from www.java.com and try again." when opening any of the tools. If so, please follow the instructions below to open Artemis.
26 | 1. Open the "**Artemis**" folder.
27 | 2. Right click on the "**Artemis**" icon and select "Show Package Content".
28 | 3. Double-click on the "**Contents**" folder.
29 | 4. Double-click on the "art" executable file (it has the black terminal icon).
30 | 5. A new window will open and you will be prompted to set the working directory.
31 | 6. Once confirmed it is working, you can follow the additional steps below to create a shortcut for each of the Artemis tools.
32 | - Go through steps 1-3 again until the "**Contents**" folder is open.
33 | - Right-click on the "art" executable file and from the dropdown menu, click "**Make Alias**".
34 | - A new shortcut called "**art alias**" will be made in the folder. You can drag this file to your Desktop, for example, for easy access.
35 | - From the Desktop (or wherever you put the file), double-click on "**art alias**" and the program should open without errors.
36 | - You can repeat these steps for ACT, BamView, and Circular-Plot.
37 |
38 |
--------------------------------------------------------------------------------
/task4/l-terrestris.genome.fa:
--------------------------------------------------------------------------------
1 | >U24570.1 Lumbricus terrestris mitochondrion, complete genome
2 | ATGCGATGATTCTACTCAACTAATCACAAAGATATTGGAACTTTATACTTCATTCTTGGGGTATGGGCTG
3 | GCATGGTGGGAGCCGGAATAAGACTTCTTATCCGTATTGAGCTAAGACAACCTGGTGCATTCCTAGGAAG
4 | TGACCAATTATACAATACAATCGTTACTGCGCACNNNTTTGTTATAATTTTCTTCCTAGTGATACCAGTC
5 | TTCATTGGCGGGTTTGGGAACTGACTTCTTCCCCTAATACTGGGCGCTCCTGATATAGCATTCCCACGCC
6 | TTAATAACATAAGATTTTGACTTCTACCCCCCTCTCTTATTCTCCTAGTTTCCTCAGCTGCCGTAGAGAA
7 | GGGAGCCGGAACAGGCTGAACAGTGTACCCCCCTCTTGCCAGAAATCTCGCCCATGCTGGGCCATCTGTA
8 | GATTTAGCTATTTTTTCCCTTCATTTAGCAGGTGCGTCATCTATTCTAGGGGCTATTAATTTTATTACCA
9 | CTGTAATCAACATACGCTGAAGTGGGTTACGACTAGAACGAATCCCTCTGTTTGTCTGAGCTGTATTAAT
10 | TACAGTAGTTCTCCTCCTCCTATCCCTTCCTGTACTTGCCGGAGCAATCACAATACTCCTAACAGATCGA
11 | AATCTTAATACCTCATTTTTCGACCCCGCTGGTGGAGGGGATCCAATTTTATATCAACACCTTTTCTGAT
12 | TCTTTGGTCACCCAGAAGTATATATTCTTATTCTTCCTGGGTTTGGGGCCATTTCCCACATTGTTAGACA
13 | CTATACAGCTAAACTTGAGCCATTTGGAGCCTTAGGGATAATTTATGCAATACTAGGAATCGCAGTTTTA
14 | GGATTTATTGTCTGAGCACACCACATATTTACCGTTGGCTTAGATGTGGACACCCGGGCATATTTCACAG
15 | CAGTAACCATAATTATCGCAGTACCGACAGGTATTAAAGTATTTAGTTGATTAGCTACCATTCACGGGTC
16 | AAAGATCAAATATGAAACACCAGTGTTATGGGCCTTAGGATTTATCTTTTTATTTACAACGGGAGGTTTA
17 | ACTGGAATTATTTTATCTAACTCCTCCCTAGATATTATTCTTCATGACACATACTATGTAGTAGCACACT
18 | TCCACTACGTGTTGAGAATGGGCGCCGTATTTGCAATCTTTGCTGCCTTTACTCATTGATTCCCCCTACT
19 | AACAGGGCTTACCCTACACCACCGATGAGCCAATGCACAATTCTTCCTCATATTCTTAGGGGTAAACACT
20 | ACATTCTTTCCCCAACACTTCCTAGGATTGAGCGGTATACCTCGGCGATATTCTGACTACCCTGATGCTT
21 | TTATAAAATGAAACGTCGTGTCATCATTTGGGTCTCTCCTGTCCTTTGTAGCATTAATACTGTTTATTTT
22 | TATTTTATGGGAAGCCTTCGCCTCACAGCGAAGAGTAATCTCAAGGCCCCACATATCATCAGCTCTTGAA
23 | TGGTCTGACCCTATTCTACCTCTAGATTTTCATAATTTAAGGGAGACCGGAATCATTACATACCCTAAAT
24 | TAAGTGAATGCCAAAGGCACCTATTTGTTAATTAGGCCCGTGCTATTTATAGCTCACTTAGAATGCCAAA
25 | CTGAGGTCAAGTAATATTTCAAGACGCCGCATCATCTGTCATGCTCCAACTAGTATCCTTTCACGACCAT
26 | GCTTTATTAGTCCTAACTCTAGTCCTAACAGTGGTCGGCTATGCTCTCCTAGCACTCATATTAAACAAAC
27 | AAGTAAACCGTTACATTATAGAAGCTCAAACAGTAGAAACAATCTGAACTATTTTACCAGCTCTTATTCT
28 | CCTAGTTCTAGCCCTACCATCTTTACGCATTCTTTATATTACAGACGAGGTGAGACAACCATCTATTACT
29 | GTGAAGACTATTGGGCATCAATGATATTGAAGATACGAATATACTGATTTCTTAAATGTAGAAATAGATT
30 | CATATATGCTACCAACCTCAGACCTACTACCGGGGGACTATCGACTCCTAGAAGTAGATAATCGTATGGT
31 | AGTACCTATACAATTAGAAATTCGAATACTAATCACTGCTGCAGATGTGATTCACTCATGAACAGTTCCA
32 | GCTCTCGGGGTAAAAGTGGATGCCGTGCCTGGACGTCTAAACCAAATTGGATTTACAACTACACAACCAG
33 | GGGTATTTTATGGTCAGTGCTCAGAAATCTGCGGTGCTAATCACTCATTTATGCCAATTGCAGTGGAAGC
34 | TATTAACACTAAATCCTTCATAAGATGAGTCTCCAATTTTAAACCTTAGAAATACTAGTTAATCTATAAC
35 | AATGCCTTGTCAAGACATAATTACCTCTGGTGTATTTCTATGCCTCACCTATCTCCCATAAGATGAATTA
36 | CTTCTATATTAATATTCTGAATTTCCGTATCAATTCTTTTTTCCACCCTATGATGATCCAACAATTATTT
37 | ATTCAGTTCAAAAATAACTAATTGCGCCCCCAAATCCCTTACACCTTGAAATTGATTATTACGAGATGGT
38 | CGAGATTAAACATTAAACTGTAAATTTAATAACGGGCTACCACCCTCTTGTATGTTTTTAGTATATTTTG
39 | TACATTTACCTTCCAAGTAAAAAGATTGTTAGTAAAAAACATAAATGATTCGTCAACCCTTCCACCTCGT
40 | TGAGTACAGCCCATGGCCTCTAACCTCATCTATTGGAGCTTTTACCCTAGCTATCGGATTAGCTAGGTGG
41 | TTTCATAACCATGGATTCTTATGCCTAACCCTAGCCGCTTTTCTTATCATCGTTTCCATAATTCAGTGAT
42 | GGCGAGATGTCGTGCGAGAGGGCACATACATGGGTCATCATACCAGCTTAGTAACTACCGGCTTACGTTG
43 | GGGTATAATTTTATTTATTACTTCAGAAGTGATATTTTTCCTCGCCTTCTTTTGAGCCTTTTTCCACAGA
44 | AGCTTATCCCCCACACCAGAAATTGGCTGTTCCTGACCTCCAACAGGAATCCACCCCTTAAACCCATTCA
45 | GAGTCCCCCTGCTAAACACTGCTGTTCTTCTAGCCTCAGGGGTTACAGTAACCTGAGCTCATCACAGACT
46 | AATAAGAGGTAAGCGTATTGATGCCACTCAAGCACTAATTCTAACTGTCTGCTTAGGTGCCTACTTCACT
47 | TTCCTCCAAGCTGGCGAATACATAGCCGCCCCATTTTCTATTGCTGATAGGGTGTATGGCACTACATTCT
48 | TTGTGGCAACTGGATTTCATGGGCTTCACGTTCTAATCGGGTCATCTTTCCTAGCTATTTGCTTAGCGCG
49 | TACATGGTCCCACCACTTCTCTGCTGGGCATCACTTTGGATTCGAAGCCGCTGCCTGATACTGACACTTT
50 | GTAGATGTGGTGTGAATCTGCCTATATCTATGTATTTACTGATGAGGCTCCTATATAACTTAGTGTATGG
51 | TGCACGAAGGTTTTTGAAACCTAAAGCCTAAGTTCAAATCTTACAGTTATAAATGATTTTAACCTCTTTC
52 | ATATTAATGATAATCGCCACAACTTTTACCCTATATCTAGCTTCCACCCCTATTGTCTTAGGTGTAAATA
53 | TTCTCATAATAGCCCTACTCTTAGCCTCCACATTTGCGTCCTTTATAAGCTCCTGATTTGCATTTTTAAT
54 | TTTTCTAATCTACATCGGCGGCATGTTAGTCATATTTGCCTACTTTCTAGCTTTAACCCCCAACCAACAA
55 | ATTTCAAACTTCAATATTATACCATATGCTCTAATCACATTATTAACATTTTCGGCACTAACATACACCA
56 | CCAACATTAAAATCCCCACTTTTTCTGATATTAGTCAAGGAAACTCAATTTTGTATATATCCAGAACTGC
57 | ACCATTCCTCATCCTTCTCGCCCTAATCCTCCTCCTTACGATAGTTATTGTAGTAAAATTAACCAGACGG
58 | TCAAGGGGCCCTCTCCGCCCATTCTCCCCATATGTTCAAACCTATCCGAACCACACACCCGGCAATTAAA
59 | ATTATTAATAGTACCCTAATTGATCTTCCCGCGCCTAATAATATCTCCATTTGATGAAACTACGGATCAC
60 | TTCTGGGCCTATGCCTTGTAATCCAAGTTTTAACAGGCCTATTTCTAAGAATACACTACGTACCCAACAT
61 | TGAAATAGCTTTCTCATCAGTAGCCCTCATTTCTCGAGACGTGAACTACGGCTGACTTCTTCGGTCTATT
62 | CACGCTAATGGAGCATCTATATTTTTTCTATTTATCTATCTCCATGCGGGCCGAGGTCTATATTATGGCT
63 | CGTATAACCTCAGTGAAACTTGAAATATTGGGGTAATTTTATTTCTTCTCACCATAGCCACTGCATTCAT
64 | AGGTTATGTTCTGCCCTGGGGACAAATGAGATTCTGGGGAGCGACAGTAATTACTAACTTATTCTCAGCA
65 | ATTCCCTACATCGGGAAAACTTTGGTAGAGTGAATTTGAGGTGGGTTTGCAGTAGATAACGCTACCCTAA
66 | ACCGATTTTTCGCATTCCATTTCATTCTCCCGTTTGCTATTATAGGGGCGACTATCCTACACATCATATT
67 | TCTTCACGAGTCAGGATCTAACAACCCCATTGGTCTAAATGCAGACTCCGACCGAATCCCGTTTCATCCC
68 | TATTATTCTATTAAAGACACCCTCGGTTATACGTTAGCAATTTCAGCTCTATCTTTAATAGTTTTATTCG
69 | AGCCTAACTTATTTACCGACCCTGAGAACTTTTTAATAGCAAACCCTCTTGTAACACCTATTCATATTAA
70 | ACCTGAATGATATTTTCTATGAATATATGCCATTCTACGTTCAATTCCTAATAAGCTAGGGGGGGTGATA
71 | GCACTATTCGCAGCCATCGTTATTCTATTTATTCCACCGCTAACAAGTGTCATAAATAAGCGGAGCCTCT
72 | CATTTTACCCCCTAAATAAGACAATATTCTGAGGCCTTGTAGCATCCTGAGCTATTCTTACATGAATTGG
73 | GGGTCGACCTGTAGAAGACCCCTTTATCATCATCGGGCAAGTATTTACATCCTTGTACTTTATCTACTTT
74 | ATTTCCAGTCCTACCATCTCTAAACTCTGGGACGATTCTATTATTATCTCCTTAGAAAACACGTACCAAC
75 | TCAAAAAGATATACCTCCCCGAATATAAATTATAAGCCCTTAACAAAGACTTTAAGTTAAACAAACTAAA
76 | AACCTTCAAAGTTTTCATTGGGAGTATCCAAGTCTTGCCATGATACCAGATATTTTTTCCTCCTTTGACC
77 | CCTATATATTTAATACCCTGTTCCCACTCAACTCTCTATTCTTAGTAACAAACACAGCTATCATTCTGAT
78 | AATTCAGTCGTCATTTTGAGTTTTAAACGCTCGAACCTCAGCATTTAAGTCTCCAGTCAATGATACAATT
79 | TTCACTCAACTATCCCGCACATCTACCACACACCTCAAAGGTCTATCAACCCCATTATCCACCATCTTCT
80 | TTATACTAGTAATAATCAATCTCATGGGATTAATTCCATACATGTTTAGAACATCTAGACACCTAGTATT
81 | CACCCTTTCCCTAGGGTTCCCCATCTGACTAAGTCTTATAATCTCTACGTTCGCTCACAGCCCCAAAAAG
82 | AGAACAGCTCACTTTCTCCCTGACGGGGCCCCAGACTGATTAAACCCGTTCTTAGTTCTAATCGAAACAA
83 | CTAGAGTTTTCGTCCGACCTCTAACACTATCTTTCCGTTTAGCCGCTAACATAAGAGCAGGGCACATCGT
84 | CTTAAGACTTATAGGAATCTACTGTGCCGCCGCATGATTTTCAAGTGTTTCAAGAACAGCACTCCTAATC
85 | TTAACTGCCATCGGATATATTCTATTTGAAGTAGCAATTTGTTTAATTCAAGCTTATATTTTCTGCCTAC
86 | TCCTATCCCTATACTCAGATGATCACGCCCATTAAAACAATAAGCATTAAAAAAAATGCGCGCCGATTTC
87 | GACTCGGCGAGAGCACAAAGCATTGTTTTTTTACTTAGTTTATACTATACTCTATATATATATACGCATT
88 | TGTGTACTCTGATTGGGGGGGGGGGGTAATTTCACAAAAAGCTATAATCCGAAAAGGCCCGACCGGGCGA
89 | GAAAAAAAAAAAAAAAAAAAGAAAAAGTGGTGTTTTTAGGTTCTAATCCTTTAGAATGATGCCAATTTCG
90 | GAAAAACTCGACAGGGACTTTTTAAATTTCGGTCCTTGCTAATATGGGCACGACGTATATTTGCGGTATT
91 | TACATAAGAAACGGCCTGTATCGAGCAAAATTTACAGTCTGTCGGGGGAAAAAATTTTAACCTAAAAAAT
92 | TGTTCGGCGTGGGGCCTTTTTTTTTTCAGTTTTTAACATTAAAAATTTTCTCGGAGTTCTAATCATAAAG
93 | GTAGGTTACAAAAACCCCCGAATTGTGGTTCCGGAAACGTCAAAAGACCCTTTTTCATGCTTCGGATATT
94 | TAATACTAGACTTGTGGCCAGTAAACTAATATGGGTTATCTTTACTGGGATGCTGGCGCCCACCCTATAC
95 | ATAGTGCACTGTAATTCCACCTTACTTCTAGAGTGAAATCTTTTCTCTATTTCCTCTACCCCCATAATAA
96 | TAACTATTATCCTAGACCCCCTGGGACTGATATTTTCTTGCACCGTAGTAATAATTTCAGCCAATATTCT
97 | AAAATTCTCAACTATCTATATGAAGGAAGATAAATTTATCAACCGTTTTACAGTCCTAGTGCTGCTCTTT
98 | GTCTTATCTATAAACATACTAATCTTCTTTCCCCACTTAATTATCCTACTACTTGGTTGAGACGGCTTGG
99 | GAATTGTATCCTTTATCCTAGTCATTTACTACCAAAATCCAAAATCTTTGGCAGCTGGTATAATCACAGC
100 | TCTCACTAATCGTATTGGGGATGTTATACTCCTCTTGGCTATCGCGTGAACTCTAAACCAGGGTCACTGA
101 | AATATTTTACATATGTGGGCNGTCGACGAAAACATATATCAGGCATTAGTTATCATTATCGCAGCTATAA
102 | CTAAAAGAGCCCAGATGCCGTTTTCCAGGTGGCTCCCAGCAGCTATAGCTGCACCTACCCCAGTCTCAGC
103 | CTTAGTGCACTCATCAACCTTAGTTACCGCCGGAGTATTCTTATTAATCCGATTTTATAACTTTCTATCT
104 | TCTGTGTGATGATTCACTACCTTTCTACTTTTTGTAGCTGTTAGTACTACTTTAATAGCCGGGTTGAGAG
105 | CCTCTTCTGAATGCGACATAAAAAAAATTATTGCTTTGTCAACCCTTAGACAACTTGGAATAATGATAGC
106 | TGCTATAGGCTTAGGGATGGCCCATATAGCCTTTTTCCATATAGTAACCCACGCTATATTTAAGGCTCTT
107 | CTCTTTGTGTGCGCCGGAAGATTTATTCACAGACATATGCACAGTCAAGATCTTCGTTGAATAGGTAATC
108 | TCACTAAACAAATACCTACTACCACCTCATGCTTAATTATAGCAAATCTAGCTCTTTGTGGGTTCCCCTT
109 | TATGTCAGGTTTTTACTCTAAGGATATAATTGTGGAAGCTTCGCTCTACTACCCCCATAACTCACTTATA
110 | ATTAATCTAATCTTATTTGCAGTCGGTTTAACTGCATTCTACTCAACTCGATTTACCATGTGCGTAGTCC
111 | TTTCTCCCAATAACTGTGGTCCTTATATACATTTGGAGGAGAGCAACTCCCTCACATCTCCTATACTGCT
112 | TCTAGCTTCAATATCAGTTATTTCCGGGTCAGCTCTTACATGAATTCTGCCGTTAAAACAAGAAATAATG
113 | ATAATCCCCCTTGACCAAAAGCTTAAAACCTTAATATTAGTCACTTTGGGTGCACTTATATCCTGGTTCT
114 | TTCTAACAACGACAAATATAACTAAAACATGCCTATACATTCGTCACCCAATTATTAACTACTTCTCATG
115 | CACTATGTGGTTTCTAGTCCCCCTTTCATCTCAATTTATAATAAAACTCCCAATATATGTATCACACAAC
116 | TACTTAAAACTGACCGATCAGTCATGGTTGGAGTTACTCGGGGGGCAAGGTATTAATAACGTATCAAGTA
117 | AAGCCTCCAATATCTATCTGGCATCCTTAAAATCTACACCTATGAACTACCTAATAATGTCGTCGATACT
118 | ACTACTAGTCGCCACCTTAGTCGCAATTTAGTCTAGATAGCTTAAAATAAAGCATGTTATTGAAGATAGC
119 | AATATGGGAGTTCCTCCGGACAGTGTATGTGGTGTAAGTCAACACATTAGCTTTTCATGCTAATAATATA
120 | CATCCGTATTATACACAGATAGTAGTTTAAGTAAAACTCTGACCTTGGGTGTCAAAAATCACTTCGGTGT
121 | TATCTGAGAATTGAAAGCTAATTAACAGCATCGATCTTGTAAATCGAAGATAGAGACTACCTCTCAATTC
122 | TATGTACAGTTCTATTATAAGTTTAGTCTTTCTCCTTCCAATCGTCGCTGTTGTAAATCTAATCTCAAAT
123 | CAATCACACTTTTTAATAACTCTTCTATCACTTGAAGGTATCACACTGAGACTGGTTCTATTTGTTCCAA
124 | TCTCTCTCTCTATTATAAGTGCCTCTAATGTTAGAATTAGGGTCATTTTATTGACTTTCGGGGCATGTGA
125 | GGCCAGCTTAGGACTAAGCCTCATGGTATTAATATCCCGATCCTACGGAACTGATATATTAAACTCACTT
126 | ACAGCAAATAAATGTTAAAGCTCCAAATAGTATTAATATCTCTGCTCCTACTCCCACTCATTGTAAATCT
127 | GTACCCCTGAATTATCGCTCTGACTCTTAGAGCTTTATTACTACCCACCTGTTTCAATTTGGTAAACAGA
128 | GCATCCTACTCAATATTTACAGAATACATATCCTCTGATATAATGTCATTTACACTCTCCGCTCTAACAA
129 | TCTGAGTCACTGTAATAATAATCCTCGCAAGAACTAAAATTATGCATTTAAATATGTACCCCAAAATATT
130 | TATGACAAACTTAGTTATTTTGCTAATTATTCTAATTAATTGCTTCTTATCCCCCAATCTAATTATATTT
131 | TATATTTGATTTGAAGCATCCTTAATTCCAACTATAGTGCTAATCATGACTTGGGGCTATCAGCCAGAAC
132 | GATCTCAAGCAAGAATATATTTAATAATCTATACAGTCGCTGCCTCCCTCCCAATGCTTATAGTGCTATG
133 | TAAAATTTTTATCGTGTCCAAAACAGCTATGATACCCATATTCATAAACATGGAGTTCCCTATAGACTAC
134 | CCATCTATGGCCCTAGCCTGAGTATTAACACTGGGGGGCTTTCTAGTAAAACTCCCTATATTTACAGTGC
135 | ACCTCTGACTTCCTAAAGCACACGTAGAAGCCCCAATCGCAGGGTCTATAATTTTAGCTGCAATTCTTCT
136 | AAAACTTGGGGGTTACGGCATTCTTCGCATACTAAGATTATTTCACTATATAGCTAAATCAACCTCAAGA
137 | CTTCTTTCTAGGGTAGCTTTAGTCGGGGCAGTCTCAACAAGATTAATCTGTCTCCGCCAATCAGACCTAA
138 | AATCCCTAATTGCTTACTCATCTGTTGGACATATGGGTCTAATAGTCGCGGGCGCTTTAATAAGCTCTAA
139 | TTGGGGGTTCCAAGCAGCTCTAGCTATAATAATTGCCCATGGGCTGAGCTCATCCGCCCTATTTGTAATA
140 | GCAAATATAAACTATGAATTAACCCATACTCGAAGCCTATTCTTAATAAAGGGCTTGTTAGTTTTAGCAC
141 | CGACGCTCACTATATGGTGATTCCTGTTTACAGCTAGAAATATAGCAGCCCCCCCTTCCATTAACCTACT
142 | CAGAGAGATTATGTTAATTACATCTATTTTAAAAATATCTACTTCAGCTTTTATTCTTCTAGGTCTAACA
143 | AGATTCTTTACAGCTGCTTATTGTTTGTACATGTATACTTCTATACACCACGGACCCTTAATACTAACCT
144 | CTAACCCAATCCCTCAATTCAAAGTAAAAGACCTAACTCTTATAACTATACACTTAGTTCCTACAATTCT
145 | TATTATCTTTAAGCCTGAACTAATCACAAGATGGTCCTGATGGCATAGTTAAACAATAACATTAAATTGC
146 | AAATTTGATATTATACTAATAGTATTACCATCTAGTAAGATAAGCTATTCAAGCTAGTGGGTTCATACCC
147 | CGAAAAAGAGATACTCTCTCTTACTATCAGTTTGATTCTGGCTGACAATTTAGCTCTCTTAAACTAATAT
148 | ACATGTAAGTCTAAGCTACCCCGCACCACATATAAACCCTAAATGGAGAATAACTATATTAGACATATCT
149 | CATATGGTAAAGCGTCACAGTAAGAAAATCTACCTTATTGCAAAAGCAAGAGCTGGTTATTAAGATCAGA
150 | GTTGGCAATATTCGTGCCAGCTGCCGCGGTTAGACGATAAACTCAAGCTAATTCATATAAGACTAATTGC
151 | AAGGCGATCTAAAAAATAACTCAAAGTCTAATTTATATAATCCGAGACCCGTAAACGCCTATTTACCGTA
152 | AAACCATAGACTAAAACACGGATTAGATACCCGTCTATTTATGGAGTAACTAAAGTCGAAAAATACGAAC
153 | TACAGTTTAAAACTTAAAGATTTTGGCGGTGTCTTATCAACCCAGGGGAACCTGTCTCATAACTCGATAA
154 | CCCACGACACTCTCACCCTCCCTAGACTAAACAGCTTGTGTACTGCCGTCGTAAGCACACCTCTAAAAGC
155 | CAAGGAAGTGTGCAATAATGATTGTCTCACCCACGTCAGGTCAAAGTGCAGCCCATGGACGGAGATGATG
156 | GGTTACACCTAAACAAAGATACGGAATACAGCATTAAGAGCTGTGTAAAGGAGGACTTGAGTGTAACAGT
157 | ATTACAAAATTAAAGTGAATCTGAATCTAAGACATGCACACATCGCCCGTCACTCTCGCCTAAAGGCGAG
158 | ATAAGTCGTAACAAAGTAGGTGTAACGGAAGTTGCCCCTGTCGAAGTATAGCATATATAATGCCTTTTAC
159 | TTACACTAAAAATAAAACATTTGTTTACTTCGCTGCCTATGTTTATCTTATAAATAAAAACTAATAAAAA
160 | CACTTAAACTGATAATTTCATAATAAATCTTTACAATAGTACTATAGAGGAAGTAGTCAACATAATAAAG
161 | TAGTGGTTTATACACGTACCTTGTGCATCATGGTTTTACAAGCCTCAAATTAATAATATTACCCGAATTC
162 | TAAGCGAGCTGTCCCTTCATAGCTAAGAGCCCACCACTAGTTGTAGCATCAACTTTGGAAAATGGGGGGA
163 | TAGGGGCTACATACCAATCGCGCTAGAAAATCTCTGGTTTTCAGTAAAATTTACCAATAAAACATGTAGC
164 | GTCCACCTACACTGCAAGACTACAGAGGATAAGCCCTGTATTCAAAAACTAGATATGCCCTCCTTCCAGT
165 | ATAGGCCTAAAAACAGCCACTAATAGTACCTCACCGTAAACACCATTAAATTAATTCTATACCCTGTTCA
166 | AATAAACTGAAATTTTTGACAAACCTTAAATCTTAAAATATTATGTAAAAATAAGTATTAATTTTATAAC
167 | CTAAGTTACAGCTACCATGTATGTATTATATATTTACAACTTATAAAGGAACTTGGCAAATTCTTATTTC
168 | GACTGTTTAACAAAAACATTGCTCTCAGTAACCTTAATTAAGAGTAACTCCTGCCCAGTGAGTAATTCAA
169 | CGGCCGCGGTATCCTAACCGTGCAAAGGTAGCATAATCACTTGCCCATTAATTGTGGGCTAGAATGAAGG
170 | ATAAACGAAATAAATACTGTCTCTATAAGCCGCTTAAAAATACCCTCTAACCGAAGAGTGTTAGATAGCG
171 | TCGAAGGACAAGAAGACCCTATAGAGCTTAATTTAAATAAATATGAAAAAATTTACTAAAATTCGGTTGG
172 | GGCGACCAGGGAATTACCCATCATCCCTAAACAAAAGATAAATGTATCAAAACACTGACCCTTCTACAAG
173 | ATCATTAAAACAAGCTACCTTAGGGATAACAGGCTAATCTCACTAGAGAGTCCTTATCAATAGTGAGGAT
174 | TGGCCCCTCGATGTTGGCTTAGGGAATCTCTATGACGCAAAAGTCATATAAAGATGGTTTGTTCAACCAA
175 | TAACACCCTACATGAGCTGAGTTCAGACCGCGTAAGCCAGGTTAGTTTCTATCCTCGATCACTTTATCTA
176 | TTTATAGTACGAAAGGACCTAATTAGAGTAATATTTACACACAGGGAATAAATATAAACCATTCTAAGTA
177 | AAATAAATCATAAACTATGAGTAAGTTGGCAGAATAGTGCGACCGACTTAGGATCGGTTCATGGGTAAGC
178 | CCACCTACTATGCAACTTAGTTCATTTAGAATAACCAACCTGCACTTGGTAGGAGAGATATCTCAGTAGC
179 | GGTTTGATGTTTCGGAAATACTGGACCTTGAACGTCCACTTAAGGTGTTCGACTCACCTTCAAATCACCA
180 | AGATGGCAGAATAGTGCCATAGGTTTAAACCCTATTCATGAGTAGTCATACTCTCTTGTTACATGAATAT
181 | CCCGTTTTTTACATCTGTGTTAATAAGATTGGTACTTGCGCTCCTAGCAATAGCCTTCTACACACTAATA
182 | GAGCGAAAATTCCTTGGGTACTTCCACCTACGAAAAGGGCCTAATAAGGTAGGGCTAATGGGGCTTCCGC
183 | AACCATTTGCTGACGCAATTAAACTTTTTGTAAAGGAGCAAGCTAAACCTAACCCCTCGAATCAAACCCC
184 | GTTTCTATTTGCCCCTACCATAGGATTAGTTTTAGCTCTCTTAATGTGAGTAATCTACCCCCATTCCCAT
185 | CAATCATTTTTCATTCAATTTAGGGTCCTATATTTTTTATGTGTATCAAGAATAAACGTATATACGACCT
186 | TTCTCGCTGGCTGAAGATCCAACTCTAAATATGCTCTACTGGGAGCTTTACGGGGGGTTGCTCAAACTAT
187 | TTCATACGAGGTTAGAATATCCTTAATTCTTCTCAGATCCTTAATTATTTTATCAACTATAGATTTCACT
188 | AAAATATTCTCGTATTCCTGAATCCTATTTATATTTATTCCCTTGGCAGTAGTATGGTTCATTACTAATC
189 | TAGCAGAGACTAATCGAACCCCGTTCGATTTTGCAGAGGGCGAGTCAGAACTGGTATCCGGGTTTAATGT
190 | TGAGTACAGGGCTGGCCTTTTTGCCTTAATCTTCATAGCAGAGTATGCAAATATCTTAATTATGAGCCTA
191 | TTCACAAGTGTTATTTTTATAAGGACCTGCGCTAGGGGTATAGCCAGAGATCTAGTCCTAATTCTTCAAA
192 | CTATTACCTTAGCCATGCTCTTTGTGTGGGTTCGAGCAACATACCCCCGAATACGTTACGACCATCTAAT
193 | AAACCTCACATGAAAAAGATTTCTCCCCCTATCTCTAGCCCTATTAATGATATCTATTCCAATCGCAATA
194 | ATGCTGTGGTACAGCGCCGGATGAACGGATAACTCTGATGACGTTAATTAAGGAACAAGCTTCCCTGTAT
195 | CTAACTAGAGAGCTTGTAAATAGCACTTGACTTTTAATCAAGAGATAGTATAATTATTTCTAGTTAATGA
196 | TCCTTACAGCCTTATCCTCAGCCATTGCACTATTAGTCCCTATTATTATTTTGGGGGCAGCATGAGTTCT
197 | AGCTTCACGATCTACAGAAGATCGAGAAAAGTCATCTCCATTCGAGTGTGGGTTTGACCCAAAAAGAACC
198 | GCACGGATCCCATTCTCAACCCGATTTTTCTTATTGGCCATTATCTTTATCGTATTTGATATCGAGATTG
199 | TTCTTTTGATACCCCTACCCACAATCTTACACACGAGAGATGTATTTACCACTGTAACTACGTCTGTCCT
200 | ATTTCTAATAATTCTTTTAATTGGTTTAATCCATGAGTGAAAGGAAGGATCTCTAGACTGATCTTCCTAG
201 | ATTAACAGGCGAAAATAAGAACTTCTAATTCTTACACGGGGGTTCAACTCCTCCTTAATCTTATGAAATA
202 | CATCAAATCACCTACAATAGCTCTTGCTATATCCACGCTAATTATAAGGACGCTCATAGCCGTATCAAGA
203 | GCCAATTGAATATTCCTTTGAGGGGCTATAGAACTTAATCTCTTAAGATTTATTCCTATTATAATACAAT
204 | CTAACAACAATCAAGAAACAGAGGGAGCTGTGAAATACTTTCTAGCTCAAGCACTGGGATCAGCCTTACT
205 | TCTAATGTCAAGAACATCTATATGAATAACATTCTCCATAATCTCAAACTTTATACCTTTAACTCTTATG
206 | GCCGCGATTATACTAAAATTGGGCAGAGTTCCCTGTCATTTCTGATACCCGTCAGTAATAGCTTCAATTT
207 | CATGAGTATCATGCCTAATCTTATCCTCCTGACAAAAGCTGGCCCCCTTATCTATTTTAGCCTTTTTACT
208 | CCCTCAAAAAAACATAAACTTTATACTATCAATAGCTGCAATAAATGCGCTGTTGGGGGGGGTGATCGGA
209 | ATAAATCAAACTCAACTACGGACCATTATAGCATACTCCTCAATTGGGCATATCGGTTGAATGATAAGAT
210 | TAGCTGCCGTATATAAGCCGAGTTCTTGCATTATATATTTTGTAGTCTACTGCATTTTAATTACCCCTCT
211 | ATTTATAACCATGGGCTATCTAAACATATTCTCTACTAAACACATAAGAAAACTTTCCTCCTATAGAAGA
212 | ACTGTCCACATAGCTTTATTGATAGTTCTTCTATCATTGGGAGGACTACCCCCTTTTACAGGATTCATGC
213 | CAAAACTTATAACCATTATATTGTTAATGCAATCCATAAAAATTATTCTACTTATCCTAATCGCGGGGTC
214 | TATTATAAACCTATTTTTCTATTTAAATATTATTATCTCTTCTATGCCCCTACCCCCACATCTAAAAAAT
215 | GTCGACTCCACTGATATTAAATGTTCATTAAAATTTGTTATTCCAATCTGTACCCTGTCATTAGGGTTGA
216 | GACCTTTTATTATACTAT
217 |
218 |
--------------------------------------------------------------------------------
/task5/README.md:
--------------------------------------------------------------------------------
1 | # Task5 - Comparative genomics - gene set comparison
2 |
3 | This task is a tutorial on comparative genomics with a focus on gene set comparison.
4 |
5 | You are going to download, annotate and compare the genomes of an enterohemorrhagic E. coli (strain [O157:H7](https://en.wikipedia.org/wiki/Escherichia_coli_O157:H7)) versus a non-pathogenic E. coli (strain [K12](https://en.wikipedia.org/wiki/Escherichia_coli_in_molecular_biology#K-12)). You are then going to identify genes and gene duplications that are unique to each organism and biologically interpret your results.
6 |
7 |
8 | ### Requirements
9 |
10 | * Access to a linux-based OS running BASH
11 | * [BLAST](http://blast.ncbi.nlm.nih.gov/)
12 | * [prokka](https://github.com/tseemann/prokka)
13 |
14 |
15 | ## Getting Started
16 |
17 | * Login to your linux environment and create a new folder for your task5.
18 | * Work on your assignment in the folder you created.
19 |
20 |
21 |
22 | ## Retrieving the raw data
23 |
24 | You will be comparing two genomes of E. coli - strain K12 (non-pathogenic lab strain) and O157H7 (pathogenic E. coli associated with disease outbreaks).
25 |
26 | * Download both of these genomes using `wget`. Check the man pages for `wget` or use the --help flag to determine how to save the files with the following file names.
27 | * Name the O157H7 genome O157H7.fna
28 | * Name the K12 genome K12.fna
29 |
30 | ```
31 | ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_O157H7_EDL933_uid259/AE005174.fna
32 | ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__DH10B_uid20079/CP000948.fna
33 | ```
34 |
35 |  Q1 - Look within the ftp directories for these bacterial genome projects. What do the other files contain (e.g., go to: https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_O157H7_EDL933_uid259)?
36 |
37 | ## Annotating both genomes
38 |
39 | * Next, annotate both genomes using `prokka`.
40 |
41 | ```
42 | #this may take a while so be patient... Remember, these are full bacterial genomes as opposed to small mitochondrial contigs.
43 | prokka O157H7.fna --outdir O157H7 --norrna --notrna
44 | prokka K12.fna --outdir K12 --norrna --notrna
45 | ```
46 |
47 | ## Generating gene lists
48 |
49 | * Next, make text files of the predicted gene lists for each genome. Have a look back at Task1 for how to redirect program output to a file.
50 | * Redirect the output for O157H7 to a file named genelist_O157H7.txt inside the O157H7 folder.
51 | * Redirect the output for K12 to a file named genelist_K12.txt inside the K12 folder.
52 |
53 | ```
54 | #generate a gene list text file by grepping the gene names from the .tbl file
55 | cat PROKKA*.tbl | awk '{if ($1 == "gene") {print $2}}' | awk -F '_' '{print $1}' | sort
56 | ```
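For example, the same pipeline can be redirected into the requested file; a minimal sketch, assuming you run it from inside each prokka output folder (shown here for O157H7):

```shell
# run inside the O157H7 prokka output folder;
# writes the sorted gene list to the file requested above
cat PROKKA*.tbl | awk '{if ($1 == "gene") {print $2}}' | awk -F '_' '{print $1}' | sort > genelist_O157H7.txt
```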
57 |
58 |
59 |  Q2 - How many genes are present in each genome?
60 |
61 |
62 | ## Comparing gene lists
63 |
64 | Now, let's compare these lists to find genes that are common to both, duplicated or unique to either.
65 |
66 | A lot of this work can be done simply using the command-line tools `comm` and `uniq`.
67 | Explore the command-line usage and options of these commands using `man`.
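One behaviour worth knowing up front: `uniq` collapses only *adjacent* duplicate lines, which is why lists are normally sorted first. A toy example with made-up gene names:

```shell
# without the sort, the two geneA lines would not be adjacent
# and uniq would keep both of them
printf 'geneB\ngeneA\ngeneA\ngeneC\n' | sort | uniq
# prints:
# geneA
# geneB
# geneC
```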
68 |
69 | ### Comparison including gene duplicates
70 |
71 | We can compare both gene lists like this in your task5 folder:
72 |
73 | ```
74 | comm O157H7/genelist_O157H7.txt K12/genelist_K12.txt >geneListComparison.txt
75 | ```
76 |
77 | Examine the output of `geneListComparison.txt` using `less`.
78 |
79 |  Q3 - What do the genes in column 1, column 2, and column 3 represent?
80 |
81 | Now, suppose we want to output the genes in column 1 (skipping blank lines). We can do so like this:
82 |
83 | ```
84 | cat geneListComparison.txt | awk -F '\t' '{print $1}' | grep -v -e '^$'
85 | ```
86 |
87 | ... and we can then count them by piping this command to `wc -l`.
88 |
89 | ```
90 | cat geneListComparison.txt | awk -F '\t' '{print $1}' | grep -v -e '^$' | wc -l
91 | ```
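To see what each stage of that pipeline does, you can feed it a small tab-separated toy input (made-up gene names, mimicking `comm`-style output):

```shell
# awk keeps only the first tab-separated field;
# grep -v '^$' then drops the lines whose first field was empty
printf 'geneA\t\n\tgeneB\ngeneC\t\n' | awk -F '\t' '{print $1}' | grep -v -e '^$'
# prints:
# geneA
# geneC
```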
92 |
93 | * Analyze the core versus variable gene content for these two strains.
94 |
95 |  Q4 - How many genes are only in the O157H7 genome? Only in the K12 genome? In both?
96 |
97 |
98 | ### Comparison without gene duplicates (finding unique genes)
99 | Construct a simple Venn diagram illustrating the number of shared vs genome-specific genes. In this Venn diagram, if a gene G is duplicated (2 copies) in genome A but present as a single copy in genome B, then gene G will be counted among the genes unique to genome A. This is misleading, since genome B also has a copy of gene G.
100 |
101 | Let's account for this by "removing gene redundancy". This can be done by first filtering each initial gene list using the tool `uniq`. This works here because the gene lists were already sorted when they were generated.
102 |
103 | ```
104 | cd O157H7
105 | uniq genelist_O157H7.txt > unique_genelist_O157H7.txt
106 |
107 | cd ../K12
108 | uniq genelist_K12.txt > unique_genelist_K12.txt
109 |
110 | cd ..
111 | ```
112 |
113 | Now, when we compare these lists using `comm`, we will only be comparing single copies of each gene. Therefore, the result of `comm` should show us only those genes that are unique to genome 1, unique to genome 2, or shared between both.
114 |
115 | ```
116 | comm O157H7/unique_genelist_O157H7.txt K12/unique_genelist_K12.txt > uniqueGeneListComparison.txt
117 | ```
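As an aside, `comm` can also suppress columns directly with its -1/-2/-3 flags, which avoids awk/grep post-processing. A toy sketch with made-up gene names:

```shell
printf 'geneA\ngeneB\ngeneC\n' > listA.txt
printf 'geneB\ngeneC\ngeneD\n' > listB.txt
# -12 suppresses columns 1 and 2, leaving only lines common to both sorted files
comm -12 listA.txt listB.txt
# prints:
# geneB
# geneC
```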
118 |
119 |  Q5 - How many unique genes are only in the O157H7 genome? Only in the K12 genome? In both?
120 |
121 |
122 | ## Going further: inspecting your duplicated and unique genes within each organism
123 |
124 | Now, let's examine the output of the lists above.
125 |
126 | Examine `geneListComparison.txt` to find the gene expansions specific to enterohemorrhagic E. coli O157:H7. The following code will sort the O157:H7-specific genes by their copy number. This will identify those that have undergone the most pathogen-specific duplication.
127 |
128 | ```
129 | cat geneListComparison.txt | awk -F '\t' '{print $1}' | sort | uniq -c | sort -n -r | head -20
130 | ```
131 |
132 | Examine your result carefully. Column 1 states the copy number and column 2 states the gene name.
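The `sort | uniq -c | sort -n -r` idiom is a general-purpose frequency counter worth remembering. On a toy list (made-up gene names) it ranks items by how often they occur:

```shell
# count occurrences of each line, then rank counts from highest to lowest
printf 'geneA\ngeneB\ngeneA\ngeneA\ngeneB\ngeneC\n' | sort | uniq -c | sort -n -r
# geneA (3 copies) ranks first, then geneB (2), then geneC (1)
```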
133 |
134 |  Q6 - Which gene in O157H7 occurs most frequently? Which genes in K12 occur most frequently?
135 |
136 |
137 |
138 | ---
139 |
140 | # ASSIGNMENT QUESTIONS
141 |
142 | The questions for this task are indicated by the lines starting with Q1, Q2, etc. above. Please submit your answers using the quiz on LEARN.
143 |
--------------------------------------------------------------------------------
/task6/README.md:
--------------------------------------------------------------------------------
1 | # Task6 - Resequencing: variant calling from NGS data
2 |
3 | In this lab (based on this [tutorial](https://angus.readthedocs.io/en/2014/variant.html)) you will be exploring some resequencing data from Richard Lenski's famous E. coli long term evolution experiment.
4 | More on Lenski's long term evolution experiment can be found in this [article](http://www.nature.com/nature/journal/v489/n7417/full/nature11514.html), with a summary here: https://en.wikipedia.org/wiki/Richard_Lenski.
5 |
6 | You will be mapping the reads from a single population of E. coli at 38,000 generations that has evolved citrate-utilization capacity (Cit+). You will map the reads to the reference genome in order to identify SNPs that have occurred in this lineage (or its ancestors). You will focus on one SNP in particular that has created a "mutator" strain by disrupting the mutS DNA-repair gene.
7 |
8 |
9 | ### Requirements
10 |
11 | #### Command-line tools
12 | * Access to a linux-based OS running BASH
13 | * [bwa](http://bio-bwa.sourceforge.net/)
14 | * [samtools](http://samtools.sourceforge.net/)
15 | * [bcftools](https://samtools.github.io/bcftools/bcftools.html)
16 |
17 | #### Graphical tools
18 |
19 | * You will also need to download and install either Tablet or IGV on your own machine.
20 | * [tablet](https://ics.hutton.ac.uk/tablet/)
21 | * [igv](http://software.broadinstitute.org/software/igv/)
22 |
23 |
24 | ## Getting Started
25 |
26 | * Login to your linux environment and create a new folder for your task6.
27 |
28 | ```
29 | mkdir task6 #creates folder
30 | cd task6 #enters into folder
31 | ```
32 |
33 | ## Retrieving the raw data
34 |
35 | First, download the E. coli reference genome:
36 |
37 | ```
38 | wget https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task6/ecoli-rel606.fa.gz
39 | gunzip ecoli-rel606.fa.gz
40 | ```
41 |
42 | The resequencing data is located at `/data/SRR098038.fastq.gz`. This is a 229 MB file, so it was already downloaded for you using the following command:
43 |
44 | ```
45 | #wget http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098038/SRR098038.fastq.gz
46 | ```
47 |
48 |
49 | ## Mapping your reads to the reference genome
50 |
51 |
52 | Before we can map reads with `bwa` (note: `bowtie` is another popular option), we need to index the reference genome. This can be done with `bwa index`.
53 |
54 | ```
55 | bwa index ecoli-rel606.fa
56 | ```
57 |
58 | Now, let's map the reads to the reference genome. This is also a fairly intensive step that may take a few minutes.
59 |
60 | ```
61 | bwa aln ecoli-rel606.fa /data/SRR098038.fastq.gz > SRR098038.sai
62 | ```
63 |
64 | Make a .SAM file which contains all information about where each read maps onto the reference genome.
65 |
66 | ```
67 | bwa samse ecoli-rel606.fa SRR098038.sai /data/SRR098038.fastq.gz > SRR098038.sam
68 | ```
69 |
70 | Index the reference genome (again) so that `samtools` can work with it.
71 |
72 | ```
73 | samtools faidx ecoli-rel606.fa
74 | ```
75 |
76 | Convert the .SAM file to a .BAM file.
77 |
78 | ```
79 | samtools view -S -b SRR098038.sam > SRR098038.bam
80 | rm SRR098038.sam # remove this large file
81 | ```
82 |
83 | Sort the BAM file and index it.
84 |
85 | ```
86 | samtools sort SRR098038.bam > SRR098038.sorted.bam
87 | samtools index SRR098038.sorted.bam
88 | ```
89 |
90 | ## Viewing your BAM file
91 |
92 | BAM files can be viewed with `igv` or with `tablet`. But let's take a quick look in the terminal.
93 |
94 | ```
95 | samtools tview SRR098038.sorted.bam
96 | ```
97 |
98 | Type `q` to exit when finished.
99 |
100 | ## Variant calling
101 |
102 | Instead of identifying SNPs by eye, use `bcftools` to perform automated variant calling.
103 |
104 | ```
105 | bcftools mpileup -f ecoli-rel606.fa SRR098038.sorted.bam | bcftools call -mv -Ob --ploidy 1 -o calls.bcf
106 |
107 | #convert to vcf (human-readable variant call format). This file should contain all identified SNPs and other variants.
108 | bcftools view calls.bcf > calls.vcf
109 |
110 | ```
111 |
112 |  Q1 - How many total variants are present? Hint: `grep` for a pattern found only in your variant lines.
113 |
114 |
115 | ## Locating a key SNP in Lenski's E. coli evolution experiment
116 |
117 | This lineage of E. coli has a mutation in the mutS gene (protein sequence can be found [here](https://www.uniprot.org/uniprot/P23909.fasta)). This mutation creates a premature stop codon. Your task is to find this mutation within your sequencing data!
118 |
119 | Find the region in the reference genome that encodes the mutS gene using `blast`. You may need to refer to earlier tasks to help you with this.
120 |
121 | Now, extract the mapped area for this region from your .bam file. The command will be something like this:
122 |
123 | ```
124 | samtools view SRR098038.sorted.bam "ecoli:START-END" > region.sam # where START and END are position numbers
125 | ```
126 |
127 |
128 | Download the following two files and open these files in `tablet` on your home machine.
129 |
130 | - the region.sam file you created above
131 | - the reference genome ([ecoli-rel606.fa](https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task6/ecoli-rel606.fa.gz))
132 |
133 | Now, locate the region containing the mutS gene within `tablet`, and search for the premature stop codon variant.
134 |
135 | Here is an [example read-pileup](https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task6/example-pileup.png) in tablet that highlights a variant position.
136 |
137 |
138 |  Q2 - Paste a screenshot highlighting this mutation (you will need to zoom in) and show the amino acid translation.
139 |
140 |
141 |  Q3 - The premature stop codon mutation is from the codon “_ _ _ ” to “_ _ _ ”.
142 |
143 |  Q4 - The amino acid encoded by this codon before this mutation is _____?
144 |
145 |
146 | Once you are finished, please delete the files in your task6 folder with the `rm` command:
147 |
148 | > :warning: **Caution:** Be careful as `rm` permanently deletes files!
149 |
150 | ```
151 | cd task6
152 | rm *
153 | ```
154 |
155 |
156 |
157 | ---
158 |
159 | # ASSIGNMENT QUESTIONS
160 |
161 | The questions for this task are indicated by the lines starting with Q1, Q2, etc. above.
162 | Submit your answers to questions 1-4 to the QUIZ on LEARN.
163 |
164 |
165 |
166 |
--------------------------------------------------------------------------------
/task6/ecoli-rel606.fa.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task6/ecoli-rel606.fa.gz
--------------------------------------------------------------------------------
/task6/example-pileup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task6/example-pileup.png
--------------------------------------------------------------------------------
/task7/README.md:
--------------------------------------------------------------------------------
1 | # Transcriptomics and detection of differentially expressed genes (DEGs)
2 |
3 | In this lab, you will be analyzing RNA-seq data from a study by Aguiar et al. (2020) [here](https://pubmed.ncbi.nlm.nih.gov/31646766/).
4 | This study exposed a human lung cell line (Calu-3 cells) to tobacco smoke, cannabis smoke, and a common drug intervention (LABA/GCS).
5 |
6 | You will be measuring and comparing the transcript expression levels between normal untreated cells (controls) and cells exposed to tobacco smoke extract (TSE).
7 | There are 4 TSE samples vs 4 control samples as labeled below.
8 |
9 | | Sample ID | Status |
10 | | --------------- | --------------- |
11 | | SRR8451881 | Control |
12 | | SRR8451882 | Control |
13 | | SRR8451883 | Control |
14 | | SRR8451884 | Control |
15 | | SRR8451885 | TSE |
16 | | SRR8451886 | TSE |
17 | | SRR8451887 | TSE |
18 | | SRR8451888 | TSE |
19 |
20 | The goal is to identify which genes are up-regulated and down-regulated following tobacco smoke exposure.
21 |
22 | ### Requirements
23 |
24 | #### Command-line tools
25 | * Access to a linux-based OS running BASH
26 | * [Salmon](https://combine-lab.github.io/salmon/)
27 |
28 | #### Graphical tools
29 |
30 | You will also need to download and install R on your own machine with the following packages
31 |
32 | * tximport
33 | * DESeq2 or edgeR
34 |
35 |
36 | ## Getting Started
37 |
38 | * Login to your linux environment and create a new folder for this task
39 |
40 | ```
41 | mkdir transcriptomics-task #creates folder
42 | cd transcriptomics-task #enters into folder
43 | ```
44 |
45 | ## Retrieving the raw data and reference transcriptome
46 |
47 | **NOTE: This has been done for you already and the files are located at: `/fsys1/data/task4`**
48 | If you are curious about how this was done, see below; otherwise you can skip ahead to the next section, "Transcript quantification with Salmon".
49 |
50 | ### Download human reference transcriptome and create a Salmon index
51 |
52 | ```
53 | #download a pre-made reference transcriptome from Gencode
54 | wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz
55 | gunzip gencode.v29.transcripts.fa.gz
56 |
57 | #index your reference transcriptome so that it can be analyzed with `Salmon`
58 | salmon index -t gencode.v29.transcripts.fa -i gencode_v29_idx
59 |
60 | ```
61 |
62 | ### Download the RNA-seq dataset
63 |
64 | Next, we need the RNA-seq data (8 samples, each with forward and reverse reads, so 16 files in total) from the public EBI FTP site. This has also been downloaded for you, using the following code:
65 |
66 | ```
67 | #download the list of urls first
68 | wget https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task7/ftp-list.txt
69 |
70 | #download all of the .fastq files - NOTE: THIS STEP CAN TAKE A LONG TIME (~1 hr)
71 | wget -i ftp-list.txt
72 |
73 | ```
74 |
75 |
76 | ## Transcript quantification with Salmon
77 |
78 | Now, you are going to measure transcript abundance using `Salmon`. For a single sample with paired-end reads (e.g., `forward_reads.fastq.gz` and `reverse_reads.fastq.gz`), this could be done using the following line:
79 |
80 | ```
81 | #result will be output to "quants" folder
82 | # -p 6 means that six CPU threads will be used
83 | salmon quant -i gencode_v29_idx -l A -1 forward_reads.fastq.gz -2 reverse_reads.fastq.gz -p 6 -o quants
84 | ```
85 |
86 | But the above line is just an example for a single sample. Here is a bash script that will run `Salmon` on all 8 samples we have just downloaded. Run this bash script in your `transcriptomics-task` folder.
87 | ```
88 | #download bash script
89 | wget https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/master/task7/runSalmon.bash
90 |
91 | #run bash script. This may take a few hours...
92 | bash runSalmon.bash
93 |
94 | ```
95 |
96 | ## Exploring the transcript counts
97 |
98 | Now, the transcript expression levels have been quantified for each of your 8 samples. Look within the `quants` folder and examine the `quant.sf` files that you have produced for each sample.
99 |
100 | * Take note of which column contains the transcript id. You can do this by using `head -1` to look at the header of the `quant.sf` file.
101 | * Also take note of which column contains the TPM (transcripts per million) expression level.
102 |
103 | Suppose you are interested in the transcript "ENST00000379727.7".
104 |
105 | ```
106 | #go to your quants folder
107 | cd quants
108 |
109 | #inspect the expression levels for this transcript
110 | grep "ENST00000379727.7" */quant.sf
111 | ```
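The `grep` above prints whole rows; to see just the id and its TPM you can pipe it through `awk`. A minimal sketch on a mock `quant.sf` (the header shown is the usual Salmon column layout; the values are made up):

```shell
# Build a tiny mock quant.sf (real files have one row per transcript)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > quant.sf
printf 'ENST00000379727.7\t2021\t1854.0\t12.5\t230\n' >> quant.sf

# Column 1 is the transcript id; TPM is column 4
head -1 quant.sf

# Print only the id and TPM for the transcript of interest
grep "ENST00000379727.7" quant.sf | awk -F'\t' '{print $1, $4}'
```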
112 |
113 |  Q1 - Has this transcript's abundance increased, decreased, or stayed the same following smoke exposure?
114 |
115 | Support your answer using statistics. Perform a t-test comparing the expression level of this transcript between the 4 smoke-treated samples versus 4 control samples. Use any program of your choice to do so (R, excel, Google Sheets, etc.).
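If you prefer the command line, the t-statistic itself can even be sketched with `awk`. The eight numbers below are made-up TPM values (4 controls, then 4 TSE samples), not real data:

```shell
# Welch t-statistic for two groups of 4 (made-up TPM values)
echo "10 12 11 13 20 22 21 23" | awk '{
  n = 4
  for (i = 1; i <= n; i++) { s1 += $i; s2 += $(i+n) }
  m1 = s1/n; m2 = s2/n                       # group means
  for (i = 1; i <= n; i++) { v1 += ($i-m1)^2; v2 += ($(i+n)-m2)^2 }
  v1 /= (n-1); v2 /= (n-1)                   # sample variances
  printf "t = %.2f\n", (m1-m2)/sqrt(v1/n + v2/n)
}'
```

You would still look up the corresponding p-value, or simply run `t.test()` in R on the same numbers.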
116 |
117 |  Q2 - Is the difference statistically significant (p < 0.05)?
118 |
119 |
120 | ## Detecting differentially expressed genes (DEGs) in R
121 |
122 | Now that you have measured transcript abundance for all samples using `Salmon`, you can perform a differential expression analysis using a tool such as DESeq2 or edgeR.
123 |
124 | On your local machine, install the following R packages:
125 |
126 | ```
127 | tximport
128 | DESeq2
129 | EnhancedVolcano
130 | ```
131 |
132 | Now, on your local machine, open your terminal and download the quant files produced by Salmon:
133 |
134 | ```
135 | scp -r userid@genomics1.private.uwaterloo.ca:~/transcriptomics-task/quants/ .
136 | ```
137 |
138 | Now, open R and load packages and set working directory:
139 |
140 | ```
141 | #load required packages
142 | library(tximport)
143 | library(DESeq2)
144 | library(EnhancedVolcano)
145 |
146 | #go to your folder containing your quant files you just downloaded
147 | setwd("/path/to/quants")
148 |
149 | ```
150 |
151 | Download the Gencode transcript-to-gene-symbol (HGNC) mapping, which `tximport` will use as its `tx2gene` table
152 |
153 | ```
154 | system("wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.metadata.HGNC.gz")
155 | system("gunzip gencode.v29.metadata.HGNC.gz")
156 | genesymbols = read.delim("gencode.v29.metadata.HGNC")
157 | ```
158 |
159 | Read the quant files into R
160 |
161 | ```
162 | files = paste(list.dirs('.', recursive=FALSE),"/","quant.sf",sep='')
163 | #make sure to check your list of files to ensure that this step worked
164 |
165 | txi.salmon <- tximport(files, type = "salmon", tx2gene = genesymbols, ignoreAfterBar =T)
166 | ```
167 |
168 | Run DESeq2 to detect differentially expressed genes between the two categories
169 |
170 | ```
171 | meta = data.matrix(cbind(files,as.numeric(c(0,0,0,0,1,1,1,1))))
172 |
173 | #check that the first four samples are control samples (0) and the last four are TSE (1)
174 | meta
175 |
176 | colnames(meta) = c("filenames","category")
177 |
178 | dds <- DESeqDataSetFromTximport(txi.salmon, meta, ~as.factor(category)) # this is detecting DEGs between the "0" and "1" samples
179 | dds <- DESeq(dds)
180 |
181 | res <- results(dds, lfcThreshold=0.5,alpha=0.01)
182 | ```
183 |
184 | Examine the results
185 |
186 | ```
187 | summary(res)
188 |
189 | ```
190 |
191 |  Q3 - How many significant up- and down-expressed genes were detected?
192 |
193 |  Q4 - Produce a table of the top 10 differentially expressed genes along with their fold-changes and adjusted p-values. Also include the code you used to do so.
194 |
195 |
196 | Next, generate a volcano plot using the following code.
197 |
198 | ```
199 | EnhancedVolcano(res,
200 | lab = rownames(res),
201 | x = 'log2FoldChange',
202 | y = 'pvalue')
203 | ```
204 |
205 |  Q5 - Paste an image of a volcano plot.
206 |
207 |
208 |
209 | ---
210 |
211 | # ASSIGNMENT QUESTIONS
212 |
213 | The questions for this task are indicated by the lines marked Q1-Q5 above.
214 | Please submit your answers in the dropbox on LEARN.
215 |
216 |
217 |
218 |
--------------------------------------------------------------------------------
/task7/ftp-list.txt:
--------------------------------------------------------------------------------
1 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/008/SRR8451888/SRR8451888_1.fastq.gz
2 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/008/SRR8451888/SRR8451888_2.fastq.gz
3 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/007/SRR8451887/SRR8451887_1.fastq.gz
4 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/007/SRR8451887/SRR8451887_2.fastq.gz
5 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/006/SRR8451886/SRR8451886_1.fastq.gz
6 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/006/SRR8451886/SRR8451886_2.fastq.gz
7 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/005/SRR8451885/SRR8451885_1.fastq.gz
8 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/005/SRR8451885/SRR8451885_2.fastq.gz
9 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/004/SRR8451884/SRR8451884_1.fastq.gz
10 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/004/SRR8451884/SRR8451884_2.fastq.gz
11 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/003/SRR8451883/SRR8451883_1.fastq.gz
12 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/003/SRR8451883/SRR8451883_2.fastq.gz
13 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/002/SRR8451882/SRR8451882_1.fastq.gz
14 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/002/SRR8451882/SRR8451882_2.fastq.gz
15 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/001/SRR8451881/SRR8451881_1.fastq.gz
16 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR845/001/SRR8451881/SRR8451881_2.fastq.gz
17 |
--------------------------------------------------------------------------------
/task7/runSalmon.bash:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | for samp in SRR8451881 SRR8451882 SRR8451883 SRR8451884 SRR8451885 SRR8451886 SRR8451887 SRR8451888;
4 | do
5 | echo "Processing sample $samp"
6 | salmon quant -i /fsys1/data/task4/gencode_v29_idx -l A \
7 | -1 /fsys1/data/task4/${samp}_1.fastq.gz \
8 | -2 /fsys1/data/task4/${samp}_2.fastq.gz \
9 | -p 6 -o quants/${samp}_quant
10 | done
11 |
--------------------------------------------------------------------------------
/task8/README.md:
--------------------------------------------------------------------------------
1 | # Task8 - Analysis of 16S amplicon sequencing data using Kraken2/Bracken
2 |
3 | In this lab, you will be analyzing 16S and metagenomic data from a study by Lobb et al. (2020) [here](https://pubmed.ncbi.nlm.nih.gov/32345738/).
4 | This study examined the microbial communities of decomposing fish in local rivers near Waterloo, ON, Canada.
5 |
6 | There are 48 16S rRNA samples with the following metadata.
7 |
8 | | Sample ID | Name |
9 | | --------------- | --------------- |
10 | | SRS6112281 | WW1.1 |
11 | | SRS6112282 | WW1.2 |
12 | | SRS6112283 | WW1.3 |
13 | | SRS6112267 | EE1.1 |
14 | | SRS6112268 | EE1.2 |
15 | | SRS6112279 | EE1.3 |
16 | | SRS6112285 | WW4.1 |
17 | | SRS6112284 | WW4.2 |
18 | | SRS6112286 | WW4.3 |
19 | | SRS6112293 | WE4.1 |
20 | | SRS6112295 | WE4.2 |
21 | | SRS6112296 | WE4.3 |
22 | | SRS6112290 | EE4.1 |
23 | | SRS6112302 | EE4.2 |
24 | | SRS6112304 | EE4.3 |
25 | | SRS6112271 | EW4.1 |
26 | | SRS6112272 | EW4.2 |
27 | | SRS6112273 | EW4.3 |
28 | | SRS6112287 | WW8.1 |
29 | | SRS6112288 | WW8.2 |
30 | | SRS6112289 | WW8.3 |
31 | | SRS6112297 | WE8.1 |
32 | | SRS6112298 | WE8.2 |
33 | | SRS6112299 | WE8.3 |
34 | | SRS6112305 | EE8.1 |
35 | | SRS6112306 | EE8.2 |
36 | | SRS6112307 | EE8.3 |
37 | | SRS6112274 | EW8.1 |
38 | | SRS6112275 | EW8.2 |
39 | | SRS6112276 | EW8.3 |
40 | | SRS6112291 | WW10.1 |
41 | | SRS6112292 | WW10.2 |
42 | | SRS6112294 | WW10.3 |
43 | | SRS6112300 | WE10.1 |
44 | | SRS6112301 | WE10.2 |
45 | | SRS6112303 | WE10.3 |
46 | | SRS6112308 | EE10.1 |
47 | | SRS6112269 | EE10.2 |
48 | | SRS6112270 | EE10.3 |
49 | | SRS6112277 | EW10.1 |
50 | | SRS6112278 | EW10.2 |
51 | | SRS6112280 | EW10.3 |
52 |
53 | Our goal will be to perform taxonomic profiling of these 16S rRNA datasets using Kraken2/Bracken.
54 |
55 | ### Requirements
56 |
57 | #### Command-line tools
58 | * Access to a linux-based OS running BASH
59 | * kraken2 and Bracken
60 | * metaspades
61 |
62 | #### Graphical tools
63 |
64 | You will also need to download and install [R](https://www.r-project.org) on your own machine with the following packages
65 |
66 | * ggplot2
67 | * pheatmap
68 | * reshape2
69 | * viridisLite
70 |
71 |
72 | ## Getting Started
73 |
74 | * Login to your linux environment and create a new folder for your task8
75 |
76 | ```
77 | mkdir task8 #creates folder
78 | cd task8 #enters into folder
79 | ```
80 |
81 | ## Retrieving the raw data
82 |
83 | The data has already been downloaded for you, and is located in the `/fsys1/data/lobb-et-al/` folder
84 |
85 | If you're curious, the original data was downloaded from the NCBI SRA using this command:
86 | ```
87 | fastq-dump --split-files SRS6112303 SRS6112301 SRS6112300 SRS6112299 SRS6112298 SRS6112297 SRS6112296 SRS6112295 SRS6112293 SRS6112294 SRS6112292 SRS6112291 SRS6112289 SRS6112288 SRS6112287 SRS6112286 SRS6112284 SRS6112285 SRS6112283 SRS6112282 SRS6112281 SRS6112280 SRS6112278 SRS6112277 SRS6112276 SRS6112275 SRS6112274 SRS6112273 SRS6112272 SRS6112271 SRS6112270 SRS6112269 SRS6112308 SRS6112307 SRS6112306 SRS6112305 SRS6112304 SRS6112302 SRS6112290 SRS6112279 SRS6112268 SRS6112267 SRS6098991 SRS6098990 SRS6098989 SRS6098988 SRS6098999 SRS6098998 SRS6098997 SRS6098996 SRS6098995 SRS6098994 SRS6098993 SRS6098992 SRS6098987 SRS6098986
88 | ```
89 |
90 |
91 | ## Quality filtering
92 |
93 | We have previously covered the use of tools such as `fastqc` and `trimmomatic` to quality filter our dataset. QC is a required step for any high-throughput sequencing pipeline, but for simplicity we will skip it for the purposes of this tutorial.
94 |
95 | ## Taxonomic classification of 16S reads using Kraken2
96 |
97 | ### Analyzing single samples
98 |
99 | Tools such as `QIIME2` and `Mothur` are commonly used for analyzing 16S rRNA sequences. For this tutorial, we will be using a different tool called `Kraken2`.
100 |
101 | Suppose we wanted to analyze a single sample (e.g., SRS6112303). We can do so with the following Kraken2 command:
102 |
103 | ```
104 | CLASSIFICATION_LVL=G # this will set an environmental variable for the taxonomic level of classification desired (G = "Genus", S = "Species", etc.)
105 | krakenDB=/data/krakendb/16S_Greengenes_k2db/ #this is the location of the kraken2 database you want to use for classification
106 | fastq1=/fsys1/data/lobb-et-al/SRS6112303_1.fastq
107 | fastq2=/fsys1/data/lobb-et-al/SRS6112303_2.fastq
108 |
109 | kraken2 --db $krakenDB --paired --report report.txt --output kraken.out $fastq1 $fastq2
110 |
111 | bracken -d $krakenDB -l $CLASSIFICATION_LVL -i report.txt -o bracken.out
112 | ```
113 |
114 | `bracken.out` will look like this:
115 |
116 | ```
117 | name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
118 | Parabacteroides distasonis 2601 S 8 53 61 0.00554
119 | Parabacteroides gordonii 2602 S 1 37 38 0.00347
120 | Bacteroides fragilis 2596 S 11 2044 2055 0.18386
121 | Bacteroides uniformis 2599 S 1 9 10 0.00094
122 | Clostridium pasteurianum 2769 S 278 346 624 0.05586
123 | Clostridium perfringens 2770 S 146 986 1132 0.10128
124 | Clostridium subterminale 2772 S 78 2011 2089 0.18691
125 | Clostridium bowmanii 2764 S 13 64 77 0.00692
126 | Clostridium butyricum 2765 S 10 102 112 0.01009
127 | Clostridium neonatale 2768 S 3 33 36 0.00328
128 | Alkaliphilus transvaalensis 2762 S 3 1 4 0.00037
129 | Sporomusa polytropa 2800 S 4 98 102 0.00913
130 | Veillonella dispar 2801 S 22 255 277 0.02478
131 | ...
132 | ...
133 |
134 | ```
135 |
136 | ### Analyzing many samples
137 |
138 | But these commands run Kraken2/Bracken on only a single sample. What do we do if we want to run them on all samples?
139 |
140 | First, you need to have a list of the samples you want to analyze. This has been done for you with the file at `/fsys1/data/lobb-et-al/files.txt`.
141 |
142 | Then, we will create a bash script called `runAll.bash` with the following contents.
143 |
144 | ```
145 | #!/bin/bash
146 |
147 | # $1 is the file containing the list of samples
148 | # $2 is the classification level
149 |
150 | CLASSIFICATION_LVL=$2
151 |
152 | while IFS=$'\t' read sample
153 | do
154 | echo "processing sample $sample"
155 |
156 | kraken2 --db /data/krakendb/16S_Greengenes_k2db/ --paired --report $sample.$CLASSIFICATION_LVL.kraken --output $sample.$CLASSIFICATION_LVL.kraken.out /fsys1/data/lobb-et-al/${sample}_1.fastq /fsys1/data/lobb-et-al/${sample}_2.fastq
157 |
158 | bracken -d /data/krakendb/16S_Greengenes_k2db -l $CLASSIFICATION_LVL -i $sample.$CLASSIFICATION_LVL.kraken -o $sample.$CLASSIFICATION_LVL.bracken.out
159 |
160 | done < $1
161 | ```
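The `while ... read ... done < file` idiom is the key piece of the script above. Here is a minimal, runnable sketch with `echo` standing in for the kraken2/bracken calls and a made-up sample list:

```shell
# Make a small sample list (stand-in for files.txt)
printf 'SRS6112281\nSRS6112282\n' > samples.txt

# Read one sample id per line, exactly as runAll.bash does
while IFS=$'\t' read -r sample
do
  echo "processing sample $sample"
done < samples.txt
```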
162 |
163 | Let's now run `runAll.bash` to apply Kraken2 and Bracken to all of our 16S samples.
164 |
165 | ```
166 | #first let's create a new folder
167 | mkdir order_classification
168 |
169 | cd order_classification
170 |
171 | # this will perform taxonomic classification at the Order level
172 | bash ../runAll.bash /fsys1/data/lobb-et-al/files.txt O # assumes runAll.bash is in the parent folder; now wait a while....
173 |
174 | ```
175 |
176 | Once completed, you will see that your folder is full of `.bracken.out` output files.
177 | To merge these together into a single file containing bracken output for all your samples, do the following:
178 |
179 |
180 | ```
181 | python2.7 /usr/local/bin/Bracken-2.5/analysis_scripts/combine_bracken_outputs.py --files $(ls *.bracken.out) -o combined.order.out
182 | ```
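Conceptually, this merge joins the per-sample tables on the taxon name. A toy sketch with `join` on two hypothetical two-column count files (the real script also carries the `_num`/`_frac` columns and sample labels):

```shell
# Two hypothetical per-sample count tables, already sorted on the taxon column
printf 'Bacteroidales\t120\nClostridiales\t300\n' > sampleA.counts
printf 'Bacteroidales\t80\nClostridiales\t410\n' > sampleB.counts

# Join on column 1: one row per taxon, with both samples' counts
join -t $'\t' sampleA.counts sampleB.counts
```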
183 |
184 |
185 | ## Plotting results in R
186 |
187 | Now, download your `combined.order.out` file to your local computer and load [R](https://www.r-project.org/) for further analysis and plotting.
188 |
189 | First load these libraries. Install these first if they are not already installed.
190 |
191 | ```
192 | library(ggplot2)
193 | library(reshape2)
194 | library(viridisLite)
195 | ```
196 |
197 | Next, load your data
198 |
199 | ```
200 | tb = read.delim("combined.order.out",header=T,row.names=1)
201 |
202 | #Bracken output has _frac and _num columns. We will just analyze the _num columns.
203 | tbp = tb[,grep("_num",colnames(tb))]
204 |
205 | #Transpose the table
206 | tbp <- t(tbp)
207 |
208 | #Convert to proportions
209 | tb_prop<-as.data.frame(round(prop.table(as.matrix(tbp), 1) * 100,1))
210 |
211 | #Choose a selection of taxa with a % > 3 (Note: might have to play around with this until you get a reasonable number of taxa to display)
212 | tb_sub <- tb_prop[,apply(tb_prop, 2, function(x) max(x, na.rm = TRUE))>3]
213 |
214 | ```
215 |
216 | For plotting, we have to do a few more modifications to the data matrix
217 | ```
218 | #Melt the dataframe for plotting
219 | tbm <- as.data.frame(melt(as.matrix(tb_sub)))
220 |
221 | #fix labels
222 | tbm[,1] = within(tbm, Var1<-data.frame(do.call('rbind', strsplit(as.character(Var1), '.', fixed=TRUE))))[,1][,1]
223 |
224 | #Turn 0s into NAs
225 | tbm[tbm == 0] <- NA
226 |
227 | #Set the order of the taxa on the plot (Note: optional)
228 | tbm$Var2 <- factor(tbm$Var2, levels = row.names(as.table(sort(colMeans(tb_sub)))))
229 |
230 | ```
231 |
232 | Now, we can plot using `ggplot2`. Note: the following ggplot command sets many cosmetic parameters; a much simpler call would also work.
233 |
234 | ```
235 | ggplot(tbm, aes(Var1, Var2, size = value, fill = value)) + geom_point(shape = 21, alpha = 0.4) + ggtitle("") + xlab("") + ylab("") + theme(axis.text = element_text(colour = "black", size = 12), text = element_text(size = 15), axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + scale_size_area(max_size = 15, guide = "none") + labs(fill = "Relative\nfrequency (%)") + scale_fill_viridis_c()
236 | ```
237 | This should produce the following plot:
238 |
239 | 
240 |
241 | We can also create a barplot by doing the following:
242 |
243 | ```
244 | ggplot(tbm, aes(fill=Var2, y=value, x=Var1)) +
245 | geom_bar(position="fill", stat="identity", col="grey50") +
246 | scale_y_continuous(labels=scales::percent) +
247 | xlab("") + ylab("Relative frequency") + labs(fill="Order") +
248 | theme(axis.text.x= element_text(angle = 90, hjust = 1))
249 |
250 | ```
251 |
252 | ... which should produce:
253 |
254 | 
255 |
256 |
257 | ## Adding in metadata annotations
258 |
259 | The last plot contains sample names, but let's replace these names with annotations from the metadata that are more informative.
260 |
261 | First, make sure you have a `metadata.txt` text file that contains the 48 samples (column 1) and names (column 2) listed at the top of the page.
262 | It should look like this:
263 |
264 | ```
265 | SRS6112281 WW1.1
266 | SRS6112282 WW1.2
267 | SRS6112283 WW1.3
268 | SRS6112267 EE1.1
269 | SRS6112268 EE1.2
270 | SRS6112279 EE1.3
271 | ...
272 | ```
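One quick way to generate this file from the shell (only the first six rows are shown; extend the argument list to cover all 48 samples from the table at the top of the page):

```shell
# Tab-separated sample-id / name pairs, written two fields per row
printf '%s\t%s\n' \
  SRS6112281 WW1.1 \
  SRS6112282 WW1.2 \
  SRS6112283 WW1.3 \
  SRS6112267 EE1.1 \
  SRS6112268 EE1.2 \
  SRS6112279 EE1.3 > metadata.txt

# Sanity check: six rows so far
wc -l metadata.txt
```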
273 |
274 | Now, load in your metadata
275 | ```
276 | metadata = read.table("metadata.txt")
277 | ```
278 |
279 | Now, let's subset our data matrix to include only the metadata samples, and let's also re-order the variables so that they plot in the desired order
280 |
281 | ```
282 | #subset tbm to only include those samples in metadata
283 | tbm = tbm[which(tbm[,1] %in% metadata[,1]),]
284 | tbm[,1] = metadata[match(tbm[,1],metadata[,1]),2]
285 | tbm$Var1 = factor(tbm$Var1,levels= metadata[,2])
286 |
287 | ggplot(tbm, aes(fill=Var2, y=value, x=Var1)) +
288 | geom_bar(position="fill", stat="identity", col="grey50") +
289 | scale_y_continuous(labels=scales::percent) +
290 | xlab("") + ylab("Relative frequency") + labs(fill="Order") +
291 | theme(axis.text.x= element_text(angle = 90, hjust = 1))
292 | ```
293 |
294 | This should produce:
295 |
296 | 
297 |
298 |
299 | How does this result compare to the result from Lobb et al. (2020) [here](https://pubmed.ncbi.nlm.nih.gov/32345738/) ?
300 |
301 |
302 | Lastly, let's create a heatmap and add in an annotation category
303 |
304 | ```
305 | library(pheatmap)
306 |
307 | # convert the tbm table back to a 2D matrix
308 | tb = acast(tbm, Var1 ~ Var2,value.var='value',fill=0)
309 | tb = t(tb) #transpose
310 |
311 | # let's split the names (EE, WW, etc.) into a matrix that we can use as annotations
312 | annot = data.frame(do.call("rbind", strsplit(as.character(metadata[,2]), "", fixed = TRUE)))[,c(1,2)]
313 | rownames(annot) = metadata[,2]
314 | colnames(annot) = c("Fish_Origin","Water_Origin")
315 |
316 | # specify the colors
317 | ann_colors = list(
318 | Fish_Origin = c(W = "#EBEBEB", E = "#424242"),
319 | Water_Origin = c(W = "#EBEBEB", E = "#424242")
320 | )
321 |
322 | # plot
323 | pheatmap(tb,annotation_col=annot,cluster_cols=F,annotation_colors=ann_colors,color = viridis(1000))
324 | ```
325 |
326 | This should produce:
327 |
328 | 
329 |
330 |
331 |
332 |
333 |
334 |
--------------------------------------------------------------------------------
/task8/barplot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/barplot1.png
--------------------------------------------------------------------------------
/task8/barplot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/barplot2.png
--------------------------------------------------------------------------------
/task8/bubbleplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/bubbleplot.png
--------------------------------------------------------------------------------
/task8/bubbleplot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/bubbleplot1.png
--------------------------------------------------------------------------------
/task8/pheatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task8/pheatmap.png
--------------------------------------------------------------------------------
/task9/README.md:
--------------------------------------------------------------------------------
1 | # Task9 - Analysis of human SNP variation from 1000 genomes data
2 |
3 | In this lab, you will be analyzing available data from the 1000 genomes project - https://en.wikipedia.org/wiki/1000_Genomes_Project
4 |
5 | You will extract variants from a region of a chromosome of interest and visualize SNP variation across human populations in R.
6 |
7 | ### Requirements
8 |
9 | #### Command-line tools
10 | * Access to a linux-based OS running BASH
11 | * Tabix
12 | * vcftools
13 | * R
14 |
15 |
16 | # Linux command-line
17 |
18 | ## Getting Started
19 |
20 | * Login to your linux environment and create a new folder for your task9
21 |
22 | ```
23 | mkdir task9 #creates folder
24 | cd task9 #enters into folder
25 | ```
26 |
27 | ## Download the VCF file for your chromosome of interest
28 |
29 | e.g., below we will download chromosome 12
30 |
31 | ```
32 | wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
33 |
34 | wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/release/20130502/ALL.chr12.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi
35 | ```
36 |
37 | ## Download the reference genome (optional, if needed)
38 | ```
39 | wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
40 |
41 | gunzip human_g1k_v37.fasta.gz
42 |
43 | wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
44 | ```
45 |
46 |
47 | ## Use `tabix` to extract the region of interest from the chromosome
48 |
49 | e.g., suppose we are interested in the variants found across the 1000-bp region 49687909-49688909 of chromosome 12
50 |
51 | ```
52 | tabix -fh ALL.chr12.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz 12:49687909-49688909 > region.vcf
53 | ```
54 |
55 | ## Convert the VCF file to a tab-separated file
56 | ```
57 | cat region.vcf | vcf-to-tab > region.tab
58 | ```
59 |
60 | * How many SNPs were detected?
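One way to count them is to count the non-header lines, since each data line in a VCF is one variant record. A sketch on a mock file named `mock.vcf` (made-up coordinates; run the same `grep` on your real `region.vcf`):

```shell
# Mock VCF: header lines start with "#", each remaining line is one variant
printf '##fileformat=VCFv4.1\n' > mock.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> mock.vcf
printf '12\t49687950\trs1\tA\tG\n' >> mock.vcf
printf '12\t49688100\trs2\tC\tT\n' >> mock.vcf
printf '12\t49688500\trs3\tG\tA\n' >> mock.vcf

# -v inverts the match, -c counts: number of variant records
grep -vc '^#' mock.vcf
```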
61 |
62 | ## More links
63 |
64 | * https://vcftools.github.io/perl_examples.html
65 |
66 |
67 | # Data analysis in R
68 |
69 | ## Loading the region.tab data and visualizing it as a heatmap
70 |
71 | ```
72 | # Load required library
73 | if (!require("pheatmap")) {
74 | install.packages("pheatmap")
75 | library(pheatmap)
76 | }
77 |
78 | # Read the header line separately to get the sample names without modification
79 | header_line <- readLines("region.tab", n = 1)
80 | header_parts <- unlist(strsplit(header_line, "\t"))
81 |
82 | # Extract the sample names (from the 4th column onward)
83 | sample_names <- header_parts[4:length(header_parts)]
84 |
85 | # Load the rest of the data
86 | data <- read.table("region.tab", header = TRUE, sep = "\t", stringsAsFactors = FALSE, check.names = FALSE, comment.char = "#")
87 |
88 |
89 | #Load the metadata
90 |
91 | metadata <- read.delim("igsr_samples.tsv",sep='\t',header=T)
92 |
93 | # Define the most common allele for each position and use it as the reference
94 | presence_absence_matrix <- apply(data[, 4:ncol(data)], 1, function(genotypes) {
95 | # Calculate the most common allele
96 | most_common_allele <- names(sort(table(genotypes), decreasing = TRUE))[1]
97 | # Mark as 1 if different from the most common allele, else 0
98 | ifelse(genotypes == most_common_allele, 0, 1)
99 | })
100 |
101 | # Transpose the matrix to have samples as columns
102 | presence_absence_matrix <- t(presence_absence_matrix)
103 |
104 | # Set row and column names for clarity
105 | rownames(presence_absence_matrix) <- paste(data$CHROM, data$POS, sep = ":")
106 | colnames(presence_absence_matrix) <- sample_names
107 |
108 | # Filter out rows with no variation (all 0s) and rows with all 1s
109 | presence_absence_matrix_filtered <- presence_absence_matrix[rowSums(presence_absence_matrix) > 0 & rowSums(presence_absence_matrix) < ncol(presence_absence_matrix), ]
110 |
111 | pop <- metadata[match(sample_names,metadata[,1]),6]
112 |
113 |
114 | # Create a data frame for annotation
115 | annotation_df <- data.frame(Population = pop)
116 | rownames(annotation_df) <- sample_names
117 |
118 |
119 | # Plot heatmap with annotation
120 | pheatmap(
121 | t(presence_absence_matrix_filtered),
122 | color = colorRampPalette(c("white", "blue"))(100), # Use a gradient
123 | main = "Presence-Absence Heatmap for Variant Sites",
124 | cluster_rows = TRUE,
125 | cluster_cols = FALSE,
126 | display_numbers = FALSE,
127 | fontsize_row = 1,
128 | fontsize_col = 6,
129 | annotation_row = annotation_df
130 | )
131 |
132 |
133 | ```
134 | 
135 |
136 |
137 |
138 | ## Plotting the frequencies of specific SNPs per population
139 |
140 | ```
141 | # Define the site of interest
142 | site <- 6
143 |
144 | # Get indices for samples with reference and variant alleles
145 | withRefBase <- which(presence_absence_matrix_filtered[site, ] == 0)
146 | withVariant <- which(presence_absence_matrix_filtered[site, ] == 1)
147 |
148 | # Get population labels for each sample
149 | populations <- pop
150 |
151 | # Count the number of reference and variant alleles in each population
152 | ref_counts <- table(populations[withRefBase])
153 | variant_counts <- table(populations[withVariant])
154 |
155 | # Combine the counts into a data frame for plotting
156 | allele_counts <- data.frame(
157 | Population = unique(populations),
158 | Reference = sapply(unique(populations), function(p) ref_counts[p]),
159 | Variant = sapply(unique(populations), function(p) variant_counts[p])
160 | )
161 |
162 | # Replace NA with 0 where counts are missing
163 | allele_counts[is.na(allele_counts)] <- 0
164 |
165 | # Reshape the data for plotting with ggplot2
166 | library(reshape2)
167 | allele_counts_long <- melt(allele_counts, id.vars = "Population", variable.name = "Allele", value.name = "Count")
168 |
169 | # Plot the frequencies
170 | library(ggplot2)
171 | ggplot(allele_counts_long, aes(x = Population, y = Count, fill = Allele)) +
172 | geom_bar(stat = "identity", position = "dodge") +
173 | labs(title = paste("Allele Frequency at Site", site),
174 | x = "Population",
175 | y = "Frequency",
176 | fill = "Allele Type") +
177 | theme_minimal() +
178 | theme(axis.text.x = element_text(angle = 45, hjust = 1))
179 | ```
180 |
181 |
182 | 
183 |
--------------------------------------------------------------------------------
/task9/SNP-heatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task9/SNP-heatmap.png
--------------------------------------------------------------------------------
/task9/alleleFreq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/doxeylab/learn-genomics-in-linux/1f76ebd585b50a9b0846a345ea5f3a6c03f77e40/task9/alleleFreq.png
--------------------------------------------------------------------------------