├── .gitignore ├── mako ├── templates │ ├── hello.txt │ └── binning │ │ ├── index.rst │ │ ├── setup.rst │ │ ├── phylosift.rst │ │ └── concoct.rst └── settings.yaml ├── source ├── assembly │ ├── index.rst │ ├── qtrim.rst │ ├── map.rst │ ├── assembly.rst │ └── reqs.rst ├── binning │ ├── index.rst │ ├── setup.rst │ ├── phylosift.rst │ └── concoct.rst ├── comparative-taxonomic-analysis │ ├── index.rst │ ├── krona.rst │ ├── rrna.rst │ └── compare.rst ├── annotation │ ├── index.rst │ ├── annotation.rst │ ├── translation.rst │ ├── software.rst │ ├── normalization.rst │ ├── metaxa2.rst │ └── differential.rst ├── comparative-functional-analysis │ ├── index.rst │ ├── annotation.rst │ ├── genefinding.rst │ ├── genecoverage.rst │ └── compare.rst ├── index.rst └── conf.py ├── run_mako.py ├── README.rst └── Makefile /.gitignore: -------------------------------------------------------------------------------- 1 | build 2 | -------------------------------------------------------------------------------- /mako/templates/hello.txt: -------------------------------------------------------------------------------- 1 | hello world! 2 | -------------------------------------------------------------------------------- /source/assembly/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Assembly Workshop 2 | ================================================= 3 | In this metagenomics workshop we will learn how to: 4 | 5 | - Quality trim reads with sickle 6 | - Perform assemblies with velvet 7 | - Map back reads to assemblies with bowtie2 8 | 9 | The workshop has the following exercises: 10 | 11 | .. toctree:: 12 | :maxdepth: 2 13 | 14 | reqs 15 | qtrim 16 | assembly 17 | map 18 | 19 | At least a basic knowledge of how to work with the command line is required 20 | otherwise it will be very difficult to follow some of the examples. Have 21 | fun! 22 | -------------------------------------------------------------------------------- /mako/templates/binning/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Binning Workshop 2 | ================================================= 3 | In this metagenomics workshop we will learn how to: 4 | 5 | - Perform unsupervised binning with concoct 6 | - Evaluate binning performance 7 | 8 | The workshop has the following exercises: 9 | 10 | 1. Setup Environment 11 | 2. Running Concoct 12 | 3. Evaluate Clustering Using Single Copy Genes 13 | 4. Phylogenetic Classification using Phylosift 14 | 15 | At least a basic knowledge of how to work with the command line is required 16 | otherwise it will be very difficult to follow some of the examples. Have 17 | fun! 18 | 19 | Contents: 20 | 21 | .. toctree:: 22 | :maxdepth: 2 23 | 24 | setup 25 | concoct 26 | phylosift 27 | -------------------------------------------------------------------------------- /source/binning/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Binning Workshop 2 | ================================================= 3 | In this metagenomics workshop we will learn how to: 4 | 5 | - Perform unsupervised binning with concoct 6 | - Evaluate binning performance 7 | 8 | The workshop has the following exercises: 9 | 10 | 1. Setup Environment 11 | 2. Running Concoct 12 | 3. Evaluate Clustering Using Single Copy Genes 13 | 4. 
Phylogenetic Classification using Phylosift 14 | 15 | At least a basic knowledge of how to work with the command line is required 16 | otherwise it will be very difficult to follow some of the examples. Have 17 | fun! 18 | 19 | Contents: 20 | 21 | .. toctree:: 22 | :maxdepth: 2 23 | 24 | setup 25 | concoct 26 | phylosift 27 | 28 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/index.rst: -------------------------------------------------------------------------------- 1 | ======================================================= 2 | Comparative Taxonomic Analysis Workshop 3 | ======================================================= 4 | In this workshop we will do a comparative analysis of several Baltic Sea 5 | samples on taxonomy. The following topics will be discussed: 6 | 7 | - Taxonomic annotation of rRNA reads using sortmeRNA 8 | - Visualizing taxonomy with KRONA 9 | - Comparative taxonomic analysis in R 10 | 11 | The workshop has the following exercises: 12 | 13 | 1. rRNA Annotation Exercise 14 | 2. KRONA Exercise 15 | 3. Comparative Taxonomic Analysis Exercise 16 | 17 | Contents: 18 | 19 | .. toctree:: 20 | :maxdepth: 2 21 | 22 | rrna 23 | krona 24 | compare 25 | -------------------------------------------------------------------------------- /source/annotation/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Annotation Workshop 2 | ================================================= 3 | In this part of the metagenomics workshop we will: 4 | 5 | - Translate nucleotides into amino acid sequences using EMBOSS 6 | - Annotate metagenomic reads with Pfam domains 7 | - Discuss and perform normalization of metagenomic counts 8 | - Take a look at different gene abundance analyses 9 | 10 | The workshop has the following exercises: 11 | 12 | 1. Translation Exercise 13 | 2. HMMER Exercise 14 | 3. Normalization Exercise 15 | 4. Differential Exercise 16 | 5. Bonus Exercise: Metaxa2 17 | 18 | At least a basic knowledge of how to work with the command line is required 19 | otherwise it will be very difficult to follow some of the examples. Have 20 | fun! 21 | 22 | Contents: 23 | 24 | .. toctree:: 25 | :maxdepth: 2 26 | 27 | software 28 | translation 29 | annotation 30 | normalization 31 | differential 32 | metaxa2 33 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/index.rst: -------------------------------------------------------------------------------- 1 | ======================================================= 2 | Comparative Functional Analysis Workshop 3 | ======================================================= 4 | In this workshop we will do a comparative analysis of several Baltic Sea 5 | samples on function. The following topics will be discussed: 6 | 7 | - Find genes on a coassembly of all samples using Prodigal 8 | - Classify the found genes with similar function in Clusters of Orthologous 9 | Groups (COG) using WebMGA 10 | - Compare the expression of the different COG families and classes by looking 11 | at their coverage in different samples using R 12 | 13 | The workshop has the following exercises: 14 | 15 | 1. Gene Finding Exercise 16 | 2. COG Exercise 17 | 3. Comparative Functional Analysis Exercise 18 | 19 | Contents: 20 | 21 | .. 
toctree:: 22 | :maxdepth: 2 23 | 24 | genefinding 25 | annotation 26 | genecoverage 27 | compare 28 | -------------------------------------------------------------------------------- /source/index.rst: -------------------------------------------------------------------------------- 1 | .. Metagenomics Workshop SciLifeLab documentation master file, created by 2 | sphinx-quickstart on Tue May 6 17:38:39 2014. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to the Metagenomics Workshop at SciLifeLab, Stockholm 7 | ============================================================= 8 | 9 | This is a three day metagenomics workshop. We will discuss assembly, binning 10 | and annotation of metagenomic samples. 11 | 12 | Program: 13 | 14 | * Day 1 Time 14-17 15 | * :doc:`assembly/index` 16 | * :doc:`comparative-functional-analysis/index` 17 | * :doc:`comparative-taxonomic-analysis/index` 18 | * Day 2 Time 14-17 19 | * :doc:`binning/index` 20 | * Day 3 Time 13.30-17 21 | * :doc:`annotation/index` 22 | 23 | 24 | Contents: 25 | 26 | .. toctree:: 27 | :maxdepth: 2 28 | 29 | assembly/index 30 | comparative-functional-analysis/index 31 | comparative-taxonomic-analysis/index 32 | binning/index 33 | annotation/index 34 | 35 | Enjoy! 36 | 37 | For questions: 38 | 39 | * Johannes Alneberg at sclifelab dot se 40 | * Johan dot Bengtsson-Palme at gu dot se 41 | * Ino de Bruijn at bils dot se 42 | * Luisa Hugerth at scilifelab dot se 43 | 44 | The code that generates this workshop is available at: 45 | 46 | https://github.com/inodb/2014-5-metagenomics-workshop 47 | -------------------------------------------------------------------------------- /source/binning/setup.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Setup Environment 3 | ========================================== 4 | This workshop will be using the same environment used for the assembly workshop. If you did not participate in the assembly workshop, please have a look at the introductory setup description for that. 5 | 6 | Programs used in this workshop 7 | ============================== 8 | The following programs are used in this workshop: 9 | 10 | - CONCOCT_ 11 | - Phylosift_ 12 | - Blast_ 13 | 14 | .. _CONCOCT: http://github.com/BinPro/CONCOCT 15 | .. _Phylosift: http://phylosift.wordpress.com/ 16 | .. _BLAST: http://blast.ncbi.nlm.nih.gov/ 17 | 18 | All programs and scripts that you need for this workshop are already installed, all you have to do is load the virtual 19 | environment. Once you are logged in to the server run:: 20 | 21 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 22 | 23 | If you'd wish to inactivate this virtual environment you could run:: 24 | 25 | deactivate # Don't run this now 26 | 27 | NOTE: This is a python virtual environment. The binary folder of the virtual 28 | environment has symbolic links to all programs used in this workshop so you 29 | should be able to run those without problems. 30 | 31 | Check that the programs are available 32 | ===================================== 33 | After you have activated the virtual environment the following commands should execute properly and you should be able to see some brief instructions on how to run the different programs respectively. 
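If you would rather check everything in one go, a short shell loop over the expected binaries works as well (a sketch only; the two program names are taken from the checks just below, and anything not on your ``PATH`` will be reported as missing)::

    for prog in concoct rpsblast
    do
        if which "$prog" > /dev/null
        then echo "$prog: found"
        else echo "$prog: NOT found"
        fi
    done
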
34 | 35 | CONCOCT:: 36 | 37 | concoct -h 38 | 39 | 40 | BLAST:: 41 | 42 | rpsblast --help 43 | 44 | 45 | -------------------------------------------------------------------------------- /mako/templates/binning/setup.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Setup Environment 3 | ========================================== 4 | This workshop will be using the same environment used for the assembly workshop. If you did not participate in the assembly workshop, please have a look at the introductory setup description for that. 5 | 6 | Programs used in this workshop 7 | ============================== 8 | The following programs are used in this workshop: 9 | 10 | - CONCOCT_ 11 | - Phylosift_ 12 | - Blast_ 13 | 14 | .. _CONCOCT: http://github.com/BinPro/CONCOCT 15 | .. _Phylosift: http://phylosift.wordpress.com/ 16 | .. _BLAST: http://blast.ncbi.nlm.nih.gov/ 17 | 18 | All programs and scripts that you need for this workshop are already installed, all you have to do is load the virtual 19 | environment. Once you are logged in to the server run:: 20 | 21 | ${commands['activate']} 22 | 23 | If you'd wish to inactivate this virtual environment you could run:: 24 | 25 | deactivate # Don't run this now 26 | 27 | NOTE: This is a python virtual environment. The binary folder of the virtual 28 | environment has symbolic links to all programs used in this workshop so you 29 | should be able to run those without problems. 30 | 31 | Check that the programs are available 32 | ===================================== 33 | After you have activated the virtual environment the following commands should execute properly and you should be able to see some brief instructions on how to run the different programs respectively. 34 | 35 | CONCOCT:: 36 | 37 | ${commands['check_activate']['concoct']} 38 | 39 | 40 | BLAST:: 41 | 42 | ${commands['check_activate']['rpsblast']} 43 | 44 | -------------------------------------------------------------------------------- /run_mako.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function 3 | from mako.template import Template 4 | from mako.lookup import TemplateLookup 5 | import yaml 6 | import argparse 7 | import os 8 | 9 | FILE_PATH = os.path.dirname(__file__) 10 | 11 | TEMPLATES = ['binning/index.rst', 12 | 'binning/setup.rst', 13 | 'binning/concoct.rst', 14 | 'binning/phylosift.rst'] 15 | 16 | def main(args): 17 | with open(os.path.join(args.input_path, "settings.yaml")) as settings_file: 18 | settings = yaml.load(settings_file) 19 | 20 | lookup = TemplateLookup(directories=[args.template_path]) 21 | for t in TEMPLATES: 22 | template = lookup.get_template(t) 23 | print(template.render(**settings), 24 | file=open(os.path.join(args.output_path, t), 'w')) 25 | 26 | def sanitize_input(args): 27 | assert os.path.isdir(args.output_path) 28 | assert os.path.isdir(args.input_path) 29 | assert os.path.isdir(args.template_path) 30 | 31 | if __name__ == '__main__': 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument('-o', '--output_path', default='source/', 34 | help=('Path to where rendered templates will be written.' 35 | ' Any existing files with the same file names will ' 36 | 'be over written. Default = sources')) 37 | parser.add_argument('-i', '--input_path', default='mako/', 38 | help=('Path where the raw mako settings ' 39 | 'files are stored. 
Default = mako')) 40 | parser.add_argument('-t', '--template_path', default=None, 41 | help=('Path where the raw mako templates are stored. ' 42 | 'Default = <input_path>/templates')) 43 | args = parser.parse_args() 44 | if args.template_path is None: 45 | args.template_path = os.path.join(args.input_path, 'templates') 46 | sanitize_input(args) 47 | main(args) 48 | -------------------------------------------------------------------------------- /source/annotation/annotation.rst: -------------------------------------------------------------------------------- 1 | ================================================================ 2 | Search amino acid sequences with HMMER against the Pfam database 3 | ================================================================ 4 | It is time to do the actual Pfam annotation of our metagenomes! 5 | 6 | 7 | Running ``hmmsearch`` on the translated sequence data sets 8 | ========================================================== 9 | Before we run ``hmmsearch``, we will look at its available options:: 10 | 11 | hmmsearch -h 12 | 13 | As you will see, the program takes a substantial number of arguments. 14 | In this workshop we will work with the table output from HMMER, which 15 | you get by specifying the ``--tblout`` option together with a file 16 | name. We also want to make sure that we only get statistically 17 | relevant matches, which we can do using the E-value option. The 18 | E-value (Expect-value) is an estimation of how often we would expect 19 | to find a similar hit by chance, given the size of the database. To 20 | avoid getting a lot of noise matches, we will specify an E-value of 21 | 10^-5, that is, we would only expect to get a match with a similarly good 22 | alignment by chance in 1 out of 100,000 cases. This can be set with the ``-E 1e-5`` 23 | option. Finally, to speed up the process a little, we will use the 24 | ``--cpu`` option to get multi-core support. On the Uppmax machines you can 25 | use up to 16 cores for the HMMER runs. 26 | 27 | To specify the HMM-file database and the input data set, we just type in 28 | the names of those two files at the end of the command. Finally, we add in 29 | the ``> /dev/null`` string, to avoid getting the screen cluttered with 30 | sequence alignments that HMMER outputs. That should give us the following 31 | command:: 32 | 33 | hmmsearch --tblout <output table file> -E 1e-5 --cpu 8 ~/Pfam/Pfam-mobility.hmm <input file> > /dev/null 34 | 35 | Now run this command on each of the four translated files that we just created. When the 36 | command has finished for all files, we can move on to the normalization exercise. 37 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/krona.rst: -------------------------------------------------------------------------------- 1 | =============================== 2 | Visualising taxonomy with KRONA 3 | =============================== 4 | To get a graphical representation of the taxonomic classifications you can use 5 | KRONA, which is an excellent program for exploring data with hierarchical 6 | structures in general. The output file is an html file that can be viewed in a 7 | browser. 
Again make a directory for KRONA:: 8 | 9 | mkdir -p ~/metagenomics/cta/krona 10 | cd ~/metagenomics/cta/krona 11 | 12 | And run KRONA, concatenating the archaea and bacteria class files for each sample on the fly and providing the name of the sample, like this:: 13 | 14 | ktImportRDP \ 15 | <(cat ../rdp/0328_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0328_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0328 \ 16 | <(cat ../rdp/0403_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0403_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0403 \ 17 | <(cat ../rdp/0423_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0423_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0423 \ 18 | <(cat ../rdp/0531_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0531_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0531 \ 19 | <(cat ../rdp/0619_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0619_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0619 \ 20 | <(cat ../rdp/0705_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0705_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0705 \ 21 | <(cat ../rdp/0709_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0709_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0709 \ 22 | <(cat ../rdp/1001_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1001_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1001 \ 23 | <(cat ../rdp/1004_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1004_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1004 \ 24 | <(cat ../rdp/1028_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1028_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1028 \ 25 | <(cat ../rdp/1123_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1123_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1123 26 | 27 | The ``<()`` in bash can be used for process substitution 28 | (http://tldp.org/LDP/abs/html/process-sub.html). Just for your information, 29 | the above command was actually generated with the following commands:: 30 | 31 | cmd=`echo ktImportRDP; for s in ${samplenames[*]}; do echo '<('cat ../rdp/${s}_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/${s}_rrna.silva-bac-16s-database-id85.fasta.class.tsv')',$s; done` 32 | echo $cmd 33 | 34 | Copy the resulting file rdp.krona.html to your local computer with scp and open it in Firefox. 35 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/annotation.rst: -------------------------------------------------------------------------------- 1 | =============================== 2 | Functional annotation 3 | =============================== 4 | Now that we have extracted the genes/proteins, we want to functionally annotate 5 | them. There are a bunch of ways of doing this. We will use webMGA to do 6 | rpsBLAST searches against the COG database. COGs are clusters of orthologous 7 | genes, i.e. evolutionary counterparts in different species, usually with the 8 | same function (http://www.ncbi.nlm.nih.gov/COG/). Many COGs have known 9 | functions and the COGs are also grouped at a higher level into functional 10 | classes. 11 | 12 | To download the protein sequences that Prodigal generated, open a local 13 | terminal and type:: 14 | 15 | mkdir -p ~/metagenomics/cfa/prodigal 16 | cd ~/metagenomics/cfa/prodigal 17 | scp username@milou.uppmax.uu.se:~/metagenomics/cfa/prodigal/baltic-sea-ray-noscaf-41.1000.aa.fa . 
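Before uploading anything, it can be worth a quick sanity check that the file arrived intact, for example by counting the fasta headers on your own computer and on milou and comparing the two numbers (a sketch; ``grep -c '>'`` simply counts the sequences in a fasta file)::

    grep -c '>' baltic-sea-ray-noscaf-41.1000.aa.fa
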
18 | 19 | To get COG classifications of your proteins, go to webMGA 20 | http://weizhong-lab.ucsd.edu/metagenomic-analysis/ and select Server / 21 | Function annotation / COG. Upload the protein file 22 | (``baltic-sea-ray-noscaf-41.1000.aa.fa``) and use the default -e value cutoff. 23 | rpsBLAST is used, which is a BLAST based on position specific scoring matrices 24 | (pssm). For each COG, one such pssm has been constructed. These are compiled 25 | into a database of profiles that is searched against. 26 | http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/node20.html. rpsBLAST is 27 | more sensitive than a normal BLAST, which is important if genomes in your 28 | metagenome are distant from existing sequences in databases. It is also faster 29 | than searching against all proteins out there. 30 | 31 | When the search is done you get a zipped folder. On milou, create the 32 | directory:: 33 | 34 | mkdir -p ~/metagenomics/cfa/wmga-cog 35 | 36 | Use wget or curl to download the zip file on uppmax or use scp to upload it to 37 | that folder i.e.:: 38 | 39 | scp output.zip username@milou.uppmax.uu.se:~/metagenomics/cfa/wmga-cog 40 | 41 | Then unzip the file on kalkyl:: 42 | 43 | cd ~/metagenomics/cfa/wmga-cog 44 | unzip output.zip 45 | 46 | Have a look at the README.txt to see what all the files represent. The file 47 | output.2 includes detailed information on the classifications for every protein 48 | with a hit below the -e value cutoff. View them with:: 49 | 50 | less README.txt 51 | less -S output.2 52 | 53 | NOTE: If the queueing takes too much time you can also just copy the results 54 | from the project dir:: 55 | 56 | cp -r /proj/g2014113/metagenomics/cfa/wmga-cog/ ~/metagenomics/cfa/ 57 | 58 | **Question: What seem to be the 3 most abundant COG classes in our combined 59 | sample (not taking coverage into account)?** 60 | 61 | .. less output.2.class | tail -n +2 | sort -nk2,2 | tail -3 62 | J 1895 Translation, ribosomal structure and biogenesis 63 | R 2031 General function prediction only 64 | E 2308 Amino acid transport and metabolism 65 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/rrna.rst: -------------------------------------------------------------------------------- 1 | ============================================================== 2 | Extracting rRNA encoding reads and annotating them 3 | ============================================================== 4 | Taxonomic composition of a sample can be based on e.g. BLASTing the contigs 5 | against a database of reference genomes, or by utilising rRNA sequences. 6 | Usually assembly doesn’t work well for rRNA genes due to their highly conserved 7 | regions, therefore extracting rRNA from contigs will miss a lot of the 8 | taxonomic information that can be obtained by analysing the reads directly. 9 | Analysing the reads also has the advantage of being quantitative, i.e. we don’t 10 | need to calculate coverages by the mapping procedure we applied for the 11 | functional genes above. We will extract rRNA encoding reads with the program 12 | sortmeRNA which is one of the fastest software solutions for this. The program 13 | sortmeRNA has built-in multithreading support so this time we use that for 14 | parallelization. 
These are the commands to run:: 15 | 16 | mkdir -p ~/metagenomics/cta/sortmerna 17 | cd ~/metagenomics/cta/sortmerna 18 | samplenames=(0328 0403 0423 0531 0619 0705 0709 1001 1004 1028 1123) 19 | for s in ${samplenames[*]} 20 | do sortmerna -n 2 --db \ 21 | /proj/g2014113/src/sortmerna-1.9/rRNA_databases/silva-arc-16s-database-id95.fasta \ 22 | /proj/g2014113/src/sortmerna-1.9/rRNA_databases/silva-bac-16s-database-id85.fasta \ 23 | --I /proj/g2014113/metagenomics/cta/reads/${s}_pe.fasta \ 24 | --accept ${s}_rrna \ 25 | --other ${s}_nonrrna \ 26 | --bydbs \ 27 | -a 8 \ 28 | --log ${s}_bilan \ 29 | -m 5242880 30 | done 31 | 32 | Again, this command takes rather long to run (~5m per sample) so just copy the results if you don’t feel like waiting:: 33 | 34 | cp /proj/g2014113/metagenomics/cta/sortmerna/* ~/metagenomics/cta/sortmerna 35 | 36 | It outputs the reads or part of reads that encode rRNA in a fasta file. These 37 | rRNA sequences can be classified in many ways, and again blasting them against 38 | a suitable database is one option. Here we use a simple and fast method (unless 39 | you have too many samples), the classifier tool at RDP (ribosomal database 40 | project). This uses a naive bayesian classifier trained on many sequences of 41 | defined taxonomies. It gives bootstrap support values for each taxonomic level; 42 | usually the support gets lower the further down the hierarchy you go. Genus 43 | level is the lowest level provided. You can use the web service if you prefer, 44 | and upload each file individually, or you can use the uppmax installation of 45 | RDP classifier like this (~4m):: 46 | 47 | mkdir -p ~/metagenomics/cta/rdp 48 | cd ~/metagenomics/cta/rdp 49 | for s in ../sortmerna/*_rrna*.fasta 50 | do java -Xmx1g -jar /glob/inod/src/rdp_classifier_2.6/dist/classifier.jar \ 51 | classify \ 52 | -g 16srrna \ 53 | -b `basename ${s}`.bootstrap \ 54 | -h `basename ${s}`.hier.tsv \ 55 | -o `basename ${s}`.class.tsv \ 56 | ${s} 57 | done 58 | -------------------------------------------------------------------------------- /source/annotation/translation.rst: -------------------------------------------------------------------------------- 1 | ========================================================== 2 | Translating nucleotide sequences into amino acid sequences 3 | ========================================================== 4 | The first step before we can annotate the metagenomes with Pfam domains 5 | using HMMER will be to translate the reads into amino acid sequences. This 6 | is necessary because HMMER (still) does not translate nucleotide sequnces 7 | into protein space on the fly (like,for example, BLAST). For completing 8 | this task we will use ``transeq``, part of the `EMBOSS `_ 9 | package. 10 | 11 | Running ``transeq`` on the sequence data sets 12 | ============================================= 13 | To run ``transeq``, take a look at its available options:: 14 | 15 | transeq -h 16 | 17 | If you have trouble getting ``transeq`` to run, try to run:: 18 | 19 | module load emboss 20 | 21 | A few options are important in this context. First of all, we need to 22 | supply an input file, using the (somewhat bulky) option ``-sequence``. 23 | Second, we also need to specify an output file, otherwise transeq will 24 | simply write its output to the screen. This is specified using the 25 | ``-outseq`` option. 26 | 27 | However, if we just run ``transeq`` like this we will 28 | run into two additional problems. 
First, ``transeq`` by default just 29 | translates the reading frame beginning at the first base in the input sequence, 30 | and will ignore any bases in the reading frames beginning with base two 31 | and three, as well as those on the reverse-complementary strand. Second, 32 | the software will add stop characters in the form of asterisks ``*`` whenever 33 | it encounters a stop codon. This will occasionally cause HMMER to choke, so we 34 | want stop codons to instead be translated into X characters that HMMER can handle. 35 | The following excerpt from the HMMER creator's blog 36 | on this subject is one of my personal all-time favorites in terms of computer 37 | software documentation: 38 | 39 | There’s two ways people do six-frame translation. You can translate each 40 | frame into separate ORFs, or you can translate the read into exactly six 41 | “ORFs”, one per frame, with * characters marking stop codons. HMMER 42 | prefers that you do the former. Technically, * characters aren’t legal 43 | amino acid residue codes, and the author of HMMER3 is a pedantic nitpicker, 44 | passive-aggressive, yet also a pragmatist: so while HMMER3 pragmatically 45 | accepts * characters in input “protein” sequences just fine, it pedantically 46 | relegates them to somewhat suboptimal status, and it passively-aggressively 47 | figures that any suboptimal performance on \*-containing ORFs is your own 48 | fault for using \*’s in the first place. 49 | 50 | To avoid making Sean Eddy angry and causing other problems for our HMMER runs, 51 | we will use the ``-frame 6`` option to ``transeq`` in order to get translations 52 | of all six reading frames, and the ``-clean`` option to convert stop codons to X 53 | instead of \*. 54 | 55 | That should give us the command:: 56 | 57 | transeq -sequence <input file> -outseq <output file> -frame 6 -clean 58 | 59 | Now run this command on all four input files that we have created links to. 60 | When the command has finished for all files, we can move on to the actual 61 | annotation. 62 | -------------------------------------------------------------------------------- /source/assembly/qtrim.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Quality trimming Illumina paired-end reads 3 | ========================================== 4 | In this exercise you will learn how to quality trim Illumina paired-end reads, 5 | generated by the most common Next Generation Sequencing (NGS) technology for metagenomics. 6 | 7 | Sickle 8 | ====== 9 | For quality trimming Illumina paired-end reads we use the tool sickle, which 10 | trims reads from the 3' end to the 5' end using a sliding window. If the mean quality 11 | drops below a specified number, the remaining part of the read will be trimmed. 12 | 13 | 14 | Downloading a test set 15 | ====================== 16 | Today we'll be working on a small metagenomic data set from the anterior nares 17 | (http://en.wikipedia.org/wiki/Anterior_nares). 18 | 19 | .. image:: https://raw.github.com/inodb/2013-metagenomics-workshop-gbg/master/images/nostril.jpg 20 | 21 | 22 | So get ready for your first smell of metagenomic assembly - pun intended. 
Run 23 | all these commands in your shell:: 24 | 25 | # Download the reads and extract them 26 | mkdir -p ~/asm-workshop 27 | mkdir -p ~/asm-workshop/data 28 | cd ~/asm-workshop/data 29 | wget http://downloads.hmpdacc.org/data/Illumina/anterior_nares/SRS018585.tar.bz2 30 | tar -xjf SRS018585.tar.bz2 31 | 32 | If successfull you should have the files:: 33 | 34 | $ ls -lh ~/asm-workshop/data/SRS018585/ 35 | -rw-rw-r-- 1 inod inod 36M Apr 18 2011 SRS018585.denovo_duplicates_marked.trimmed.1.fastq 36 | -rw-rw-r-- 1 inod inod 36M Apr 18 2011 SRS018585.denovo_duplicates_marked.trimmed.2.fastq 37 | -rw-rw-r-- 1 inod inod 6.4M Apr 18 2011 SRS018585.denovo_duplicates_marked.trimmed.singleton.fastq 38 | 39 | If not, try to find out if one of the previous commands gave an error. 40 | 41 | Look at the top of the one of the pairs:: 42 | 43 | cat ~/asm-workshop/data/SRS018585/SRS018585.denovo_duplicates_marked.trimmed.1.fastq | head 44 | 45 | **Question: Can you explain what the different parts of this header mean @HWI-EAS324_102408434:5:100:10055:13493/1?** 46 | 47 | 48 | Running sickle on a paired end library 49 | ====================================== 50 | I like to create directories for specific parts I'm working on and creating 51 | symbolic links (shortcuts in windows) to the input files. One can use the 52 | command ``ln`` for creating links. The difference between a symbolic link and a 53 | hard link can be found here: 54 | http://stackoverflow.com/questions/185899/what-is-the-difference-between-a-symbolic-link-and-a-hard-link. 55 | In this case I use symbolic links so I know what path the original reads have, 56 | which can help one remember what those reads were:: 57 | 58 | mkdir -p ~/asm-workshop/sickle 59 | cd ~/asm-workshop/sickle 60 | ln -s ../data/SRS018585/SRS018585.denovo_duplicates_marked.trimmed.1.fastq pair1.fastq 61 | ln -s ../data/SRS018585/SRS018585.denovo_duplicates_marked.trimmed.2.fastq pair2.fastq 62 | 63 | Now run sickle:: 64 | 65 | # check if sickle is in your PATH 66 | which sickle 67 | # Run sickle 68 | sickle pe \ 69 | -f pair1.fastq \ 70 | -r pair2.fastq \ 71 | -t sanger \ 72 | -o qtrim1.fastq \ 73 | -p qtrim2.fastq \ 74 | -s qtrim.unpaired.fastq 75 | # Check what files have been generated 76 | ls 77 | 78 | Sickle states how many reads it trimmed, but it is always good to be 79 | suspicious! Check if the numbers correspond with the amount of reads you count. 80 | Hint: use ``wc -l``. 81 | 82 | **Question: How many paired reads are left after trimming? How many singletons?** 83 | 84 | **Question: What are the different quality scores that sickle can handle? Why do we specify -t sanger here?** 85 | -------------------------------------------------------------------------------- /mako/templates/binning/phylosift.rst: -------------------------------------------------------------------------------- 1 | =========================================== 2 | Phylogenetic Classification using Phylosift 3 | =========================================== 4 | In this workshop we'll extract interesting bins from the concoct runs and investigate which species they consists of. We'll start by using a plain'ol BLASTN search and later we'll try a more sophisticated strategy with the program Phylosift. 
5 | 6 | Extract bins from CONCOCT output 7 | ================================ 8 | The output from concoct is only a list of cluster id and contig ids respectively, so if we'd like to have fasta files for all our bins, we need to run the following script:: 9 | 10 | ${commands['extract_fasta_help']} 11 | 12 | Running it will create a separate fasta file for each bin, so we'd first like to create a output directory where we can store these files:: 13 | 14 | ${'\n '.join(commands['extract_fasta'])} 15 | 16 | Now you can see a number of bins in your output folder:: 17 | 18 | ${commands['list_bins']} 19 | 20 | Using the graph downloaded in the previous part, decide one cluster you'd like to investigate further. We're going to use the web based BLASTN tool at ncbi, so lets first download the fasta file for the cluster you choose. Execute on a terminal not logged in to UPPMAX:: 21 | 22 | ${commands['download_fasta']} 23 | 24 | Before starting to blasting this cluster, lets begin with the next assignment, since the next assignment will include a long waiting time that suits for running the BLASTN search. 25 | 26 | Phylosift 27 | ========= 28 | Phylosift is a software created for the purpose of determining the phylogenetic composition of your metagenomic data. It uses a defined set of genes to predict the taxonomy of each sequence in your dataset. You can read more about how this works here: http://phylosift.wordpress.com 29 | I've yet to discover how to install phylosift into a common bin, so in order to execute phylosift, you'd have to cd into the phylosift directory:: 30 | 31 | ${commands['move_to_phylosift']} 32 | 33 | Running phylosift will take some time (roughly 45 min) and UPPMAX do not want you to run this kind of heavy jobs on the regular login session, so what we'll do is to allocate an interactive node. For this course we have 16 nodes booked and available for our use so you will not need to wait in line. Start your interactive session with 4 cores available:: 34 | 35 | ${commands['allocate_interactive']} 36 | 37 | Now we have more computational resources available so lets start running phylosift on the cluster you choose (excange x in x.fa for your cluster number). You could also choose to use the clusters from the binning results using a single sample, but then you need to redo the fasta extraction above.:: 38 | 39 | ${'\n '.join(commands['run_phylosift'])} 40 | 41 | While this command is running, go to ncbi web blast service: 42 | 43 | http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome 44 | 45 | Upload your fasta file that you downloaded in the previous step and submit a blast search against the nr/nt database. 46 | Browse through the result and try and see if you can do a taxonomic classification from these. 47 | 48 | When the phylosift run is completed, browse the output directory:: 49 | 50 | ${commands['browse_phylosift']} 51 | 52 | All of these files are interesting, but the most fun one is the html file, so lets download this to your own computer and have a look. Again, switch to a terminal where you're not logged in to UPPMAX:: 53 | 54 | ${commands['download_phylosift']} 55 | 56 | Did the phylosift result correspond to any results in the BLAST output? 
57 | 58 | As you hopefully see, this phylosift result file is quite neat, but it doesn't show its full potential using a pure cluster, so to display the results for a more diverse input file we have prepared a run for the complete dataset:: 59 | 60 | ${commands['browse_all_phylosift']} 61 | 62 | And download this (running it on your own terminal again):: 63 | 64 | ${commands['download_phylosift_all']} 65 | 66 | Can you "find your bin" within this result file? 67 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/genefinding.rst: -------------------------------------------------------------------------------- 1 | ================== 2 | Gene finding 3 | ================== 4 | Now that you have assembled the data into contigs next natural step to do is 5 | annotation of the data, i.e. finding the genes and doing functional annotation 6 | of those. For gene finding a range of programs are available (Metagene 7 | Annotator, MetaGeneMark, Orphelia, FragGeneScan), here we will use Prodigal 8 | which is very fast and has recently been enhanced for metagenomics. We will use 9 | the -p flag which instructs Prodigal to use the algorithm suitable for 10 | metagenomic data. We will use a dataset consisting of 11 samples from a time 11 | series sampling of surface water in the Baltic Sea. Sequencing was done with 12 | Illumina MiSeq here generating on average 835,048 2 x 250 bp reads per sample. 13 | The reads can be found here:: 14 | 15 | /proj/g2014113/metagenomics/comparative-functional-analysis/reads 16 | 17 | The first four numbers in the filename represent a date. All samples are from 18 | 2012. R1 and R2 both contain one read of a pair. They are ordered, so the first 19 | four lines in R1 are paired with the read in the first four lines of R2. They 20 | are in CASAVA v1.8 format (http://en.wikipedia.org/wiki/FASTQ_format). 21 | 22 | A coassembly has already been made with Ray using all reads to save you some 23 | time. You can find the contigs from a combined assembly on reads from all 24 | samples here:: 25 | 26 | /proj/g2014113/metagenomics/cfa/assembly/baltic-sea-ray-noscaf-41.1000.fa 27 | 28 | They have been constructed with Ray using a kmer of 41 and no scaffolding. Only 29 | contigs >= 1000 are in this file. The reason a coassembly is used is that we 30 | can get an idea of the entire metagenome over multiple samples. By mapping the 31 | reads back per sample we can compare coverages of contigs between samples. 32 | 33 | **Question: What could be a possible advantage/disadvantage for the assembly 34 | process when assembling multiple samples at one time?** 35 | 36 | .. Advantage: more coverage. Disadvantage: more related strains/species makes 37 | .. graph traversal harder 38 | 39 | **Question: Can you think of other approaches to get a coassembly?** 40 | 41 | .. Maybe map contigs against each other in merge them in that way. Preferably 42 | .. taking coverages into account 43 | 44 | Note that all solutions (i.e. the generated outputs) for the exercises are also in:: 45 | 46 | /proj/g2014113/metagenomics/cfa/ 47 | 48 | In all the following exercises you should again use the virtual environment to 49 | get all the necessary programs (unless you already loaded it ofc):: 50 | 51 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 52 | 53 | It’s time to run Prodigal. 
First create an output directory with a copy of the 54 | contig file:: 55 | 56 | mkdir -p ~/metagenomics/cfa/prodigal 57 | cd ~/metagenomics/cfa/prodigal 58 | cp /proj/g2014113/metagenomics/cfa/assembly/baltic-sea-ray-noscaf-41.1000.fa . 59 | 60 | Then run Prodigal on the contig file (~2m20):: 61 | 62 | prodigal -a baltic-sea-ray-noscaf-41.1000.aa.fa \ 63 | -d baltic-sea-ray-noscaf-41.1000.nuc.fa \ 64 | -i baltic-sea-ray-noscaf-41.1000.fa \ 65 | -f gff -p meta \ 66 | > baltic-sea-ray-noscaf-41.1000.gff 67 | 68 | This will produce 3 files: 69 | 70 | * ``-d`` a fasta file with the gene sequences (nucleotides) 71 | * ``-a`` a fasta file with the protein sequences (aminoacids) 72 | * ``stdout`` a gff file 73 | 74 | The gff format is a standardised file type for showing annotations.It’s a tab 75 | delimited file that can be viewed by e.g. :: 76 | 77 | less baltic-sea-ray-noscaf-41.1000.gff 78 | 79 | Pass the option -S to less if you don’t want lines to wrap 80 | 81 | An explanation of the gff format can be found at 82 | http://genome.ucsc.edu/FAQ/FAQformat.html 83 | 84 | **Question: How many coding regions were found by Prodigal? Hint: use grep -c** 85 | 86 | .. less *.gff | grep -c 'CDS' 87 | .. 23577 88 | 89 | **Question: How many contigs have coding regions? How many do not?** 90 | 91 | .. less *.gff | grep '^contig' | grep 'CDS' | awk '{print $1}' | sort -u | wc -l 92 | .. 8517 93 | .. grep -c '^>cont' baltic-sea-ray-noscaf-41.1000.fa 94 | .. 8533 95 | .. 8533-8517=16 96 | -------------------------------------------------------------------------------- /source/assembly/map.rst: -------------------------------------------------------------------------------- 1 | ============================================ 2 | Mapping reads back to the assembly 3 | ============================================ 4 | 5 | Overview 6 | ====================== 7 | 8 | There are many different mappers available to map your reads back to the 9 | assemblies. Usually they result in a SAM or BAM file 10 | (http://genome.sph.umich.edu/wiki/SAM). Those are formats that contain the 11 | alignment information, where BAM is the binary version of the plain text SAM 12 | format. In this tutorial we will be using bowtie2 13 | (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml). 14 | 15 | 16 | The SAM/BAM file can afterwards be processed with Picard 17 | (http://picard.sourceforge.net/) to remove duplicate reads. Those are likely to 18 | be reads that come from a PCR duplicate (http://www.biostars.org/p/15818/). 19 | 20 | 21 | BEDTools (http://code.google.com/p/bedtools/) can then be used to retrieve 22 | coverage statistics. 23 | 24 | 25 | There is a script available that does it all at once. Read it and try to 26 | understand what happens in each step:: 27 | 28 | less `which map-bowtie2-markduplicates.sh` 29 | map-bowtie2-markduplicates.sh -h 30 | 31 | Bowtie2 has some nice documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml 32 | 33 | **Question: what does bowtie2-build do?** 34 | 35 | Picard's documentation also exists! Two bioinformatics programs in a row with 36 | decent documentation! 
Take a moment to celebrate, then have a look here: 37 | http://sourceforge.net/apps/mediawiki/picard/index.php 38 | 39 | **Question: Why not just remove all identitical pairs instead of mapping them 40 | and then removing them?** 41 | 42 | **Question: What is the difference between samtools rmdup and Picard MarkDuplicates?** 43 | 44 | 45 | 46 | Mapping reads with bowtie2 47 | ========================== 48 | Take an assembly and try to map the reads back using bowtie2. Do this on an 49 | interactive node again, and remember to change the 'out_21' part to the actual output directory that you generated:: 50 | 51 | # Create a new directory and link files 52 | mkdir -p ~/asm-workshop/bowtie2 53 | cd ~/asm-workshop/bowtie2 54 | ln -s ../velvet/out_21/contigs.fa contigs.fa 55 | ln -s ../sickle/pair1.fastq pair1.fastq 56 | ln -s ../sickle/pair2.fastq pair2.fastq 57 | 58 | # Run the everything in one go script. 59 | map-bowtie2-markduplicates.sh -t 1 -c pair1.fastq pair2.fastq pair contigs.fa contigs map > map.log 2> map.err 60 | 61 | Inspect the ``map.log`` output and see if all went well. 62 | 63 | **Question: What is the overall alignment rate of your reads that bowtie2 reports?** 64 | 65 | Add the answer to the doc_. 66 | 67 | 68 | Some general statistics from the SAM/BAM file 69 | ============================================= 70 | You can also determine mapping statistics directly from the bam file. Use for 71 | instance:: 72 | 73 | # Mapped reads only 74 | samtools view -c -F 4 map/contigs_pair-smds.bam 75 | 76 | # Unmapped reads only 77 | samtools view -c -f 4 map/contigs_pair-smds.bam 78 | 79 | From: 80 | http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/. 81 | The number is different from the number that bowtie2 reports, because these are 82 | the numbers after removing duplicates. The ``-smds`` part stands for running 83 | ``samtools sort``, ``MarkDuplicates.jar`` and ``samtools sort`` again on the 84 | bam file. If all went well with the mapping there should also be a 85 | ``map/contigs_pair-smd.metrics`` file where you can see the percentage of 86 | duplication. Add that to the doc_ as well. 87 | 88 | 89 | Coverage information from BEDTools 90 | ============================================= 91 | Look at the output from BEDTools:: 92 | 93 | less map/contigs_pair-smds.coverage 94 | 95 | The format is explained here 96 | http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html. The 97 | ``map-bowtie2-markduplicates.sh`` script also outputs the mean coverage per 98 | contig:: 99 | 100 | less map/contigs_pair-smds.coverage.percontig 101 | 102 | **Question: What is the contig with the highest coverage? Hint: use sort -k** 103 | 104 | .. 
_doc: https://docs.google.com/spreadsheet/ccc?key=0AvduvUOYAB-_dDdDSVhqUi1KQmJkTlZJcHVfMGI3a2c#gid=3 105 | -------------------------------------------------------------------------------- /mako/settings.yaml: -------------------------------------------------------------------------------- 1 | commands: 2 | activate: 'source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate' 3 | check_activate: 4 | concoct: 'concoct -h' 5 | rpsblast: 'rpsblast --help' 6 | copy_dataset: 7 | - 'mkdir -p ~/binning-workshop' 8 | - 'mkdir -p ~/binning-workshop/data' 9 | - 'cd ~/binning-workshop/' 10 | - 'cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000.fa data/' 11 | - 'cp /proj/g2014113/nobackup/concoct-workshop/120322_coverage_nolen.tsv data/' 12 | browse_coverage: 'less -S ~/binning-workshop/data/120322_coverage_nolen.tsv' 13 | cut_coverage: 'cut -f1,3 ~/binning-workshop/data/120322_coverage_nolen.tsv > ~/binning-workshop/data/120322_coverage_one_sample.tsv' 14 | run_concoct: 15 | one_sample: 16 | - 'mkdir -p ~/binning-workshop/concoct_output' 17 | - 'concoct --coverage_file ~/binning-workshop/data/120322_coverage_one_sample.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_one_sample/' 18 | look_clustering: 'less ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv' 19 | all_samples: 20 | 'concoct --coverage_file ~/binning-workshop/data/120322_coverage_nolen.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_all_samples/' 21 | copy_blast: 22 | - 'cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000_blast.out ~/binning-workshop/data/' 23 | - 'cp /proj/g2014113/nobackup/concoct-workshop/scg_cogs_min0.97_max1.03_unique_genera.txt ~/binning-workshop/data/' 24 | - 'cp /proj/g2014113/nobackup/concoct-workshop/cdd_to_cog.tsv ~/binning-workshop/data/' 25 | check_cog_scripts: 26 | - 'COG_table.py -h' 27 | - 'COGPlot.R -h' 28 | cogplot_single: 29 | - 'COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_single_sample.tsv' 30 | - 'COGPlot.R -s ~/binning-workshop/cog_table_3000_single_sample.tsv -o ~/binning-workshop/cog_plot_3000_single_sample.pdf' 31 | download_single_cogplot: 'scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_single_sample.pdf ~/Desktop/' 32 | cogplot_multiple: 33 | - 'COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_all_samples.tsv' 34 | - 'COGPlot.R -s ~/binning-workshop/cog_table_3000_all_samples.tsv -o ~/binning-workshop/cog_plot_3000_all_samples.pdf' 35 | download_multiple_cogplot: 'scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_all_samples.pdf ~/Desktop' 36 | extract_fasta_help: 'extract_fasta_bins.py -h' 37 | extract_fasta: 38 | - 'mkdir -p ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins' 39 | - 'extract_fasta_bins.py ~/binning-workshop/data/Contigs_gt1000.fa ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv --output_path 
~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/' 40 | download_fasta: 'scp username@milou.uppmax.uu.se:~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa ~/Desktop/' 41 | list_bins: 'ls ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins' 42 | allocate_interactive: 'interactive -A g2014113 -p core -n 4 -t 4:00:00' 43 | move_to_phylosift: 'cd /proj/g2014113/src/phylosift_v1.0.1' 44 | run_phylosift: 45 | - 'mkdir -p ~/binning-workshop/phylosift_output/' 46 | - '/proj/g2014113/src/phylosift_v1.0.1/phylosift all -f --output ~/binning-workshop/phylosift_output/ ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa' 47 | browse_phylosift: 'ls ~/binning-workshop/phylosift_output/' 48 | download_phylosift: 'scp username@milou.uppmax.uu.se:~/binning-workshop/phylosift_output/x.fa.html ~/Desktop/' 49 | browse_all_phylosift: 'ls /proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/' 50 | download_phylosift_all: 'scp username@milou.uppmax.uu.se:/proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/Contigs_gt1000.fa.html ~/Desktop/' 51 | -------------------------------------------------------------------------------- /source/annotation/software.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Checking required software 3 | ========================================== 4 | Before we begin, we will quickly go through the required software and datasets 5 | for this workshop. For those who are already command-line-skilled there will 6 | also be a possibility to install the Metaxa2 tool for rRNA finding, but this 7 | is not required to complete the workshop. 8 | 9 | Programs used in this workshop 10 | ============================== 11 | The following programs are used in this workshop: 12 | 13 | - `EMBOSS (transeq)`__ 14 | - HMMER_ 15 | - R_ 16 | - Optionally: Metaxa2_ 17 | 18 | .. __: http://emboss.sourceforge.net 19 | .. _HMMER: http://hmmer.janelia.org 20 | .. _R: http://www.r-project.org 21 | .. _Metaxa2: http://microbiology.se/software/metaxa2/ 22 | 23 | Since we are going to use the plotting functionality of R, we need to login 24 | to Uppmax with X11 forwarding turned on. In the Unix/Linux terminal this is 25 | easily achieved by adding the ``-X`` (captal X) option. All programs but 26 | Metaxa2 are already installed, all you have to do is load the virtual 27 | environment for this workshop. Once you are logged in to the server run:: 28 | 29 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 30 | 31 | You deactivate the virtual environment with:: 32 | 33 | deactivate 34 | 35 | NOTE: This is a python virtual environment. The binary folder of the virtual 36 | environment has symbolic links to all programs used in this workshop so you 37 | should be able to run those without problems. 38 | 39 | 40 | Check all programs in one go with which 41 | ================================================== 42 | To check whether you have all programs installed in one go, you can use ``which`` 43 | to test for the following programs:: 44 | 45 | hmmsearch 46 | transeq 47 | R 48 | blastall 49 | 50 | Data and databases used in this workshop 51 | ======================================== 52 | In this workshop, we are (due to time constraints) going to use a simplified version 53 | of the `Pfam `__ database, including only protein families 54 | related to plasmid replication and maintenance. 
This database is pre-compiled and can 55 | be downloaded from http://microbiology.se/teach/scilife2014/pfam.tar.gz. 56 | Download it using the following commands:: 57 | 58 | mkdir -p ~/Pfam 59 | cd ~/Pfam 60 | wget http://microbiology.se/teach/scilife2014/pfam.tar.gz 61 | tar -xzvf pfam.tar.gz 62 | cd ~ 63 | 64 | In addition, you will need to obtain the following data sets for the workshop:: 65 | 66 | /proj/g2014113/metagenomics/annotation/baltic1.fna 67 | /proj/g2014113/metagenomics/annotation/baltic2.fna 68 | /proj/g2014113/metagenomics/annotation/indian_lake.fna 69 | /proj/g2014113/metagenomics/annotation/swedish_lake.fna 70 | 71 | We are going to use two data sets from the Baltic Sea, one from a Swedish lake and one 72 | from an Indian lake contaminated with wastewater from pharmaceutical production. For 73 | the sake of time, I have reduced the data sets in size dramatically prior to this 74 | workshop. You can create links to the above files using the ``ln -s`` command. 75 | Use it on all four data sets. 76 | 77 | 78 | (Optional exercise) Install Metaxa2 by yourself 79 | ================================================ 80 | Follow these steps only if you want to install ``Metaxa2`` by yourself. 81 | The code for Metaxa2 is available from http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz. 82 | You can install Metaxa2 as follows:: 83 | 84 | # Create a src and a bin directory 85 | mkdir -p ~/src 86 | mkdir -p ~/bin 87 | 88 | # Go to the source directory and download the Metaxa2 tarball 89 | cd ~/src 90 | wget http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz 91 | tar -xzvf Metaxa2_2.0rc3.tar.gz 92 | cd Metaxa2_2.0rc3 93 | 94 | # Run the installation script 95 | ./install_metaxa2 96 | 97 | # Try to run Metaxa2 (this should bring up the main options for the software) 98 | metaxa2 -h 99 | 100 | If this did not work, you can try this manual approach:: 101 | 102 | cd ~/src/Metaxa2_2.0rc3 103 | cp -r metaxa2* ~/bin/ 104 | 105 | # Then try to run Metaxa2 again 106 | metaxa2 -h 107 | 108 | If this brings up the help message, you are all set! 109 | -------------------------------------------------------------------------------- /source/binning/phylosift.rst: -------------------------------------------------------------------------------- 1 | =========================================== 2 | Phylogenetic Classification using Phylosift 3 | =========================================== 4 | In this workshop we'll extract interesting bins from the concoct runs and investigate which species they consist of. We'll start by using a plain ol' BLASTN search and later we'll try a more sophisticated strategy with the program Phylosift. 
5 | 6 | Extract bins from CONCOCT output 7 | ================================ 8 | The output from concoct is only a list of cluster id and contig ids respectively, so if we'd like to have fasta files for all our bins, we need to run the following script:: 9 | 10 | extract_fasta_bins.py -h 11 | 12 | Running it will create a separate fasta file for each bin, so we'd first like to create a output directory where we can store these files:: 13 | 14 | mkdir -p ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins 15 | extract_fasta_bins.py ~/binning-workshop/data/Contigs_gt1000.fa ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv --output_path ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/ 16 | 17 | Now you can see a number of bins in your output folder:: 18 | 19 | ls ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins 20 | 21 | Using the graph downloaded in the previous part, decide one cluster you'd like to investigate further. We're going to use the web based BLASTN tool at ncbi, so lets first download the fasta file for the cluster you choose. Execute on a terminal not logged in to UPPMAX:: 22 | 23 | scp username@milou.uppmax.uu.se:~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa ~/Desktop/ 24 | 25 | Before starting to blasting this cluster, lets begin with the next assignment, since the next assignment will include a long waiting time that suits for running the BLASTN search. 26 | 27 | Phylosift 28 | ========= 29 | Phylosift is a software created for the purpose of determining the phylogenetic composition of your metagenomic data. It uses a defined set of genes to predict the taxonomy of each sequence in your dataset. You can read more about how this works here: http://phylosift.wordpress.com 30 | I've yet to discover how to install phylosift into a common bin, so in order to execute phylosift, you'd have to cd into the phylosift directory:: 31 | 32 | cd /proj/g2014113/src/phylosift_v1.0.1 33 | 34 | Running phylosift will take some time (roughly 45 min) and UPPMAX do not want you to run this kind of heavy jobs on the regular login session, so what we'll do is to allocate an interactive node. For this course we have 16 nodes booked and available for our use so you will not need to wait in line. Start your interactive session with 4 cores available:: 35 | 36 | interactive -A g2014113 -p core -n 4 -t 4:00:00 37 | 38 | Now we have more computational resources available so lets start running phylosift on the cluster you choose (excange x in x.fa for your cluster number). You could also choose to use the clusters from the binning results using a single sample, but then you need to redo the fasta extraction above.:: 39 | 40 | mkdir -p ~/binning-workshop/phylosift_output/ 41 | /proj/g2014113/src/phylosift_v1.0.1/phylosift all -f --output ~/binning-workshop/phylosift_output/ ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa 42 | 43 | While this command is running, go to ncbi web blast service: 44 | 45 | http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome 46 | 47 | Upload your fasta file that you downloaded in the previous step and submit a blast search against the nr/nt database. 48 | Browse through the result and try and see if you can do a taxonomic classification from these. 
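If the web form is busy or you prefer to stay on the command line, the same kind of search can in principle be run with BLAST+ and its remote option (a sketch only, not part of the workshop setup: it assumes a local ``blastn`` binary from BLAST+, sends your query to NCBI and can be slow for large bins; replace ``x.fa`` with your chosen cluster as before)::

    blastn -query x.fa -db nt -remote \
        -outfmt '6 qseqid sseqid pident length evalue stitle' \
        -max_target_seqs 5 -out x_blastn.tsv
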
49 | 50 | When the phylosift run is completed, browse the output directory:: 51 | 52 | ls ~/binning-workshop/phylosift_output/ 53 | 54 | All of these files are interesting, but the most fun one is the html file, so let's download this to your own computer and have a look. Again, switch to a terminal where you're not logged in to UPPMAX:: 55 | 56 | scp username@milou.uppmax.uu.se:~/binning-workshop/phylosift_output/x.fa.html ~/Desktop/ 57 | 58 | Did the phylosift result correspond to any results in the BLAST output? 59 | 60 | As you hopefully see, this phylosift result file is quite neat, but it doesn't show its full potential using a pure cluster, so to display the results for a more diverse input file we have prepared a run for the complete dataset:: 61 | 62 | ls /proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/ 63 | 64 | And download this (running it on your own terminal again):: 65 | 66 | scp username@milou.uppmax.uu.se:/proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/Contigs_gt1000.fa.html ~/Desktop/ 67 | 68 | Can you "find your bin" within this result file? 69 | 70 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ================================== 2 | Metagenomics Workshop SciLifeLab 3 | ================================== 4 | 5 | This repository holds the code for the website of the metagenomics workshop 6 | held at SciLifeLab, Stockholm 21-23 May 2014. The website is written using 7 | Sphinx_. The webpage can be found at: 8 | 9 | http://inodb.github.io/2014-5-metagenomics-workshop/ 10 | 11 | and 12 | 13 | http://2014-5-metagenomics-workshop.readthedocs.org/ 14 | 15 | How does it work? 16 | ------------------------- 17 | In short, we use a python package called Sphinx_ to convert a bunch of text 18 | files written in reStructuredText_ (reST) to HTML pages. Instead of editing the 19 | HTML directly you change text files in the reST_ format. Those are the 20 | ``*.rst`` files in the `source directory`_. That's all you need to know to 21 | start `Contributing`_. 22 | 23 | Contributing 24 | ------------- 25 | We follow the Fork_ & pull_ model. It's not necessary to do anything on the 26 | command line. All you have to do is click on fork. Then you can edit the 27 | ``*.rst`` files directly through the GitHub interface if you want. Only the 28 | Sphinx-specific commands will not work, such as the table of contents command 29 | ``toctree``. You can also `add new files`_ by clicking on the plus symbol next 30 | to a directory. After you are satisfied with your changes you click on the pull 31 | request button. Do note that changing the ``*.rst`` files does not change the 32 | actual webpage, but somebody else (e.g. me) can do that for you. If you want 33 | to learn how to compile the ``*.rst`` files to ``*.html``, please read on. 34 | 35 | Compile the reST files to HTML locally 36 | --------------------------------------- 37 | The only thing that is a bit more tricky is actually compiling the ``*.rst`` 38 | files to ``*.html`` files. This is not necessary to contribute since you can 39 | see the results on GitHub (GitHub shows ``*.rst`` files as they would look 40 | in HTML by default). If you want to compile the files locally you would do:: 41 | 42 | pip install sphinx # install sphinx 43 | git clone https://github.com/inodb/2014-5-metagenomics-workshop && cd 2014-5-metagenomics-workshop 44 | make html 45 | 46 | The resulting HTML pages are in the folder ``build/``.
You can open the files 47 | in your browser by typing e.g. 48 | ``file:///home/inodb/path/to/build/html/index.html`` in the address bar. If you 49 | want to make changes you should: 50 | 51 | 1. fork_ this repo 52 | 2. clone your forked repo 53 | 3. Make the changes to the ``*.rst`` files 54 | 4. run ``make html`` 55 | 5. look at the results 56 | 6. add the changes with ``git add files that you changed`` 57 | 7. commit the changes with ``git commit`` 58 | 8. push the changes to your own repo with ``git push`` 59 | 9. do a pull_ request by clicking on the pull request button on the GitHub page 60 | of your repo 61 | 62 | This only changes the ``*.rst`` files in the ``master`` branch, not the actual 63 | webpage, which is in the ``gh-pages`` branch. How that is set up is explained 64 | in the section below. 65 | 66 | Compile the reST files to HTML on milou 67 | --------------------------------------- 68 | The generated docs can be found on bit.ly/metalove. The HTML files are located in 69 | ``/proj/g2014113/webexport/``. To update those files you first clone the repository 70 | somewhere on milou. Then load the virtual environment of the workshop:: 71 | 72 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 73 | 74 | Then from the root dir of the repository run:: 75 | 76 | make milou 77 | 78 | The HTML files will then be updated. Obviously you should be part of the g2014113 project. 79 | 80 | Updating the HTML to GitHub Pages 81 | -------------------------------------- 82 | The website is hosted on `GitHub Pages`_. It works by having a branch called 83 | ``gh-pages`` on this repository, which has all the HTML. I used 84 | brantfaircloth's `sphinx_to_github.sh`_ script to set it up. Basically it sets 85 | up a ``gh-pages`` branch in the ``build/html`` folder of the repository, so 86 | every time you run ``make html`` it changes the files in that branch. You then 87 | ``cd build/html``, commit the new HTML files and push them to the ``gh-pages`` 88 | branch. After that the result can be viewed at: 89 | 90 | http://yourusername.github.io/reponame/ 91 | 92 | I'll update the branch ``gh-pages`` myself after your pull request with the 93 | changed ``*.rst`` files on the ``master`` branch has been accepted. 94 | 95 | 96 | .. _sphinx: http://sphinx-doc.org/ 97 | .. _fork: https://help.github.com/articles/fork-a-repo 98 | .. _pull: https://help.github.com/articles/using-pull-requests 99 | .. _reStructuredText: http://sphinx-doc.org/rest.html 100 | .. _reST: http://sphinx-doc.org/rest.html 101 | .. _source directory: https://github.com/inodb/2014-5-metagenomics-workshop/tree/master/source 102 | .. _GitHub Pages: https://pages.github.com/ 103 | .. _add new files: https://github.com/blog/1327-creating-files-on-github 104 | .. _sphinx_to_github.sh: https://gist.github.com/brantfaircloth/791759 105 | -------------------------------------------------------------------------------- /source/assembly/assembly.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Assembling reads with Velvet 3 | ========================================== 4 | In this exercise we will learn how to perform an assembly with Velvet. Velvet 5 | takes your reads as input and turns them into contigs. It consists of two 6 | steps. In the first step, ``velveth``, the de Bruijn graph is created. 7 | Afterwards the graph is traversed and contigs are created with ``velvetg``. 8 | When constructing the de Bruijn graph, a *kmer* length has to be specified.
Reads are 9 | cut up into pieces of length *k*, each representing a node in the graph, while edges 10 | represent an overlap (some de Bruijn graph assemblers do this differently, but 11 | the idea is the same). The advantage of using kmer overlap instead of read 12 | overlap is that the computational requirements grow with the number of unique 13 | kmers instead of unique reads. A more detailed explanation can be found at 14 | http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html. 15 | 16 | 17 | Pick a kmer 18 | =========== 19 | Please work in pairs for this assignment. Every group can select a kmer of 20 | their liking - pick a random one if you haven't developed a preference yet. 21 | Write your and your partner's names down at a kmer on the 22 | Google doc_ for this workshop. 23 | 24 | .. _doc: https://docs.google.com/spreadsheet/ccc?key=0AvduvUOYAB-_dDdDSVhqUi1KQmJkTlZJcHVfMGI3a2c#gid=3 25 | 26 | velveth 27 | ======= 28 | Create the graph data structure with ``velveth``. Again like we did with 29 | ``sickle``, first create a directory with symbolic links to the pairs that you 30 | want to use:: 31 | 32 | mkdir -p ~/asm-workshop/velvet 33 | cd ~/asm-workshop/velvet 34 | ln -s ../sickle/qtrim1.fastq pair1.fastq 35 | ln -s ../sickle/qtrim2.fastq pair2.fastq 36 | 37 | The reads need to be interleaved for ``velveth``:: 38 | 39 | shuffleSequences_fastq.pl pair1.fastq pair2.fastq pair.fastq 40 | 41 | Run velveth over the kmer you picked (21 in this example):: 42 | 43 | velveth out_21 21 -fastq -shortPaired pair.fastq 44 | 45 | Check what directories have been created:: 46 | 47 | ls 48 | 49 | velvetg 50 | ======= 51 | To get the actual contigs you will have to run ``velvetg`` on the created 52 | graph. You can vary options such as expected coverage and the coverage cut-off if 53 | you want, but we do not do that in this tutorial. We only choose not to do 54 | scaffolding:: 55 | 56 | velvetg out_21 -scaffolding no 57 | 58 | 59 | assemstats 60 | ========== 61 | After the assembly one wants to look at the length distributions of the 62 | resulting assemblies. You can use the ``assemstats`` script for that:: 63 | 64 | assemstats 100 out_*/contigs.fa 65 | 66 | Try to find out what each of the stats represents by varying the cut-off. One of 67 | the most often used statistics in assembly length distribution comparisons is 68 | the *N50 length*, a weighted median, where you weight each contig by its 69 | length. This way you assign more weight to larger contigs. Fifty percent of all 70 | the bases in the assembly are contained in contigs shorter than or equal to the N50 71 | length. Once you have gotten an idea of what all the stats mean, it is time 72 | to compare your results with the other attendees of this workshop. Generate the results and copy them to the doc_:: 73 | 74 | assemstats 100 out_*/contigs.fa 75 | 76 | Do the same for the cut-off at 1000 and add it to the doc_. Compare your kmer 77 | against the others. If there are very few results available yet, this would be an 78 | ideal time to help out some other attendees or do the same exercise for a kmer 79 | that has not been picked by somebody else yet. Please write down your and your 80 | partner's names again at the doc_ in that case. 81 | 82 | 83 | **Question: What are the important length statistics? Do we prefer sum over 84 | length? Should it be a combination?** 85 | 86 | Think of a formula that could indicate the best preferred 87 | length distribution where you express the optimization function in terms of the 88 | column names from the doc_.
For instance only ``n50_len`` or ``sum * 89 | n50_len``. 90 | 91 | 92 | (Optional exercise) Ray 93 | ======================= 94 | Try to create an assembly with Ray over the same kmer. Ray is an assembler that 95 | uses MPI to distribute the assembly over multiple cores and nodes. The latest 96 | version of Ray was made to work well with metagenomics data as well:: 97 | 98 | mkdir -p ~/asm-workshop/ray 99 | cd ~/asm-workshop/ray 100 | ln -s ../sickle/qtrim1.fastq pair1.fastq 101 | ln -s ../sickle/qtrim2.fastq pair2.fastq 102 | mpiexec -n 1 Ray -k 21 -p pair1.fastq pair2.fastq -o out_21 103 | 104 | Add the ``assemstats`` results to the doc_ as you did for Velvet. There is a 105 | separate tab for the Ray assemblies, compare the results with Velvet. 106 | 107 | (Optional exercise) VelvetOptimiser 108 | =================================== 109 | VelvetOptimiser_ is a script that runs Velvet multiple times and follows the 110 | optimization function you give it. Use VelvetOptimiser_ to find the assembly 111 | that gets the best score for the optimization function you designed in 112 | `assemstats`_. It requires ``BioPerl``, which you can get on uppmax with 113 | ``module load BioPerl``. 114 | 115 | .. _VelvetOptimiser: https://github.com/Victorian-Bioinformatics-Consortium/VelvetOptimiser 116 | -------------------------------------------------------------------------------- /source/annotation/normalization.rst: -------------------------------------------------------------------------------- 1 | ========================================================== 2 | Normalization of count data from the metagenomic data sets 3 | ========================================================== 4 | An important aspects of working with metagenomics is to apply proper 5 | normalization procedures to the retrieved counts. There are several 6 | ways to do this, and in part the method of choice is dependent on 7 | the research question investigated, but in part also based on more 8 | philosphical considerations. Let's start with a bit of theory. 9 | 10 | Why is normalization important? 11 | =============================== 12 | Generally, sequencing data sets are not of the same size. In addition, 13 | different genes and genomes come in different sizes, which means that 14 | *at equal coverage, the number of mapped reads to a certain gene or 15 | region will be directly dependent on the length of that region*. 16 | Luckily, the latter scenario is not a huge issue for Pfam families 17 | (although it exists), and we will not care about it more today. We 18 | will however care about the size of the sequencing libraries. To make 19 | relatively fair comparisons between sets, we need to normalize the 20 | gene counts to something. Let's begin with checking how unequal the 21 | librairies are. You can do that by counting the number of sequences 22 | in the FASTA files, by checking for the number of ">" characters in 23 | each file, using ``grep``:: 24 | 25 | grep -c ">" 26 | 27 | As you will see, there are quite substantial differences in the 28 | number of reads in each library. How do we account for that? 29 | 30 | What normalization methods are possible? 31 | ======================================== 32 | 33 | The choice of normalization method will depend on what research 34 | question we want to ask. An easy way of removing the technical 35 | bias related to different sequencing effort in different libraries 36 | is to simply divide each gene count with the total library size. 
37 | That will yield a relative proportion of counts to that gene. To 38 | make that number easier to interpret, we can multiply it by 39 | 1,000,000 to get *the number of reads corresponding to that gene 40 | or feature per million reads*. 41 | 42 | (counts of gene X / total number of reads) * 1000000 43 | 44 | This is a quick way of normalizing, but it does not consider 45 | the composition of the sample. Say that you are interested in 46 | studying bacterial gene content within e.g. different plant hosts. 47 | Then the interesting changes in bacterial composition might be 48 | drowned by genetic material from the host plant. That will then 49 | have a huge impact on the gene abundances of the bacteria, even if 50 | those abundances are actually the same. The same applies to complex 51 | microbial communities with both bacteria, single-cell eukaryotes 52 | and viruses. In such cases, it might be better to consider a 53 | normalization to the number of bacteria in the sample (or eukaryotes 54 | if that is what you want to study). One way of doing that is to 55 | count the number of reads mapping to the 16S rRNA gene in each 56 | sample. You can then divide each gene count with the number of 57 | 16S rRNA counts, to yield a genes per 16S proportion. 58 | 59 | (counts of gene X / counts of 16S rRNA gene) 60 | 61 | There is a few problems with using the 16S rRNA gene in this way. 62 | The most prominient one is that the gene exists in a single copy in 63 | some bacteria, but in multiple (sometimes >10) copies in other 64 | species. That means that this number will not truly be a per-genome 65 | estimate. Other genetic markers, such as the *rpoB* gene has been 66 | suggested for this, but has not yet taken off. 67 | 68 | Finally, we could imagine a scenario in which you are only 69 | interested in the proportion of different annotated features in 70 | your sample. One can then instead divide to the total number of 71 | reads mapped to *something* in the database used. That will give 72 | relative proportions, and will remove a lot of "noise", but will 73 | have the limitation that only the well-defined part of the 74 | microbial community can be studied, and the rest is ignored. 75 | 76 | (counts of gene X / total number of mapped reads) 77 | 78 | 79 | Trying out some normalization methods 80 | ===================================== 81 | We are now ready to try out these methods on our data. Let's begin 82 | generating the numbers we need for normalization. We begin with the 83 | library sizes. As you remember, those numbers can be generated using 84 | ``grep``:: 85 | 86 | grep -c ">" 87 | 88 | To get the number of 16S rRNA sequences, we will use Metaxa2. If you 89 | did not install it, you can "cheat" by getting the numbers from this 90 | file: ``/proj/g2014113/metagenomics/annotation/metaxa2_16S_rRNA_counts.txt``. 91 | If you installed it previously, you can test it out using the following 92 | command:: 93 | 94 | metaxa2 -i -o --cpu 16 --align none 95 | 96 | Metaxa2 will take a few minutes to run. You will then be able to 97 | get the number of bacterial 16S rRNA sequences from the file ending 98 | with .summary.txt. 99 | 100 | Finally, we would like to get the number of reads mapping to *any* 101 | Pfam family in the database. To get that number, we can again use 102 | ``grep``. This time however, we will use it to *remove* the entries 103 | that we are not interested in, and counting the rest. 
This can be 104 | done by:: 105 | 106 | grep -c -v "^#" 107 | 108 | That will remove all lines beginning with a ``#`` character, and 109 | count all remaining lines. Write down all the numbers that you have 110 | got during this exercise; we will use them in the next step! 111 | -------------------------------------------------------------------------------- /source/assembly/reqs.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Checking required software 3 | ========================================== 4 | An often occurring theme in bioinformatics is installing software. Here we will 5 | go over some steps to help you check whether you actually have the right 6 | software installed. There's an optional exercise on how to install ``sickle``. 7 | 8 | Programs used in this workshop 9 | ============================== 10 | The following programs are used in this workshop: 11 | 12 | - Bowtie2_ 13 | - Velvet_ 14 | - samtools_ 15 | - sickle_ 16 | - Picard_ 17 | - Ray_ 18 | 19 | .. _Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml 20 | .. _Velvet: http://www.ebi.ac.uk/~zerbino/velvet/ 21 | .. _xclip: http://sourceforge.net/projects/xclip/ 22 | .. _parallel: https://www.gnu.org/software/parallel/ 23 | .. _samtools: http://samtools.sourceforge.net/ 24 | .. _CD-HIT: https://code.google.com/p/cdhit/ 25 | .. _AMOS: http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS 26 | .. _sickle: https://github.com/najoshi/sickle 27 | .. _Picard: http://picard.sourceforge.net/index.shtml 28 | .. _Ray: http://denovoassembler.sourceforge.net/ 29 | 30 | All programs are already installed; all you have to do is load the virtual 31 | environment for this workshop. Once you are logged in to the server run:: 32 | 33 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 34 | 35 | You deactivate the virtual environment with:: 36 | 37 | deactivate 38 | 39 | NOTE: This is a Python virtual environment. The binary folder of the virtual 40 | environment has symbolic links to all programs used in this workshop so you 41 | should be able to run those without problems. 42 | 43 | 44 | Using which to locate a program 45 | =============================== 46 | An easy way to determine whether you have a certain program installed is 47 | by typing:: 48 | 49 | which programname 50 | 51 | where ``programname`` is the name of the program you want to use. The program 52 | ``which`` searches all directories in ``$PATH`` for the executable file 53 | ``programname`` and returns the path of the first found hit. This is exactly 54 | what happens when you would just type ``programname`` on the command line, but 55 | then ``programname`` is also executed. To see what your ``$PATH`` looks like, 56 | simply ``echo`` it:: 57 | 58 | echo $PATH 59 | 60 | For more information on the ``$PATH`` variable see this link: 61 | http://www.linfo.org/path_env_var.html. 62 | 63 | Check all programs in one go with which 64 | ================================================== 65 | To check whether you have all programs installed in one go, you can use ``which``. 66 | 67 | bowtie2 68 | bowtie2-build 69 | velveth 70 | velvetg 71 | shuffleSequences_fastq.pl 72 | parallel 73 | samtools 74 | Ray 75 | 76 | 77 | We will now iterate over all the programs, calling ``which`` on each of them.
78 | First make a variable containing all programs separated by whitespace:: 79 | 80 | $ req_progs="bowtie2 bowtie2-build velveth velvetg parallel samtools shuffleSequences_fastq.pl Ray" 81 | $ echo $req_progs 82 | bowtie2 bowtie2-build velveth velvetg parallel samtools shuffleSequences_fastq.pl 83 | 84 | Now iterate over the variable ``req_progs`` and call which:: 85 | 86 | $ for p in $req_progs; do which $p || echo $p not in PATH; done 87 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/bowtie2 88 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/bowtie2-build 89 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/velveth 90 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/velvetg 91 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/parallel 92 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/samtools 93 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/shuffleSequences_fastq.pl 94 | 95 | In Unix-like systems, a program that successfully completes its tasks should 96 | return a zero exit status. For the program ``which`` that is the case if the 97 | program is found. The ``||`` character does not mean *pipe the output onward* as 98 | you are probably familiar with (otherwise see 99 | http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-4.html), but checks whether the 100 | command before it exits successfully and executes the part behind it if not. 101 | 102 | If any of the required programs are missing, try to install them yourself or 103 | ask. If you are having trouble following these examples, try to find some bash 104 | tutorials online next time you have some time to kill. Educating yourself on 105 | how to use the command line effectively increases your productivity immensely. 106 | 107 | Some bash resources: 108 | 109 | - Excellent bash tutorial http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html 110 | - Blog post on pipes for NGS http://www.vincebuffalo.com/2013/08/08/the-mighty-named-pipe.html 111 | - Using bash and GNU parallel for NGS http://bit.ly/gwbash 112 | 113 | (Optional exercise) Install sickle by yourself 114 | =============================================== 115 | Follow these steps only if you want to install ``sickle`` by yourself. 116 | Installation procedures of research software often follow the same pattern. 117 | Download the code, *compile* it and copy the binary to a location in your 118 | ``$PATH``. The code for sickle is on https://github.com/najoshi/sickle. I 119 | prefer *compiling* my programs in ``~/src`` and then copying the resulting 120 | program to my ``~/bin`` directory, which is in my ``$PATH``. This should get 121 | you a long way:: 122 | 123 | mkdir -p ~/src 124 | 125 | # Go to the source directory and clone the sickle repository 126 | cd ~/src 127 | git clone https://github.com/najoshi/sickle 128 | cd sickle 129 | 130 | # Compile the program 131 | make 132 | 133 | # Create a bin directory 134 | mkdir -p ~/bin 135 | cp sickle ~/bin 136 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/genecoverage.rst: -------------------------------------------------------------------------------- 1 | ============================================== 2 | Determine gene coverage in metagenomic samples 3 | ============================================== 4 | Ok, now that we know what functions are represented in the combined samples (we 5 | could call it the Baltic meta-community, i.e.
a community of communities), we 6 | may want to know how much of the different functions (COG families and classes) 7 | are present in the different samples, since this will likely change between 8 | seasons. To do this we first map the reads from the different samples against 9 | the contigs. We will use the mapping script that we used this morning. First 10 | create a directory and cd there:: 11 | 12 | mkdir -p ~/metagenomics/cfa/map 13 | cd ~/metagenomics/cfa/map 14 | 15 | Copy the contig file and build an index on it for bowtie2:: 16 | 17 | cp /proj/g2014113/metagenomics/cfa/assembly/baltic-sea-ray-noscaf-41.1000.fa . 18 | bowtie2-build baltic-sea-ray-noscaf-41.1000.fa baltic-sea-ray-noscaf-41.1000.fa 19 | 20 | You will end up with various baltic-sea-ray-noscaf-41.1000.fa.*.bt2 files that 21 | represent the index for the assembly. It allows for faster alignment of 22 | sequences, see http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform 23 | for more information. 24 | 25 | Now we will use some crazy bash for loop to map all the reads. This actually 26 | prints the mapping command instead of executing, because it takes too much time 27 | to run it, it does however create the directories:: 28 | 29 | for s in /proj/g2014113/metagenomics/cfa/reads/*_R1.fastq.gz; do 30 | echo mkdir -p $(basename $s _R1.fastq.gz) 31 | echo cd $(basename $s _R1.fastq.gz) 32 | echo map-bowtie2-markduplicates.sh -ct 1 \ 33 | $s ${s/R1/R2} pair \ 34 | ~/metagenomics/cfa/map/baltic-sea-ray-noscaf-41.1000.fa asm \ 35 | bowtie2 36 | echo cd .. 37 | done 38 | 39 | The for loop iterates over all the first mates of the pairs. It then creates a 40 | directory using the basename of the pair with the directory part and the 41 | postfix removed, goes to that dir and runs ``map-bowtie2-markduplicates.sh`` on 42 | both mates of the pair. Try to change the for loop such that it only maps one 43 | sample. 44 | 45 | **Question: Can you think of an easy way to parallelize this?** 46 | .. Add an & after bowtie2 47 | 48 | For more examples on how to parallelize this check: http://bit.ly/gwbash 49 | 50 | If you sort of understand what's going on in the for loop above you are welcome 51 | to copy the data that we have already generated:: 52 | 53 | cp -r /proj/g2014113/metagenomics/cfa/map/* ~/metagenomics/cfa/map/ 54 | 55 | Take a look at the files that have been created. Check 56 | ``map-bowtie2-markduplicates.sh -h`` for an explanation of the different files. 57 | 58 | **Question what is the mean coverage for contig-394 in sample 0328?** 59 | 60 | .. 0 61 | 62 | Next we want to figure out the coverage for every gene in every contig per 63 | sample. We will use the bedtools coverage command within the BEDTools suite 64 | (https://code.google.com/p/bedtools/) that can parse a SAM/BAM file and a gff 65 | file to extract coverage information for every gene:: 66 | 67 | mkdir -p ~/metagenomics/cfa/coverage-hist-per-feature-per-sample 68 | cd ~/metagenomics/cfa/coverage-hist-per-feature-per-sample 69 | 70 | Run bedtools coverage on one sample (~4m):: 71 | 72 | for s in 0328; do 73 | bedtools coverage -hist -abam ~/metagenomics/cfa/map/$s/bowtie2/asm_pair-smds.bam \ 74 | -b ../prodigal/baltic-sea-ray-noscaf-41.1000.gff \ 75 | > $s-baltic-sea-ray-noscaf-41.1000.gff.coverage.hist 76 | done 77 | 78 | Copy the other ones:: 79 | 80 | cp /proj/g2014113/metagenomics/cfa/map/coverage-hist-per-feature-per-sample/* . 81 | 82 | Have a look at which files have been created with less again. 
The final four 83 | columns give you the histogram i.e. coverage, number of bases with that 84 | coverage, length of the contig/feature/gene, bases with that coverage expressed 85 | as a ratio of the length of the contig/feature/gene. 86 | 87 | Now what we want to is do is to extract the mean coverage per COG instead of 88 | per gene. Remember that multiple genes can belong to the same COG so we will 89 | take the sum of the mean coverage from those genes. We will use the script 90 | ``br-sum-mean-cov-per-cog.py`` for that. First make a directory 91 | again and go there:: 92 | 93 | mkdir -p ~/metagenomics/cfa/cog-sum-mean-cov 94 | cd ~/metagenomics/cfa/cog-sum-mean-cov 95 | 96 | The script expects a file with one samplename per line so we will create an 97 | array with those sample names 98 | (http://www.thegeekstuff.com/2010/06/bash-array-tutorial/):: 99 | 100 | samplenames=(0328 0403 0423 0531 0619 0705 0709 1001 1004 1028 1123) 101 | echo ${samplenames[*]} 102 | 103 | Now we can use process substitution to give the script those sample names 104 | without having to store it to a file first. 105 | 106 | **Question: What is the difference between the following statements?**:: 107 | 108 | echo ${samplenames[*]} 109 | cat <(echo ${samplenames[*]}) 110 | cat <(echo ${samplenames[*]} | tr ' ' '\n') 111 | 112 | .. First one just echoes 113 | second one concatenates the contents of the "file" with samplenames to stdout 114 | the last one adds newlines 115 | 116 | Run the the script that computes the sum of mean coverages per COG (~2m47):: 117 | 118 | br-sum-mean-cov-per-cog.py --samplenames <(echo ${samplenames[*]} | tr ' ' '\n') \ 119 | ../prodigal/baltic-sea-ray-noscaf-41.1000.gff ../prodigal/baltic-sea-ray-noscaf-41.1000.aa.fa \ 120 | ../wmga-cog/output.2 ../coverage-hist-per-feature-per-sample/*.gff.coverage.hist \ 121 | > cog-sum-mean-cov.tsv 122 | 123 | Have a look at the table with less -S again. 124 | 125 | **Question: What is the sum of mean coverages for COG0038 in sample 0423?** 126 | 127 | .. 1.8215488 128 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/compare.rst: -------------------------------------------------------------------------------- 1 | ================================================= 2 | Comparative functional analysis with R 3 | ================================================= 4 | Having this table one can use different statistical and visualisation software 5 | to analyse the results. One option would be to import a simpler version of the 6 | table into the program Fantom, a graphical user interface program developed for 7 | comparative analysis of metagenome data. You can try this in the end of the day 8 | if you have time. 9 | 10 | But here we will use the statistical programming language R to do some simple 11 | analysis. cd to the directory where you have the cog-sum-mean-cov.tsv file. 
12 | Then start R:: 13 | 14 | cd ~/metagenomics/cfa 15 | R 16 | 17 | and import the data:: 18 | 19 | tab_cog <- read.delim("cog-sum-mean-cov/cog-sum-mean-cov.tsv") 20 | 21 | Assign the different columns with descriptors to vectors of logical names:: 22 | 23 | cogf <- tab_cog[,1] # cog family 24 | cogfd <- tab_cog[,2] # cog family descriptor 25 | cogc <- tab_cog[,3] # cog class 26 | cogcd <- tab_cog[,4] # cog class descriptor 27 | 28 | Make a matrix with the coverages of the cog families:: 29 | 30 | cogf_cov <- as.matrix(tab_cog[,5:ncol(tab_cog)]) # coverage in the different samples 31 | 32 | And why not put sample names into a vector as well:: 33 | 34 | sample <- colnames(cogf_cov) 35 | sample 36 | 37 | Let’s clean the sample names a bit:: 38 | 39 | for (i in 1:length(sample)) { 40 | sample[i] <- matrix(unlist(strsplit(sample[i],"_")), 1)[1,4] 41 | } 42 | 43 | Since the coverages will differ depending on how many reads per sample we have 44 | we can normalise by dividing the coverages by the total coverage for the sample 45 | (only considering cog-annotated genes though):: 46 | 47 | for (i in 1:ncol(cogf_cov)) { 48 | cogf_cov[,i] <- cogf_cov[,i]/sum(cogf_cov[,i]) 49 | } 50 | 51 | The cogf_cov gives coverage per cog family. Let’s summarise within cog classes 52 | and make a separate matrix for that:: 53 | 54 | unique_cogc <- levels(cogc) 55 | cogc_cov <- matrix(ncol = length(sample), nrow = length(unique_cogc)) 56 | colnames(cogc_cov) <- sample 57 | rownames(cogc_cov) <- unique_cogc 58 | for (i in 1:length(unique_cogc)) { 59 | these <- grep(paste("^", unique_cogc[i],"$", sep = ""), cogc) 60 | for (j in 1:ncol(cogf_cov)) { 61 | cogc_cov[i,j] <- sum(cogf_cov[these,j]) 62 | } 63 | } 64 | 65 | 66 | OK, now let’s start playing with the data. We can for example do a pairwise 67 | plot of coverage of cog classes in sample1 vs. sample2:: 68 | 69 | plot(cogc_cov[,1], cogc_cov[,2]) 70 | 71 | or make a stacked barplot showing the different classes in the different 72 | samples:: 73 | 74 | barplot(cogf_cov, col = rainbow(100), border=NA) 75 | barplot(cogc_cov, col = rainbow(10), border=NA) 76 | 77 | The vegan package contains many nice functions for doing (microbial) ecology 78 | analysis. Load vegan:: 79 | 80 | install.packages("vegan") # not necessary if already installed 81 | library(vegan) 82 | 83 | If installing doesn't work for you have a look here 84 | http://www.stat.osu.edu/computer-support/mathstatistics-packages/installing-r-libraries-locally-your-home-directory 85 | 86 | We can calculate pairwise distances between the samples based on their 87 | functional composition. In ecology pairwise distance between samples is 88 | referred to as beta-diversity, although typically based on taxonomic 89 | composition rather than functional:: 90 | 91 | cogf_dist <- as.matrix(vegdist(t(cogf_cov), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE)) 92 | cogc_dist <- as.matrix(vegdist(t(cogc_cov), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE)) 93 | 94 | You can visualise the distance matrices as a heatmaps:: 95 | 96 | image(cogf_dist) 97 | image(cogc_dist) 98 | 99 | Are the distances calculated on the different functional levels correlated?:: 100 | 101 | plot(cogc_dist, cogf_dist) 102 | 103 | Now let’s cluster the samples based on the distances with hierarchical 104 | clustering. 
We use the function "agnes" in the "cluster" library and apply 105 | average linkage clustering:: 106 | 107 | install.packages("cluster") # not necessary if already installed 108 | library(cluster) 109 | 110 | cluster <- agnes(cogf_dist, diss = TRUE, method = "average") 111 | plot(cluster, which.plots = 2, hang = -1, label = sample, main = "", axes = FALSE, xlab = "", ylab = "", sub = "") 112 | 113 | Alternatively you can use the function heatmap, which calculates distances both 114 | between samples and between features and clusters in two dimensions:: 115 | 116 | heatmap(cogf_dist, scale = "none") 117 | heatmap(cogc_dist, scale = "none") 118 | 119 | And let’s ordinate the data in two dimensions. This can be done e.g. by PCA 120 | based on the actual coverage values, or by e.g. PCoA or NMDS (non-metric 121 | multidimensional scaling). Let's do NMDS:: 122 | 123 | mds <- metaMDS(cogf_dist) 124 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 2) 125 | 126 | We can color the samples according to date (provided your samples are ordered 127 | according to date). There are some nice color scales to choose from here 128 | http://colorbrewer2.org/:: 129 | 130 | install.packages("RColorBrewer") # not necessary if already installed 131 | library(RColorBrewer) 132 | color = brewer.pal(length(sample), "Reds") # or select another color scale! 133 | 134 | mds <- metaMDS(cogf_dist) 135 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 136 | 137 | Let’s compare with how it looks if we base the clustering on COG class coverage 138 | instead:: 139 | 140 | mds <- metaMDS(cogc_dist) 141 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 142 | 143 | In addition to these examples there are of course infinite ways to analyse the 144 | results in R. One could for instance find COGs that significantly differ in 145 | abundance between samples, do different types of correlations between metadata 146 | (nutrients, temperature, etc) and functions, etc. Leave your R window open, 147 | since we will compare these results with taxonomic data in a bit. 148 | -------------------------------------------------------------------------------- /mako/templates/binning/concoct.rst: -------------------------------------------------------------------------------- 1 | ========================================================== 2 | CONCOCT - Clustering cONtigs with COverage and ComposiTion 3 | ========================================================== 4 | In this exercise you will learn how to use a new software package for automatic and unsupervised binning of metagenomic contigs, called CONCOCT. 5 | CONCOCT uses a statistical model called a Gaussian Mixture Model to cluster sequences based on their tetranucleotide frequencies and their average coverage over multiple samples. 6 | 7 | The theory behind using the coverage pattern is that sequences having a similar coverage pattern over multiple samples are likely to belong to the same species. 8 | Species having a similar abundance pattern in the samples can hopefully be separated by the tetranucleotide frequencies. 9 | 10 | We will be working with an assembly made using only the reads from this single sample, but since CONCOCT is constructed to be run using the coverage profile over multiple samples, we'll be investigating how the performance is affected if we add several other samples.
11 | This is done by mapping the reads from the other samples to the contigs resulting from this single sample assembly. 12 | 13 | 14 | Getting to know the test data set 15 | ================================= 16 | Today we'll be working on a metagenomic data set from the Baltic Sea. 17 | The sample we'll be using is part of a time series study, where the same location has been sampled twice weekly during 2013. This specific sample was taken March 22. 18 | 19 | Start by copying the contigs to your working directory:: 20 | 21 | ${'\n '.join(commands['copy_dataset'])} 22 | 23 | You should now have one fasta file containing all contigs, in this case only contigs longer than 1000 bases are included to save space, and one comma separated file containing the coverage profiles for each contig. 24 | Let's have a look at the coverage profiles:: 25 | 26 | ${commands['browse_coverage']} 27 | 28 | Try to find the column corresponding to March 22 and compare this column to the other ones. Can you draw any conclusions from this comparison? 29 | 30 | We'd like to first run concoct using only one sample, so we remove all other columns in the coverage table to create this new coverage file:: 31 | 32 | ${commands['cut_coverage']} 33 | 34 | Running CONCOCT 35 | =============== 36 | CONCOCT takes a number of parameters that you got a glimpse of earlier by running:: 37 | 38 | ${commands['check_activate']['concoct']} 39 | 40 | The contigs will be input as the composition file and the coverage file obviously as the coverage file. The output path is given as the -b (--basename) parameter, where it is important to include a trailing slash if we want to create an output directory containing all result files. 41 | Last but not least we will set the length threshold to 3000 to speed up the clustering (the fewer contigs we use, the shorter the runtime):: 42 | 43 | ${'\n '.join(commands['run_concoct']['one_sample'])} 44 | 45 | This command will normally take a couple of minutes to finish. When it is done, check the output directory and try to figure out what the different files contain. 46 | Especially, have a look at the main output file:: 47 | 48 | ${commands['run_concoct']['look_clustering']} 49 | 50 | This file gives you the cluster id for each contig that was included in the clustering, in this case all contigs longer than 3000 bases. 51 | 52 | For the comparison we will now run concoct again, using the coverage profile over all samples in the time series:: 53 | 54 | ${commands['run_concoct']['all_samples']} 55 | 56 | Have a look at the output from this clustering as well; do you notice anything different? 57 | 58 | Evaluating Clustering Results 59 | ============================= 60 | One way of evaluating the resulting clusters is to look at the distribution of so-called Single Copy Genes (SCGs), genes that are present in all bacteria and archaea in only one copy. 61 | With this background, a complete and correct bin should have exactly one copy of each gene present, while missing genes indicate an incomplete bin and several copies of the same gene indicate a chimeric cluster. 62 | To predict genes in prokaryotes, we use Prodigal; the predicted genes are then used as query sequences for an RPS-BLAST search against the Clusters of Orthologous Groups (COG) database. 63 | This RPS-BLAST search takes about an hour and a half for our dataset so we're going to use a precomputed result file.
64 | Copy this result file along with two files necessary for the COG counting scripts:: 65 | 66 | ${'\n '.join(commands['copy_blast'])} 67 | 68 | Before moving on, we need to install some R packages. Please run these commands line by line:: 69 | 70 | R 71 | install.packages("ggplot2") 72 | install.packages("reshape") 73 | install.packages("getopt") 74 | q() 75 | 76 | With the CONCOCT distribution come scripts for parsing this output and creating a plot where each COG present in the data is grouped according to the clustering results, namely COG_table.py and COGPlot.R. These scripts are added to the virtual environment; try checking out their usage:: 77 | 78 | ${'\n '.join(commands['check_cog_scripts'])} 79 | 80 | Let's first create a plot for the single sample run:: 81 | 82 | ${'\n '.join(commands['cogplot_single'])} 83 | 84 | This command might not work for some R-related reason. If you've spent more time trying to get it to work than you wish to, just copy the results from the workshop directory:: 85 | 86 | cp /proj/g2014113/nobackup/concoct-workshop/cogplots/* ~/binning-workshop/ 87 | 88 | 89 | This command should have created a pdf file with your plot. In order to look at it, you can download it to your personal computer with scp. Note! You need to run this in a separate terminal window where you are not logged in to Uppmax:: 90 | 91 | ${commands['download_single_cogplot']} 92 | 93 | Have a look at the plot and try to figure out if the clustering was successful or not. Which clusters are good? Which clusters are bad? Are all clusters present in the plot? 94 | Now, let's do the same thing for the multiple samples run:: 95 | 96 | ${'\n '.join(commands['cogplot_multiple'])} 97 | 98 | And download again from your separate terminal window:: 99 | 100 | ${commands['download_multiple_cogplot']} 101 | 102 | What differences can you observe for these plots? Think about how we were able to use samples not included in the assembly in order to create a different clustering result. Can this be done with any samples? 103 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/compare.rst: -------------------------------------------------------------------------------- 1 | ======================================== 2 | Comparative taxonomic analysis with R 3 | ======================================== 4 | KRONA is not very good for comparing multiple samples though. Instead we will 5 | use R, as we did for the functional data. First combine the data from the different 6 | samples with the script sum_rdp_annot.pl (made by us) into a table. By default 7 | the script uses a bootstrap support of 0.80 to include a taxonomic level (this 8 | can be changed easily by changing the number on row 16 in the script). You may 9 | need to change the input file names in the beginning of the script ($in_file[0] 10 | = ...).
The script will only import the 16S bacterial data:: 11 | 12 | cd ~/metagenomics/cta/rdp 13 | sum_rdp_annot.pl > summary.rrna.silva-bac-16s-database-id85.fasta.class.0.80.tsv 14 | 15 | Let’s import this table into R:: 16 | 17 | tab_tax <- read.delim("summary.rrna.silva-bac-16s-database-id85.fasta.class.0.80.tsv") 18 | 19 | And assign the descriptor column to a vector:: 20 | 21 | tax <- tab_tax[,1] 22 | 23 | And put the counts into a matrix:: 24 | 25 | tax_counts <- tab_tax[,2:ncol(tab_tax)] # counts of taxa in the different samples 26 | 27 | Since you will compare this dataset with the functional dataset you generated 28 | before it's great if the samples come in the same order. Check the previous 29 | order:: 30 | 31 | sample 32 | 33 | And the current order:: 34 | 35 | colnames(tax_counts) 36 | 37 | if they are not in the same order contact an assistant for help. Otherwise:: 38 | 39 | colnames(tax_counts) <- sample 40 | rownames(tax_counts) <- tax 41 | 42 | Make a normalised version of tax_counts:: 43 | 44 | norm_tax_counts <- tax_counts 45 | for (i in 1:ncol(tax_counts)) { 46 | norm_tax_counts[,i] <- tax_counts[,i]/sum(tax_counts[,i]) 47 | } 48 | 49 | What different taxa do we have there?:: 50 | 51 | tax 52 | 53 | From the tax_counts matrix we can create new matrices at defined taxonomic 54 | levels. If you open the text file /proj/g2013206/metagenomics/r_commands.txt 55 | you can copy and paste all of this code into R (or use the source command) and 56 | this will give you the matrices and vectors below (check carefully that you 57 | didn’t get any error messages!):: 58 | 59 | phylum_counts matrix with counts for different phyla 60 | norm_phylum_counts normalised version of phylum_count 61 | phylum vector with phyla (same order as in phyla matrix) 62 | 63 | class_counts matrix with counts for different classes 64 | norm_class_counts normalised version of class_count 65 | class vector with classes 66 | 67 | phylumclass_counts matrix with counts for different phyla and proteobacteria classes 68 | norm_phylumclass_counts normalised version of phylumclass_count 69 | phylumclass vector with phyla and proteobacteria classes 70 | 71 | The “other” clade in each of the above sums reads that were not classified at 72 | the defined level. Having these more well defined matrices we can compare the 73 | taxonomic composition in the samples. We can apply the commands that we did for 74 | the functional analysis:: 75 | 76 | library(vegan) 77 | library(RColorBrewer) 78 | 79 | Barplots:: 80 | 81 | par(mar=c(1,2,1,22)) # Increase the MARgins, to make space for a legend 82 | 83 | barplot(norm_phylum_counts, col = rainbow(11), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+25, y=1, adj=c(0,0))) 84 | barplot(norm_phylumclass_counts, col = rainbow(15), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+30, y=1, adj=c(0,0))) 85 | barplot(norm_class_counts, col = rainbow(18), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+32, y=1, adj=c(0,0))) 86 | 87 | If you can't see the legends, they're just in a bad position. Try altering the x and y parameters in the args.legend. 
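For example, here is a minimal sketch of such a tweak that just pushes the legend further to the right and slightly down; it assumes the ``norm_class_counts`` matrix from above is still loaded, and the exact offsets are only guesses that you will want to adjust for your own plot window::

    # shift the legend: a larger x moves it right, a smaller y moves it down
    barplot(norm_class_counts, col = rainbow(18), legend.text = TRUE,
            args.legend = list(x = ncol(norm_class_counts) + 40, y = 0.8, adj = c(0, 0)))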
88 | 89 | Calculate beta-diversity based on class-level taxonomic counts:: 90 | 91 | class_dist <- as.matrix(vegdist(t(norm_class_counts[1:(nrow(norm_class_counts) - 1),]), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE)) 92 | 93 | Note that by "[1:(nrow(norm_class_counts) - 1),]" we exclude the last row in 94 | norm_class_counts when we calculate the distances because this is the "others" 95 | column that contains all kinds of unclassified taxa. 96 | 97 | Hierarchical clustering:: 98 | 99 | library(cluster) 100 | cluster <- agnes(class_dist, diss = TRUE, method = "average") 101 | plot(cluster, which.plots = 2, hang = -1) 102 | 103 | Heatmaps with clusterings:: 104 | 105 | heatmap(norm_class_counts, scale = "none") 106 | heatmap(norm_phylumclass_counts, scale = "none") 107 | 108 | And ordinate the data by NMDS:: 109 | 110 | color = brewer.pal(length(sample), "Reds") 111 | mds <- metaMDS(class_dist) 112 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 113 | 114 | Does the pattern look similar as that obtained by functional data?:: 115 | 116 | mds <- metaMDS(cogc_dist) 117 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 118 | 119 | We can actually check how beta diversity generated by the two approaches is 120 | correlated:: 121 | 122 | plot(cogf_dist, class_dist) 123 | cor.test(cogf_dist, class_dist) 124 | 125 | (For comparing matrices it is common to use a mantel test, but the r-value (but 126 | not the p-value) is in fact the same.) 127 | 128 | Finally, let’s check how alpha-diversity fluctuates over the year and compares 129 | between taxonomic and functional data. Since alpha-diversity is influenced by 130 | sample size it is advisable to subsample the datasets to the same number of 131 | reads. We can make a subsampled table using the vegan function rrarefy:: 132 | 133 | sub_class_counts <- t(rrarefy(t(class_counts), 100)) 134 | 135 | This will be difficult to achieve for the functional data at this point, 136 | however, so let’s skip that for the functional data. 137 | 138 | Let’s use Shannon diversity index since this is pretty insensitive to sample 139 | size. Shannon index combines richness (number of species) and evenness (how 140 | evenly the species are distributed); many, evenly distributed species gives a 141 | high Shannon. There is a vegan function for getting shannon:: 142 | 143 | class_shannon <- diversity(class_counts[1:(nrow(norm_class_counts) - 1),], MARGIN = 2) 144 | sub_class_shannon <- diversity(sub_class_counts[1:(nrow(norm_class_counts) - 1),], MARGIN = 2) 145 | cogf_shannon <- diversity(cogf_cov, MARGIN = 2) 146 | 147 | How does subsampling influence shannon?:: 148 | 149 | plot(class_shannon, sub_class_shannon) 150 | 151 | Is functional and taxonomic shannon correlated?:: 152 | 153 | plot(sub_class_shannon, cogf_shannon) 154 | cor.test(sub_class_shannon, cogf_shannon) 155 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 
5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = build 9 | MILOU_EXPORT = /proj/g2014113/webexport/ 10 | 11 | # User-friendly check for sphinx-build 12 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 13 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 14 | endif 15 | 16 | # Internal variables. 17 | PAPEROPT_a4 = -D latex_paper_size=a4 18 | PAPEROPT_letter = -D latex_paper_size=letter 19 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source 20 | # the i18n builder cannot share the environment and doctrees with the others 21 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source 22 | 23 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 24 | 25 | help: 26 | @echo "Please use \`make ' where is one of" 27 | @echo " html to make standalone HTML files" 28 | @echo " dirhtml to make HTML files named index.html in directories" 29 | @echo " singlehtml to make a single large HTML file" 30 | @echo " pickle to make pickle files" 31 | @echo " json to make JSON files" 32 | @echo " htmlhelp to make HTML files and a HTML help project" 33 | @echo " qthelp to make HTML files and a qthelp project" 34 | @echo " devhelp to make HTML files and a Devhelp project" 35 | @echo " epub to make an epub" 36 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 37 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 38 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 39 | @echo " text to make text files" 40 | @echo " man to make manual pages" 41 | @echo " texinfo to make Texinfo files" 42 | @echo " info to make Texinfo files and run them through makeinfo" 43 | @echo " gettext to make PO message catalogs" 44 | @echo " changes to make an overview of all changed/added/deprecated items" 45 | @echo " xml to make Docutils-native XML files" 46 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 47 | @echo " linkcheck to check all external links for integrity" 48 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 49 | 50 | clean: 51 | rm -rf $(BUILDDIR)/* 52 | 53 | milou: 54 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(MILOU_EXPORT) 55 | @echo 56 | @echo "Build finished. The HTML pages are in $(MILOU_EXPORT)" 57 | 58 | html: 59 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 60 | @echo 61 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 62 | 63 | dirhtml: 64 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 65 | @echo 66 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 67 | 68 | singlehtml: 69 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 70 | @echo 71 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 72 | 73 | pickle: 74 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 75 | @echo 76 | @echo "Build finished; now you can process the pickle files." 77 | 78 | json: 79 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 80 | @echo 81 | @echo "Build finished; now you can process the JSON files." 
82 | 83 | htmlhelp: 84 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 85 | @echo 86 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 87 | ".hhp project file in $(BUILDDIR)/htmlhelp." 88 | 89 | qthelp: 90 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 91 | @echo 92 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 93 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 94 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/MetagenomicsWorkshopSciLifeLab.qhcp" 95 | @echo "To view the help file:" 96 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/MetagenomicsWorkshopSciLifeLab.qhc" 97 | 98 | devhelp: 99 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 100 | @echo 101 | @echo "Build finished." 102 | @echo "To view the help file:" 103 | @echo "# mkdir -p $$HOME/.local/share/devhelp/MetagenomicsWorkshopSciLifeLab" 104 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/MetagenomicsWorkshopSciLifeLab" 105 | @echo "# devhelp" 106 | 107 | epub: 108 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 109 | @echo 110 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 111 | 112 | latex: 113 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 114 | @echo 115 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 116 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 117 | "(use \`make latexpdf' here to do that automatically)." 118 | 119 | latexpdf: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through pdflatex..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | latexpdfja: 126 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 127 | @echo "Running LaTeX files through platex and dvipdfmx..." 128 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 129 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 130 | 131 | text: 132 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 133 | @echo 134 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 135 | 136 | man: 137 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 138 | @echo 139 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 140 | 141 | texinfo: 142 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 143 | @echo 144 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 145 | @echo "Run \`make' in that directory to run these through makeinfo" \ 146 | "(use \`make info' here to do that automatically)." 147 | 148 | info: 149 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 150 | @echo "Running Texinfo files through makeinfo..." 151 | make -C $(BUILDDIR)/texinfo info 152 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 153 | 154 | gettext: 155 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 156 | @echo 157 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 158 | 159 | changes: 160 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 161 | @echo 162 | @echo "The overview file is in $(BUILDDIR)/changes." 163 | 164 | linkcheck: 165 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 166 | @echo 167 | @echo "Link check complete; look for any errors in the above output " \ 168 | "or in $(BUILDDIR)/linkcheck/output.txt." 
169 | 
170 | doctest:
171 | 	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
172 | 	@echo "Testing of doctests in the sources finished, look at the " \
173 | 	      "results in $(BUILDDIR)/doctest/output.txt."
174 | 
175 | xml:
176 | 	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
177 | 	@echo
178 | 	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
179 | 
180 | pseudoxml:
181 | 	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
182 | 	@echo
183 | 	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
184 | 
--------------------------------------------------------------------------------
/source/annotation/metaxa2.rst:
--------------------------------------------------------------------------------
1 | ==================================================================
2 | Bonus exercise: Using Metaxa2 to investigate the taxonomic content
3 | ==================================================================
4 | Now that we are familiar with using R, we can use it to go through
5 | another type of output generated by the Metaxa2 software. Metaxa2
6 | classifies rRNA reads at different taxonomic levels, assigning each read a
7 | taxonomic affiliation only if it can do so reliably (given the conservation
8 | between taxa, etc.). You can read more about Metaxa2 here:
9 | http://microbiology.se/software/metaxa2
10 | 
11 | Install Metaxa2
12 | ===============
13 | For this exercise to work, you need Metaxa2 installed. If you did not do this
14 | earlier, here are the installation instructions again. If you did install
15 | Metaxa2 at the beginning of the workshop, you can skip this step and move
16 | straight to the next heading!
17 | 
18 | The code for Metaxa2 is available from http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz
19 | You can install Metaxa2 as follows::
20 | 
21 |     # Create a src and a bin directory
22 |     mkdir -p ~/src
23 |     mkdir -p ~/bin
24 | 
25 |     # Go to the source directory and download the Metaxa2 tarball
26 |     cd ~/src
27 |     wget http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz
28 |     tar -xzvf Metaxa2_2.0rc3.tar.gz
29 |     cd Metaxa2_2.0rc3
30 | 
31 |     # Run the installation script
32 |     ./install_metaxa2
33 | 
34 |     # Try to run Metaxa2 (this should bring up the main options for the software)
35 |     metaxa2 -h
36 | 
37 | If this did not work, you can try this manual approach::
38 | 
39 |     cd ~/src/Metaxa2_2.0rc3
40 |     cp -r metaxa2* ~/bin/
41 | 
42 |     # Then try to run Metaxa2 again
43 |     metaxa2 -h
44 | 
45 | If this brings up the help message, you are all set!
46 | 
47 | 
48 | Generating family level taxonomic counts
49 | ========================================
50 | 
51 | If you have already run Metaxa2 to get the number of 16S rRNA sequences,
52 | you can use the output of those runs. Otherwise you need to run the
53 | following command on all the raw read data from all libraries::
54 | 
55 |     metaxa2 -i <input fastq file> -o <output name> --cpu 16 --align none
56 | 
57 | To get counts on the family level from the metaxa2 output, we will use
58 | another tool bundled with the Metaxa2 package: the Metaxa2 Taxonomic
59 | Traversal Tool (``metaxa2_ttt``). Take a look at its options by typing::
60 | 
61 |     metaxa2_ttt -h
62 | 
63 | In this exercise we are interested in bacterial counts only, so we will
64 | use the ``-t b`` option. Since we are only interested in family abundance
65 | (we have too little data to get any good genus or species counts), we will
66 | only output the data for phyla, classes, orders and families, that is, we
67 | will use the ``-m 5`` option.
As input files, you should use the files
68 | ending with ".taxonomy.txt" that Metaxa2 produced as output. That should
69 | give you a command looking like this::
70 | 
71 |     metaxa2_ttt -i <taxonomy.txt file> -o <output name> -m 5 -t b
72 | 
73 | Run this command on the taxonomy.txt files from all input libraries. It
74 | should be really quick. If you type ``ls`` you will notice that ``metaxa2_ttt``
75 | produced a bunch of .level_X.txt files. Those are the files we are going
76 | to work with next.
77 | 
78 | Visualizing family level taxonomic counts
79 | =========================================
80 | To visualize the family level counts, we will once again use R. Fire it
81 | up again and load in the count tables from Metaxa2::
82 | 
83 |     R
84 |     b1_fam = read.table("baltic1.level_5.txt", sep = "\t", row.names = 1)
85 | 
86 | Repeat this procedure for all four data sets. If you saved your workspace,
87 | the merge_four function should still be available. You can try it out on the
88 | taxonomic counts::
89 | 
90 |     all_fam = merge_four(b1_fam,b2_fam,swe_fam,ind_fam,c("Baltic 1","Baltic 2","Sweden", "India"))
91 | 
92 | Let's load in the ``gplots`` library again, and make a heatmap of the raw
93 | data::
94 | 
95 |     library(gplots)
96 |     heatmap.2(all_fam, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
97 | 
98 | As you will notice, we will need to tweak the margins a bit to fit in the taxonomic names::
99 | 
100 |     heatmap.2(all_fam, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,30), cexCol = 1, cexRow = 0.7)
101 | 
102 | 
103 | Apply normalizations
104 | ====================
105 | As you might already have guessed, taxonomic count data suffers from the same
106 | biases (for example, differences in sequencing library size) as other gene
107 | count data. To account for that, we will apply a normalization procedure.
108 | Please note that normalization methods 2 and 3 (number of 16S and number of
109 | total matches to the database) would in this case be the same. In other words,
110 | they both yield the relative abundances of the taxa. We will therefore only
111 | look at two normalization procedures in this part of the lab.
112 | 
113 | First, we will normalize to the number of reads in each sequencing library.
114 | Find the note you have taken on the data set sizes. Then apply a command like
115 | this on the data::
116 | 
117 |     b1_fam_norm1 = b1_fam / 118025 * 1000000
118 | 
119 | That will give you the 16S rRNA counts for the different families per million
120 | reads. Do the same thing for the other data sets, and merge the results with ``merge_four`` into a variable called, for example, ``fam_norm1``.
121 | 
122 | Next, we will do the same for the other type of normalization, the division
123 | by the mapped number of reads/total number of 16S rRNA. This can, once more,
124 | be done by dividing the vector by its sum::
125 | 
126 |     b1_fam_norm2 = b1_fam / sum(b1_fam)
127 | 
128 | Follow the above procedure for all the data sets, and store the final
129 | result from ``merge_four`` into a variable, for example called ``fam_norm2``.
130 | 
131 | Comparing taxonomic distributions
132 | =================================
133 | 
134 | Next we will compare the taxonomic composition of the four environments.
135 | Let's start out by just using a barplot. To get the different taxa on
136 | the x-axis, we will transform the matrix with normalized counts using the
137 | ``t()`` command.
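If you have not used ``t()`` before, here is a quick toy example (a minimal sketch; the matrix and its row and column names are made up for illustration) showing that it simply swaps rows and columns::

    m = matrix(1:6, nrow = 2,
               dimnames = list(c("sample1", "sample2"), c("famA", "famB", "famC")))
    m      # 2 rows (samples) x 3 columns (families)
    t(m)   # 3 rows (families) x 2 columns (samples)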
But first we need to set the margins to fit the taxonomic
138 | names::
139 | 
140 |     par(mar = c(25, 4, 4, 2))
141 |     barplot(t(fam_norm1), main = "Counts per million reads", las = 2, cex.names = 0.6, beside = TRUE)
142 | 
143 | We can then do the same for the relative abundances::
144 | 
145 |     barplot(t(fam_norm2), main = "Relative abundance", las = 2, cex.names = 0.6, beside = TRUE)
146 | 
147 | To only look at families present in at least two samples, we can use the
148 | following command for filtering::
149 | 
150 |     fam_norm1_filter = fam_norm1[rowSums(fam_norm1 > 0) >= 2,]
151 |     barplot(t(fam_norm1_filter), main = "Counts per million reads", las = 2, cex.names = 0.6, beside = TRUE)
152 | 
153 | 
154 | **Question: Which normalization method would be most suitable to use in this case? Why?**
155 | 
156 | 
157 | We can also look at the differences in taxonomic content using a heatmap. As before,
158 | we will use the square root as a variance stabilizing transform::
159 | 
160 |     heatmap.2(sqrt(fam_norm1), trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(30,10), cexCol = 1, cexRow = 0.7)
161 | 
162 | Finally, we can of course also use PCA on the taxonomic abundances. We will turn back to the
163 | ``prcomp`` PCA command::
164 | 
165 |     fam_norm1_pca = prcomp(sqrt(fam_norm1))
166 | 
167 | We can visualize the PCA using the ``biplot`` command::
168 | 
169 |     biplot(fam_norm1_pca, cex = 0.5)
170 | 
171 | To see the proportion of variance explained by the different components, we can use the
172 | normal plot command::
173 | 
174 |     plot(fam_norm1_pca)
175 | 
176 | **Question: Can you think of any other type of problem with the data we are using now?
177 | This problem applies to both kinds of data, but should be particularly problematic for
178 | the taxonomic counts...**
179 | 
180 | 
--------------------------------------------------------------------------------
/source/binning/concoct.rst:
--------------------------------------------------------------------------------
1 | ==========================================================
2 | CONCOCT - Clustering cONtigs with COverage and ComposiTion
3 | ==========================================================
4 | In this exercise you will learn how to use new software for automatic and unsupervised binning of metagenomic contigs, called CONCOCT.
5 | CONCOCT uses a statistical model called a Gaussian Mixture Model to cluster sequences based on their tetranucleotide frequencies and their average coverage over multiple samples.
6 | 
7 | The theory behind using the coverage pattern is that sequences with similar coverage patterns over multiple samples are likely to belong to the same species.
8 | Species with similar abundance patterns in the samples can hopefully be separated by their tetranucleotide frequencies.
9 | 
10 | We will be working with an assembly made using only the reads from this single sample, but since CONCOCT is constructed to be run using the coverage profile over multiple samples, we'll investigate how the performance is affected if we add several other samples.
11 | This is done by mapping the reads from the other samples to the contigs resulting from this single sample assembly.
12 | 
13 | 
14 | Getting to know the test data set
15 | =================================
16 | Today we'll be working on a metagenomic data set from the Baltic Sea.
17 | The sample we'll be using is part of a time series study, where the same location has been sampled twice weekly during 2013.
This specific sample was taken March 22.
18 | 
19 | Start by copying the contigs to your working directory::
20 | 
21 |     mkdir -p ~/binning-workshop
22 |     mkdir -p ~/binning-workshop/data
23 |     cd ~/binning-workshop/
24 |     cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000.fa data/
25 |     cp /proj/g2014113/nobackup/concoct-workshop/120322_coverage_nolen.tsv data/
26 | 
27 | You should now have one fasta file containing all contigs (in this case only contigs longer than 1000 bases are included, to save space) and one tab-separated file containing the coverage profiles for each contig.
28 | Let's have a look at the coverage profiles::
29 | 
30 |     less -S ~/binning-workshop/data/120322_coverage_nolen.tsv
31 | 
32 | Try to find the column corresponding to March 22 and compare this column to the other ones. Can you draw any conclusions from this comparison?
33 | 
34 | We'd like to first run concoct using only one sample, so we remove all other columns in the coverage table to create this new coverage file::
35 | 
36 |     cut -f1,3 ~/binning-workshop/data/120322_coverage_nolen.tsv > ~/binning-workshop/data/120322_coverage_one_sample.tsv
37 | 
38 | Running CONCOCT
39 | ===============
40 | CONCOCT takes a number of parameters that you got a glimpse of earlier by running::
41 | 
42 |     concoct -h
43 | 
44 | The contigs will be input as the composition file and the coverage file, obviously, as the coverage file. The output path is given as the -b (--basename) parameter, where it is important to include a trailing slash if we want to create an output directory containing all result files.
45 | Last but not least we will set the length threshold to 3000 to speed up the clustering (the fewer contigs we use, the shorter the runtime)::
46 | 
47 |     mkdir -p ~/binning-workshop/concoct_output
48 |     concoct --coverage_file ~/binning-workshop/data/120322_coverage_one_sample.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_one_sample/
49 | 
50 | This command will normally take a couple of minutes to finish. When it is done, check the output directory and try to figure out what the different files contain.
51 | Especially, have a look at the main output file::
52 | 
53 |     less ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv
54 | 
55 | This file gives you the cluster id for each contig that was included in the clustering, in this case all contigs longer than 3000 bases.
56 | 
57 | For the comparison we will now run concoct again, using the coverage profile over all samples in the time series::
58 | 
59 |     concoct --coverage_file ~/binning-workshop/data/120322_coverage_nolen.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_all_samples/
60 | 
61 | Have a look at the output from this clustering as well; do you notice anything different?
62 | 
63 | Evaluating Clustering Results
64 | =============================
65 | One way of evaluating the resulting clusters is to look at the distribution of so-called Single Copy Genes (SCGs), genes that are present in all bacteria and archaea in exactly one copy.
66 | With this background, a complete and correct bin should have exactly one copy of each such gene present, while missing genes indicate an incomplete bin and several copies of the same gene indicate a chimeric cluster.
67 | To predict (prokaryotic) genes on the contigs we use Prodigal, and the predicted genes are then used as query sequences for an RPS-BLAST search against the Clusters of Orthologous Groups (COG) database.
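To make this idea concrete, here is a small toy illustration in R (the clusters and counts below are invented for the example, they are not from our data). Given a matrix of SCG counts per cluster, completeness is the fraction of SCGs seen at least once and purity the fraction seen exactly once::

    # rows = clusters, columns = single copy genes (made-up counts)
    scg = matrix(c(1, 1, 1, 1,
                   1, 0, 1, 1,
                   2, 1, 3, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("good_bin", "incomplete_bin", "chimeric_bin"),
                                 paste0("COG", 1:4)))
    rowSums(scg >= 1) / ncol(scg)   # completeness: 1.00, 0.75, 1.00
    rowSums(scg == 1) / ncol(scg)   # purity: highest for the good bin, lowest for the chimeric bin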
68 | This RPS-BLAST search takes about an hour and a half for our dataset, so we're going to use a precomputed result file.
69 | Copy this result file along with two files necessary for the COG counting scripts::
70 | 
71 |     cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000_blast.out ~/binning-workshop/data/
72 |     cp /proj/g2014113/nobackup/concoct-workshop/scg_cogs_min0.97_max1.03_unique_genera.txt ~/binning-workshop/data/
73 |     cp /proj/g2014113/nobackup/concoct-workshop/cdd_to_cog.tsv ~/binning-workshop/data/
74 | 
75 | Before moving on, we need to install some R packages. Please run these commands line by line::
76 | 
77 |     R
78 |     install.packages("ggplot2")
79 |     install.packages("reshape")
80 |     install.packages("getopt")
81 |     q()
82 | 
83 | The CONCOCT distribution comes with two scripts, COG_table.py and COGPlot.R, for parsing this output and creating a plot where each COG present in the data is grouped according to the clustering results. These scripts are added to the virtual environment; check out their usage::
84 | 
85 |     COG_table.py -h
86 |     COGPlot.R -h
87 | 
88 | Let's first create a plot for the single sample run::
89 | 
90 |     COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_single_sample.tsv
91 |     COGPlot.R -s ~/binning-workshop/cog_table_3000_single_sample.tsv -o ~/binning-workshop/cog_plot_3000_single_sample.pdf
92 | 
93 | These commands might not work for some R-related reason. If you have spent more time trying to get them to work than you care to, just copy the results from the workshop directory::
94 | 
95 |     cp /proj/g2014113/nobackup/concoct-workshop/cogplots/* ~/binning-workshop/
96 | 
97 | 
98 | This should have created a pdf file with your plot. In order to look at it, you can download it to your personal computer with scp. Note that you need to run this in a separate terminal window where you are not logged in to Uppmax::
99 | 
100 |     scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_single_sample.pdf ~/Desktop/
101 | 
102 | Have a look at the plot and try to figure out if the clustering was successful or not. Which clusters are good? Which clusters are bad? Are all clusters present in the plot?
103 | Now, let's do the same thing for the multiple samples run::
104 | 
105 |     COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_all_samples.tsv
106 |     COGPlot.R -s ~/binning-workshop/cog_table_3000_all_samples.tsv -o ~/binning-workshop/cog_plot_3000_all_samples.pdf
107 | 
108 | And download again from your separate terminal window::
109 | 
110 |     scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_all_samples.pdf ~/Desktop
111 | 
112 | What differences can you observe for these plots? Think about how we were able to use samples not included in the assembly in order to create a different clustering result. Can this be done with any samples?
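If you want to dig a little deeper, the two clusterings can also be compared directly in R. This is only a suggested sketch; it assumes the clustering files are plain two-column contig,cluster tables without a header row (adjust the ``header`` argument if your CONCOCT version writes one)::

    clust_one = read.csv("~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv", header = FALSE)
    clust_all = read.csv("~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv", header = FALSE)
    # how many clusters were formed in each run?
    length(unique(clust_one[, 2]))
    length(unique(clust_all[, 2]))
    # how many contigs ended up in each cluster?
    sort(table(clust_one[, 2]), decreasing = TRUE)
    sort(table(clust_all[, 2]), decreasing = TRUE)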
113 | 114 | -------------------------------------------------------------------------------- /source/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Metagenomics Workshop SciLifeLab documentation build configuration file, created by 4 | # sphinx-quickstart on Tue May 6 17:38:39 2014. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | 18 | # If extensions (or modules to document with autodoc) are in another directory, 19 | # add these directories to sys.path here. If the directory is relative to the 20 | # documentation root, use os.path.abspath to make it absolute, like shown here. 21 | #sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | #needs_sphinx = '1.0' 27 | 28 | # Add any Sphinx extension module names here, as strings. They can be 29 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 30 | # ones. 31 | extensions = [] 32 | 33 | # Add any paths that contain templates here, relative to this directory. 34 | templates_path = ['_templates'] 35 | 36 | # The suffix of source filenames. 37 | source_suffix = '.rst' 38 | 39 | # The encoding of source files. 40 | #source_encoding = 'utf-8-sig' 41 | 42 | # The master toctree document. 43 | master_doc = 'index' 44 | 45 | # General information about the project. 46 | project = u'Metagenomics Workshop SciLifeLab' 47 | copyright = u'2014, Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson' 48 | 49 | # The version info for the project you're documenting, acts as replacement for 50 | # |version| and |release|, also used in various other places throughout the 51 | # built documents. 52 | # 53 | # The short X.Y version. 54 | version = '1.0' 55 | # The full version, including alpha/beta/rc tags. 56 | release = '1.0' 57 | 58 | # The language for content autogenerated by Sphinx. Refer to documentation 59 | # for a list of supported languages. 60 | #language = None 61 | 62 | # There are two options for replacing |today|: either, you set today to some 63 | # non-false value, then it is used: 64 | #today = '' 65 | # Else, today_fmt is used as the format for a strftime call. 66 | #today_fmt = '%B %d, %Y' 67 | 68 | # List of patterns, relative to source directory, that match files and 69 | # directories to ignore when looking for source files. 70 | exclude_patterns = [] 71 | 72 | # The reST default role (used for this markup: `text`) to use for all 73 | # documents. 74 | #default_role = None 75 | 76 | # If true, '()' will be appended to :func: etc. cross-reference text. 77 | #add_function_parentheses = True 78 | 79 | # If true, the current module name will be prepended to all description 80 | # unit titles (such as .. function::). 81 | #add_module_names = True 82 | 83 | # If true, sectionauthor and moduleauthor directives will be shown in the 84 | # output. They are ignored by default. 85 | #show_authors = False 86 | 87 | # The name of the Pygments (syntax highlighting) style to use. 
88 | pygments_style = 'sphinx' 89 | 90 | # A list of ignored prefixes for module index sorting. 91 | #modindex_common_prefix = [] 92 | 93 | # If true, keep warnings as "system message" paragraphs in the built documents. 94 | #keep_warnings = False 95 | 96 | 97 | # -- Options for HTML output ---------------------------------------------- 98 | 99 | # The theme to use for HTML and HTML Help pages. See the documentation for 100 | # a list of builtin themes. 101 | html_theme = 'default' 102 | 103 | # Theme options are theme-specific and customize the look and feel of a theme 104 | # further. For a list of options available for each theme, see the 105 | # documentation. 106 | #html_theme_options = {} 107 | 108 | # Add any paths that contain custom themes here, relative to this directory. 109 | #html_theme_path = [] 110 | 111 | # The name for this set of Sphinx documents. If None, it defaults to 112 | # " v documentation". 113 | #html_title = None 114 | 115 | # A shorter title for the navigation bar. Default is the same as html_title. 116 | #html_short_title = None 117 | 118 | # The name of an image file (relative to this directory) to place at the top 119 | # of the sidebar. 120 | #html_logo = None 121 | 122 | # The name of an image file (within the static path) to use as favicon of the 123 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 124 | # pixels large. 125 | #html_favicon = None 126 | 127 | # Add any paths that contain custom static files (such as style sheets) here, 128 | # relative to this directory. They are copied after the builtin static files, 129 | # so a file named "default.css" will overwrite the builtin "default.css". 130 | html_static_path = ['_static'] 131 | 132 | # Add any extra paths that contain custom files (such as robots.txt or 133 | # .htaccess) here, relative to this directory. These files are copied 134 | # directly to the root of the documentation. 135 | #html_extra_path = [] 136 | 137 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 138 | # using the given strftime format. 139 | #html_last_updated_fmt = '%b %d, %Y' 140 | 141 | # If true, SmartyPants will be used to convert quotes and dashes to 142 | # typographically correct entities. 143 | #html_use_smartypants = True 144 | 145 | # Custom sidebar templates, maps document names to template names. 146 | #html_sidebars = {} 147 | 148 | # Additional templates that should be rendered to pages, maps page names to 149 | # template names. 150 | #html_additional_pages = {} 151 | 152 | # If false, no module index is generated. 153 | #html_domain_indices = True 154 | 155 | # If false, no index is generated. 156 | #html_use_index = True 157 | 158 | # If true, the index is split into individual pages for each letter. 159 | #html_split_index = False 160 | 161 | # If true, links to the reST sources are added to the pages. 162 | #html_show_sourcelink = True 163 | 164 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 165 | #html_show_sphinx = True 166 | 167 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 168 | #html_show_copyright = True 169 | 170 | # If true, an OpenSearch description file will be output, and all pages will 171 | # contain a tag referring to it. The value of this option must be the 172 | # base URL from which the finished HTML is served. 173 | #html_use_opensearch = '' 174 | 175 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 
176 | #html_file_suffix = None 177 | 178 | # Output file base name for HTML help builder. 179 | htmlhelp_basename = 'MetagenomicsWorkshopSciLifeLabdoc' 180 | 181 | 182 | # -- Options for LaTeX output --------------------------------------------- 183 | 184 | latex_elements = { 185 | # The paper size ('letterpaper' or 'a4paper'). 186 | #'papersize': 'letterpaper', 187 | 188 | # The font size ('10pt', '11pt' or '12pt'). 189 | #'pointsize': '10pt', 190 | 191 | # Additional stuff for the LaTeX preamble. 192 | #'preamble': '', 193 | } 194 | 195 | # Grouping the document tree into LaTeX files. List of tuples 196 | # (source start file, target name, title, 197 | # author, documentclass [howto, manual, or own class]). 198 | latex_documents = [ 199 | ('index', 'MetagenomicsWorkshopSciLifeLab.tex', u'Metagenomics Workshop SciLifeLab Documentation', 200 | u'Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson', 'manual'), 201 | ] 202 | 203 | # The name of an image file (relative to this directory) to place at the top of 204 | # the title page. 205 | #latex_logo = None 206 | 207 | # For "manual" documents, if this is true, then toplevel headings are parts, 208 | # not chapters. 209 | #latex_use_parts = False 210 | 211 | # If true, show page references after internal links. 212 | #latex_show_pagerefs = False 213 | 214 | # If true, show URL addresses after external links. 215 | #latex_show_urls = False 216 | 217 | # Documents to append as an appendix to all manuals. 218 | #latex_appendices = [] 219 | 220 | # If false, no module index is generated. 221 | #latex_domain_indices = True 222 | 223 | 224 | # -- Options for manual page output --------------------------------------- 225 | 226 | # One entry per manual page. List of tuples 227 | # (source start file, name, description, authors, manual section). 228 | man_pages = [ 229 | ('index', 'metagenomicsworkshopscilifelab', u'Metagenomics Workshop SciLifeLab Documentation', 230 | [u'Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson'], 1) 231 | ] 232 | 233 | # If true, show URL addresses after external links. 234 | #man_show_urls = False 235 | 236 | 237 | # -- Options for Texinfo output ------------------------------------------- 238 | 239 | # Grouping the document tree into Texinfo files. List of tuples 240 | # (source start file, target name, title, author, 241 | # dir menu entry, description, category) 242 | texinfo_documents = [ 243 | ('index', 'MetagenomicsWorkshopSciLifeLab', u'Metagenomics Workshop SciLifeLab Documentation', 244 | u'Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson', 'MetagenomicsWorkshopSciLifeLab', 'One line description of project.', 245 | 'Miscellaneous'), 246 | ] 247 | 248 | # Documents to append as an appendix to all manuals. 249 | #texinfo_appendices = [] 250 | 251 | # If false, no module index is generated. 252 | #texinfo_domain_indices = True 253 | 254 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 255 | #texinfo_show_urls = 'footnote' 256 | 257 | # If true, do not generate a @detailmenu in the "Top" node's menu. 
258 | #texinfo_no_detailmenu = False
259 | 
--------------------------------------------------------------------------------
/source/annotation/differential.rst:
--------------------------------------------------------------------------------
1 | ======================================================================
2 | Estimating differentially abundant protein families in the metagenomes
3 | ======================================================================
4 | Finally, we are about to do some real analysis of the data, and look
5 | at the results! To do this, we will use the R statistical program.
6 | You start the program by typing::
7 | 
8 |     R
9 | 
10 | To get out of R, you type ``q()``. You will then be asked if you want
11 | to save your workspace. Typing "y" (yes) might be smart, since that
12 | will remember all your variables until the next time you use R in the
13 | same directory!
14 | 
15 | Loading the count tables
16 | ============================
17 | 
18 | We will begin by loading the count tables from HMMER into R::
19 | 
20 |     b1 = read.table("baltic1.hmmsearch", sep = "")
21 | 
22 | To get the number of entries of each kind, we will use the R command ``rle``.
23 | We want to get the domain list, which is the third column. For ``rle`` to be
24 | able to work with the data, we must also convert it into a proper vector::
25 | 
26 |     raw_counts = rle(as.vector(b1[,3]))
27 |     b1_counts = as.matrix(raw_counts$lengths)
28 |     row.names(b1_counts) = raw_counts$values
29 | 
30 | Repeat this procedure for all four data sets.
31 | 
32 | Apply normalizations
33 | ====================
34 | 
35 | We will now try out the three different normalization methods to see their
36 | effect on the data. First, we will try normalizing to the number of reads
37 | in each sequencing library. Find the note you have taken on the data set sizes.
38 | Then apply a command like this on the data::
39 | 
40 |     b1_norm1 = b1_counts / 118025
41 | 
42 | You will now see very small numbers (on the order of 10^-5 to 10^-6). To make these
43 | numbers more interpretable, let's also multiply them by 1,000,000 to yield the counts
44 | per million reads::
45 | 
46 |     b1_norm1 = b1_counts / 118025 * 1000000
47 | 
48 | Do the same thing for the other data sets.
49 | 
50 | We would then like to compare all four data sets to each other. Since R's
51 | ``merge`` function only handles two data sets at a time, I have provided this
52 | function for merging four data sets. Copy and paste it into the R console::
53 | 
54 |     merge_four = function(a,b,c,d,names) {
55 |       m1 = merge(a,b,by = "row.names", all = TRUE)
56 |       row.names(m1) = m1[,1]
57 |       m1 = m1[,2:3]
58 |       m2 = merge(c, m1, by = "row.names", all = TRUE)
59 |       row.names(m2) = m2[,1]
60 |       m2 = m2[,2:4]
61 |       m3 = merge(d, m2, by = "row.names", all = TRUE)
62 |       row.names(m3) = m3[,1]
63 |       m3 = m3[,2:5]
64 |       m3[is.na(m3)] = 0
65 |       colnames(m3) = c(names[4], names[3], names[1], names[2])
66 |       return(as.matrix(m3))
67 |     }
68 | 
69 | You can then try it by running this command on the raw counts::
70 | 
71 |     norm0 = merge_four(b1_counts,b2_counts,swe_counts,ind_counts,c("Baltic 1","Baltic 2","Sweden", "India"))
72 | 
73 | You should then see a matrix containing all counts from the four data
74 | sets, with each row corresponding to a Pfam family. Next, run the same
75 | command on the normalized data and store the output into a variable, called
76 | for example ``norm1``.
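Completing that step for the other three libraries could look something like the sketch below; the library sizes used here are placeholders only, so substitute the read counts from your own notes::

    # placeholder library sizes, replace them with the numbers from your notes
    b2_reads  = 100000
    swe_reads = 100000
    ind_reads = 100000
    b2_norm1  = b2_counts  / b2_reads  * 1000000
    swe_norm1 = swe_counts / swe_reads * 1000000
    ind_norm1 = ind_counts / ind_reads * 1000000
    norm1 = merge_four(b1_norm1, b2_norm1, swe_norm1, ind_norm1,
                       c("Baltic 1", "Baltic 2", "Sweden", "India"))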
The total abundance of mobility domains can then be
77 | visualized using the following command::
78 | 
79 |     barplot(colSums(norm1))
80 | 
81 | We can then repeat the normalization procedure, by instead normalizing to
82 | the number of 16S rRNA counts in each library. This can be done similarly
83 | to the division by the total number of reads above::
84 | 
85 |     b1_norm2 = b1_counts / 21
86 | 
87 | This time, we won't multiply by a million, as that would make the numbers
88 | much larger (and harder to interpret).
89 | 
90 | Follow the above procedure for all the data sets, and finally store the
91 | end result from ``merge_four`` into a variable, for example called ``norm2``.
92 | 
93 | Finally, we will do the same for the third type of normalization, the
94 | division by the mapped number of reads. This can, once more, be done as
95 | above::
96 | 
97 |     b1_norm3 = b1_counts / 22
98 | 
99 | Follow the above procedure for all the data sets, and store the final
100 | result from ``merge_four`` into a variable, for example called ``norm3``.
101 | 
102 | A note on saving plots
103 | ======================
104 | Note that if you would like to save your plots to a PDF file, you can run
105 | the command::
106 | 
107 |     pdf("output_file_name.pdf", width = 10, height = 10)
108 | 
109 | and then run all the R commands as normal. Instead of getting
110 | plots printed on the screen, all the plots will be written to the specified
111 | PDF file, and can later be viewed in e.g. Acrobat Reader. When you are
112 | finished plotting you can finalize the PDF file using the command::
113 | 
114 |     dev.off()
115 | 
116 | This closes the PDF and enables other software to read it. Please note that
117 | it will be considered a "broken" PDF until the ``dev.off()`` command is run!
118 | 
119 | Comparing normalizations
120 | ========================
121 | 
122 | Let us now quickly compare the normalization methods. As a quick
123 | overview, we can make four colorful barplots next to each other, one for the
124 | raw counts and one for each normalization method::
125 | 
126 |     layout(matrix(c(1,3,2,4),2,2))
127 |     barplot(norm0, col = 1:nrow(norm1), main = "Raw gene counts")
128 |     barplot(norm1, col = 1:nrow(norm1), main = "Counts per million reads")
129 |     barplot(norm2, col = 1:nrow(norm2), main = "Counts per 16S rRNA")
130 |     barplot(norm3, col = 1:nrow(norm3), main = "Relative abundance")
131 | 
132 | As you can see, each of these plots will tell a slightly different story.
133 | Let's take a closer look at how normalization affects the behavior of some
134 | genes. First, we can see if there are any genes that are present in all
135 | samples. This is easily investigated by the following command, which checks
136 | whether each value is larger than zero, counts the number of such occurrences
137 | per row (``rowSums``), and finally outputs all the rows from ``norm1`` where
138 | this sum is exactly four::
139 | 
140 |     norm1[rowSums(norm1 > 0) == 4,]
141 | 
142 | If that didn't give you much luck, you can check whether you can find any genes
143 | that occur in at least three samples::
144 | 
145 |     norm1[rowSums(norm1 > 0) >= 3,]
146 | 
147 | Select one of those and find out its row number in the count table.
148 | Hint: ``row.names(norm1)`` will help you here!
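One way of doing this (the family name below is just a hypothetical example, use one that actually showed up in your own output) could be::

    # list the names and row numbers of the families present in at least three samples
    which(rowSums(norm1 > 0) >= 3)
    # or look up the row number of a particular family by name
    x = which(row.names(norm1) == "ABC_tran")   # hypothetical family name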
Now let's make barplots for
149 | that row only::
150 | 
151 |     x = <the row number you selected>
152 |     layout(matrix(c(1,3,2,4),2,2))
153 |     barplot(norm0[x,], main = paste(row.names(norm1)[x], "- Raw gene counts"))
154 |     barplot(norm1[x,], main = paste(row.names(norm1)[x], "- Counts per million reads"))
155 |     barplot(norm2[x,], main = paste(row.names(norm2)[x], "- Counts per 16S rRNA"))
156 |     barplot(norm3[x,], main = paste(row.names(norm3)[x], "- Relative abundance"))
157 | 
158 | You can now try this for a number of other genes (by changing the value of
159 | ``x``) and see how normalization affects your story.
160 | 
161 | **Question: Which normalization method would be most suitable to use in this case? Why?**
162 | 
163 | 
164 | Visualizing differences in gene abundance
165 | =========================================
166 | 
167 | One neat way of visualizing metagenomic count data is through heatmaps. R has a built-in
168 | heatmap function, which can be called using the (surprise...) ``heatmap`` command.
169 | However, you will quickly notice that this function is rather limited, and we will
170 | therefore install a package containing a better one: the ``gplots`` package. You can do
171 | this by typing the following command::
172 | 
173 |     install.packages("gplots")
174 | 
175 | Just answer "yes" to the questions, and the package will be installed locally for your
176 | user. After installation you load the package by typing::
177 | 
178 |     library(gplots)
179 | 
180 | After this, you will be able to use the more powerful ``heatmap.2`` command. Try,
181 | for example, this command on the data::
182 | 
183 |     heatmap.2(norm1, trace = "none", col = colorpanel(255,"black","red","yellow"), margin = c(5,10), cexCol = 1, cexRow = 0.7)
184 | 
185 | The trace, margin, cexCol and cexRow options are just there to make the plot look better
186 | (play around with them if you wish). The ``col = colorpanel(255,"black","red","yellow")``
187 | option creates a scale from black to yellow, where yellow means highly abundant and black
188 | lowly abundant. To make it clearer which genes are not detected at all, let's add a
189 | grey color for genes with zero counts::
190 | 
191 |     heatmap.2(norm1, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
192 | 
193 | You will now notice that it is hard to see the differences for the lowly abundant genes.
194 | To aid in this, we can apply a variance-stabilizing transform (a fancy name for the square root)
195 | to the data::
196 | 
197 |     norm1_sqrt = sqrt(norm1)
198 | 
199 | You can then re-run the ``heatmap.2`` command on the newly created ``norm1_sqrt``
200 | variable.
201 | 
202 | Sometimes it makes more sense to apply a logarithmic transform to the data instead of
203 | the square root. This, however, is a bit more tricky since we have zeros in the data.
204 | For fun's sake, we can try::
205 | 
206 |     norm1_log10 = log10(norm1)
207 |     heatmap.2(norm1_log10, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
208 | 
209 | This should give you an error message. The easiest way to solve this problem is to add
210 | some small number to the matrix before the ``log10`` command. Since we will display this
211 | number with a grey color anyway, it will not, in this case and for this application, matter
212 | much exactly what number you add.
You can, for example, choose 1::
213 | 
214 |     norm1_log10 = log10(norm1 + 1)
215 |     heatmap.2(norm1_log10, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
216 | 
217 | Before we end, let's also try another kind of commonly used visualization, the PCA plot.
218 | Principal Component Analysis (PCA) essentially works by projecting complex data onto a
219 | 2D (or 3D) surface, while trying to separate the data points as much as possible. This
220 | can be useful for finding groups of observations that fit together. We will use the built-in
221 | PCA command called ``prcomp``::
222 | 
223 |     norm1_pca = prcomp(norm1_sqrt)
224 | 
225 | Note that we used the data created with the variance stabilizing transform. There are more
226 | sophisticated ways of reducing the influence of very large values, but often the
227 | square root is sufficient. We can visualize the PCA using a plotting command called ``biplot``::
228 | 
229 |     layout(1)
230 |     biplot(norm1_pca, cex = 0.5)
231 | 
232 | To see the proportion of variance explained by the different components, we can use the
233 | normal plot command::
234 | 
235 |     plot(norm1_pca)
236 | 
237 | We want the first two bars to be as large as possible, since that means that the dataset
238 | can easily be simplified to two dimensions. If all bars are of roughly equal height, the
239 | projection onto a 2D surface has lost much of the information in the data, and
240 | we cannot trust the patterns in the PCA plot as much.
241 | 
242 | If we do the PCA on the relative abundance data (normalization three), we get a view
243 | of which Pfam domains dominate in these samples::
244 | 
245 |     norm3_pca = prcomp(norm3)
246 |     biplot(norm3_pca, cex = 0.5)
247 | 
248 | And that's the end of the lab. If you have lots of time to spare, you can move on to the
249 | bonus exercise, in which we will analyze the 16S rRNA data generated by Metaxa2 further,
250 | to understand which bacterial species are present in the samples.
251 | 
--------------------------------------------------------------------------------