├── .gitignore ├── mako ├── templates │ ├── hello.txt │ └── binning │ │ ├── index.rst │ │ ├── setup.rst │ │ ├── phylosift.rst │ │ └── concoct.rst └── settings.yaml ├── source ├── assembly │ ├── index.rst │ ├── qtrim.rst │ ├── map.rst │ ├── assembly.rst │ └── reqs.rst ├── binning │ ├── index.rst │ ├── setup.rst │ ├── phylosift.rst │ └── concoct.rst ├── comparative-taxonomic-analysis │ ├── index.rst │ ├── krona.rst │ ├── rrna.rst │ └── compare.rst ├── annotation │ ├── index.rst │ ├── annotation.rst │ ├── translation.rst │ ├── software.rst │ ├── normalization.rst │ ├── metaxa2.rst │ └── differential.rst ├── comparative-functional-analysis │ ├── index.rst │ ├── annotation.rst │ ├── genefinding.rst │ ├── genecoverage.rst │ └── compare.rst ├── index.rst └── conf.py ├── run_mako.py ├── README.rst └── Makefile /.gitignore: -------------------------------------------------------------------------------- 1 | build 2 | -------------------------------------------------------------------------------- /mako/templates/hello.txt: -------------------------------------------------------------------------------- 1 | hello world! 2 | -------------------------------------------------------------------------------- /source/assembly/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Assembly Workshop 2 | ================================================= 3 | In this metagenomics workshop we will learn how to: 4 | 5 | - Quality trim reads with sickle 6 | - Perform assemblies with velvet 7 | - Map back reads to assemblies with bowtie2 8 | 9 | The workshop has the following exercises: 10 | 11 | .. toctree:: 12 | :maxdepth: 2 13 | 14 | reqs 15 | qtrim 16 | assembly 17 | map 18 | 19 | At least a basic knowledge of how to work with the command line is required 20 | otherwise it will be very difficult to follow some of the examples. Have 21 | fun! 22 | -------------------------------------------------------------------------------- /mako/templates/binning/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Binning Workshop 2 | ================================================= 3 | In this metagenomics workshop we will learn how to: 4 | 5 | - Perform unsupervised binning with concoct 6 | - Evaluate binning performance 7 | 8 | The workshop has the following exercises: 9 | 10 | 1. Setup Environment 11 | 2. Running Concoct 12 | 3. Evaluate Clustering Using Single Copy Genes 13 | 4. Phylogenetic Classification using Phylosift 14 | 15 | At least a basic knowledge of how to work with the command line is required 16 | otherwise it will be very difficult to follow some of the examples. Have 17 | fun! 18 | 19 | Contents: 20 | 21 | .. toctree:: 22 | :maxdepth: 2 23 | 24 | setup 25 | concoct 26 | phylosift 27 | -------------------------------------------------------------------------------- /source/binning/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Binning Workshop 2 | ================================================= 3 | In this metagenomics workshop we will learn how to: 4 | 5 | - Perform unsupervised binning with concoct 6 | - Evaluate binning performance 7 | 8 | The workshop has the following exercises: 9 | 10 | 1. Setup Environment 11 | 2. Running Concoct 12 | 3. Evaluate Clustering Using Single Copy Genes 13 | 4. 
Phylogenetic Classification using Phylosift 14 | 15 | At least a basic knowledge of how to work with the command line is required 16 | otherwise it will be very difficult to follow some of the examples. Have 17 | fun! 18 | 19 | Contents: 20 | 21 | .. toctree:: 22 | :maxdepth: 2 23 | 24 | setup 25 | concoct 26 | phylosift 27 | 28 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/index.rst: -------------------------------------------------------------------------------- 1 | ======================================================= 2 | Comparative Taxonomic Analysis Workshop 3 | ======================================================= 4 | In this workshop we will do a comparative analysis of several Baltic Sea 5 | samples on taxonomy. The following topics will be discussed: 6 | 7 | - Taxonomic annotation of rRNA reads using sortmeRNA 8 | - Visualizing taxonomy with KRONA 9 | - Comparative taxonomic analysis in R 10 | 11 | The workshop has the following exercises: 12 | 13 | 1. rRNA Annotation Exercise 14 | 2. KRONA Exercise 15 | 3. Comparative Taxonomic Analysis Exercise 16 | 17 | Contents: 18 | 19 | .. toctree:: 20 | :maxdepth: 2 21 | 22 | rrna 23 | krona 24 | compare 25 | -------------------------------------------------------------------------------- /source/annotation/index.rst: -------------------------------------------------------------------------------- 1 | Metagenomic Annotation Workshop 2 | ================================================= 3 | In this part of the metagenomics workshop we will: 4 | 5 | - Translate nucleotides into amino acid sequences using EMBOSS 6 | - Annotate metagenomic reads with Pfam domains 7 | - Discuss and perform normalization of metagenomic counts 8 | - Take a look at different gene abundance analyses 9 | 10 | The workshop has the following exercises: 11 | 12 | 1. Translation Exercise 13 | 2. HMMER Exercise 14 | 3. Normalization Exercise 15 | 4. Differential Exercise 16 | 5. Bonus Exercise: Metaxa2 17 | 18 | At least a basic knowledge of how to work with the command line is required 19 | otherwise it will be very difficult to follow some of the examples. Have 20 | fun! 21 | 22 | Contents: 23 | 24 | .. toctree:: 25 | :maxdepth: 2 26 | 27 | software 28 | translation 29 | annotation 30 | normalization 31 | differential 32 | metaxa2 33 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/index.rst: -------------------------------------------------------------------------------- 1 | ======================================================= 2 | Comparative Functional Analysis Workshop 3 | ======================================================= 4 | In this workshop we will do a comparative analysis of several Baltic Sea 5 | samples on function. The following topics will be discussed: 6 | 7 | - Find genes on a coassembly of all samples using Prodigal 8 | - Classify the found genes with similar function in Clusters of Orthologous 9 | Groups (COG) using WebMGA 10 | - Compare the expression of the different COG families and classes by looking 11 | at their coverage in different samples using R 12 | 13 | The workshop has the following exercises: 14 | 15 | 1. Gene Finding Exercise 16 | 2. COG Exercise 17 | 3. Comparative Functional Analysis Exercise 18 | 19 | Contents: 20 | 21 | .. 
toctree:: 22 | :maxdepth: 2 23 | 24 | genefinding 25 | annotation 26 | genecoverage 27 | compare 28 | -------------------------------------------------------------------------------- /source/index.rst: -------------------------------------------------------------------------------- 1 | .. Metagenomics Workshop SciLifeLab documentation master file, created by 2 | sphinx-quickstart on Tue May 6 17:38:39 2014. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to the Metagenomics Workshop at SciLifeLab, Stockholm 7 | ============================================================= 8 | 9 | This is a three day metagenomics workshop. We will discuss assembly, binning 10 | and annotation of metagenomic samples. 11 | 12 | Program: 13 | 14 | * Day 1 Time 14-17 15 | * :doc:`assembly/index` 16 | * :doc:`comparative-functional-analysis/index` 17 | * :doc:`comparative-taxonomic-analysis/index` 18 | * Day 2 Time 14-17 19 | * :doc:`binning/index` 20 | * Day 3 Time 13.30-17 21 | * :doc:`annotation/index` 22 | 23 | 24 | Contents: 25 | 26 | .. toctree:: 27 | :maxdepth: 2 28 | 29 | assembly/index 30 | comparative-functional-analysis/index 31 | comparative-taxonomic-analysis/index 32 | binning/index 33 | annotation/index 34 | 35 | Enjoy! 36 | 37 | For questions: 38 | 39 | * Johannes Alneberg at sclifelab dot se 40 | * Johan dot Bengtsson-Palme at gu dot se 41 | * Ino de Bruijn at bils dot se 42 | * Luisa Hugerth at scilifelab dot se 43 | 44 | The code that generates this workshop is available at: 45 | 46 | https://github.com/inodb/2014-5-metagenomics-workshop 47 | -------------------------------------------------------------------------------- /source/binning/setup.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Setup Environment 3 | ========================================== 4 | This workshop will be using the same environment used for the assembly workshop. If you did not participate in the assembly workshop, please have a look at the introductory setup description for that. 5 | 6 | Programs used in this workshop 7 | ============================== 8 | The following programs are used in this workshop: 9 | 10 | - CONCOCT_ 11 | - Phylosift_ 12 | - Blast_ 13 | 14 | .. _CONCOCT: http://github.com/BinPro/CONCOCT 15 | .. _Phylosift: http://phylosift.wordpress.com/ 16 | .. _BLAST: http://blast.ncbi.nlm.nih.gov/ 17 | 18 | All programs and scripts that you need for this workshop are already installed, all you have to do is load the virtual 19 | environment. Once you are logged in to the server run:: 20 | 21 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 22 | 23 | If you'd wish to inactivate this virtual environment you could run:: 24 | 25 | deactivate # Don't run this now 26 | 27 | NOTE: This is a python virtual environment. The binary folder of the virtual 28 | environment has symbolic links to all programs used in this workshop so you 29 | should be able to run those without problems. 30 | 31 | Check that the programs are available 32 | ===================================== 33 | After you have activated the virtual environment the following commands should execute properly and you should be able to see some brief instructions on how to run the different programs respectively. 
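If you would rather check everything in one go, a short shell loop over the expected binaries works as well (a sketch only; the two program names are taken from the checks just below, and anything not on your ``PATH`` will be reported as missing)::

    for prog in concoct rpsblast
    do
        if which "$prog" > /dev/null
        then echo "$prog: found"
        else echo "$prog: NOT found"
        fi
    done
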
34 | 35 | CONCOCT:: 36 | 37 | concoct -h 38 | 39 | 40 | BLAST:: 41 | 42 | rpsblast --help 43 | 44 | 45 | -------------------------------------------------------------------------------- /mako/templates/binning/setup.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Setup Environment 3 | ========================================== 4 | This workshop will be using the same environment used for the assembly workshop. If you did not participate in the assembly workshop, please have a look at the introductory setup description for that. 5 | 6 | Programs used in this workshop 7 | ============================== 8 | The following programs are used in this workshop: 9 | 10 | - CONCOCT_ 11 | - Phylosift_ 12 | - Blast_ 13 | 14 | .. _CONCOCT: http://github.com/BinPro/CONCOCT 15 | .. _Phylosift: http://phylosift.wordpress.com/ 16 | .. _BLAST: http://blast.ncbi.nlm.nih.gov/ 17 | 18 | All programs and scripts that you need for this workshop are already installed, all you have to do is load the virtual 19 | environment. Once you are logged in to the server run:: 20 | 21 | ${commands['activate']} 22 | 23 | If you'd wish to inactivate this virtual environment you could run:: 24 | 25 | deactivate # Don't run this now 26 | 27 | NOTE: This is a python virtual environment. The binary folder of the virtual 28 | environment has symbolic links to all programs used in this workshop so you 29 | should be able to run those without problems. 30 | 31 | Check that the programs are available 32 | ===================================== 33 | After you have activated the virtual environment the following commands should execute properly and you should be able to see some brief instructions on how to run the different programs respectively. 34 | 35 | CONCOCT:: 36 | 37 | ${commands['check_activate']['concoct']} 38 | 39 | 40 | BLAST:: 41 | 42 | ${commands['check_activate']['rpsblast']} 43 | 44 | -------------------------------------------------------------------------------- /run_mako.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function 3 | from mako.template import Template 4 | from mako.lookup import TemplateLookup 5 | import yaml 6 | import argparse 7 | import os 8 | 9 | FILE_PATH = os.path.dirname(__file__) 10 | 11 | TEMPLATES = ['binning/index.rst', 12 | 'binning/setup.rst', 13 | 'binning/concoct.rst', 14 | 'binning/phylosift.rst'] 15 | 16 | def main(args): 17 | with open(os.path.join(args.input_path, "settings.yaml")) as settings_file: 18 | settings = yaml.load(settings_file) 19 | 20 | lookup = TemplateLookup(directories=[args.template_path]) 21 | for t in TEMPLATES: 22 | template = lookup.get_template(t) 23 | print(template.render(**settings), 24 | file=open(os.path.join(args.output_path, t), 'w')) 25 | 26 | def sanitize_input(args): 27 | assert os.path.isdir(args.output_path) 28 | assert os.path.isdir(args.input_path) 29 | assert os.path.isdir(args.template_path) 30 | 31 | if __name__ == '__main__': 32 | parser = argparse.ArgumentParser() 33 | parser.add_argument('-o', '--output_path', default='source/', 34 | help=('Path to where rendered templates will be written.' 35 | ' Any existing files with the same file names will ' 36 | 'be over written. Default = sources')) 37 | parser.add_argument('-i', '--input_path', default='mako/', 38 | help=('Path where the raw mako settings ' 39 | 'files are stored. 
Default = mako')) 40 | parser.add_argument('-t', '--template_path', default=None, 41 | help=('Path where the raw mako templates are stored. ' 42 | 'Default = <input_path>/templates')) 43 | args = parser.parse_args() 44 | if args.template_path is None: 45 | args.template_path = os.path.join(args.input_path, 'templates') 46 | sanitize_input(args) 47 | main(args) 48 | -------------------------------------------------------------------------------- /source/annotation/annotation.rst: -------------------------------------------------------------------------------- 1 | ================================================================ 2 | Search amino acid sequences with HMMER against the Pfam database 3 | ================================================================ 4 | It is time to do the actual Pfam annotation of our metagenomes! 5 | 6 | 7 | Running ``hmmsearch`` on the translated sequence data sets 8 | ========================================================== 9 | Before we run ``hmmsearch``, we will look at its available options:: 10 | 11 | hmmsearch -h 12 | 13 | As you will see, the program takes a substantial number of arguments. 14 | In this workshop we will work with the table output from HMMER, which 15 | you get by specifying the ``--tblout`` option together with a file 16 | name. We also want to make sure that we only get statistically 17 | relevant matches, which we can do using the E-value option. The 18 | E-value (Expect-value) is an estimation of how often we would expect 19 | to find a similar hit by chance, given the size of the database. To 20 | avoid getting a lot of noise matches, we will specify an E-value of 21 | 10^-5, that is, we would only expect to get a match with a similarly good 22 | alignment by chance in 1 out of 100,000 cases. This can be set with the ``-E 1e-5`` 23 | option. Finally, to speed up the process a little, we will use the 24 | ``--cpu`` option to get multi-core support. On the Uppmax machines you can 25 | use up to 16 cores for the HMMER runs. 26 | 27 | To specify the HMM-file database and the input data set, we just type in 28 | the names of those two files at the end of the command. Finally, we add in 29 | the ``> /dev/null`` string, to avoid getting the screen cluttered with 30 | sequence alignments that HMMER outputs. That should give us the following 31 | command:: 32 | 33 | hmmsearch --tblout <output table file> -E 1e-5 --cpu 8 ~/Pfam/Pfam-mobility.hmm <input file> > /dev/null 34 | 35 | Now run this command on each of the four translated files that we just created. When the 36 | command has finished for all files, we can move on to the normalization exercise. 37 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/krona.rst: -------------------------------------------------------------------------------- 1 | =============================== 2 | Visualising taxonomy with KRONA 3 | =============================== 4 | To get a graphical representation of the taxonomic classifications you can use 5 | KRONA, which is an excellent program for exploring data with hierarchical 6 | structures in general. The output file is an html file that can be viewed in a 7 | browser. 
Again make a directory for KRONA:: 8 | 9 | mkdir -p ~/metagenomics/cta/krona 10 | cd ~/metagenomics/cta/krona 11 | 12 | And run KRONA, concatenating the archaea and bacteria class files for each sample on the fly and providing the name of the sample, like this:: 13 | 14 | ktImportRDP \ 15 | <(cat ../rdp/0328_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0328_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0328 \ 16 | <(cat ../rdp/0403_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0403_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0403 \ 17 | <(cat ../rdp/0423_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0423_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0423 \ 18 | <(cat ../rdp/0531_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0531_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0531 \ 19 | <(cat ../rdp/0619_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0619_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0619 \ 20 | <(cat ../rdp/0705_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0705_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0705 \ 21 | <(cat ../rdp/0709_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/0709_rrna.silva-bac-16s-database-id85.fasta.class.tsv),0709 \ 22 | <(cat ../rdp/1001_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1001_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1001 \ 23 | <(cat ../rdp/1004_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1004_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1004 \ 24 | <(cat ../rdp/1028_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1028_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1028 \ 25 | <(cat ../rdp/1123_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/1123_rrna.silva-bac-16s-database-id85.fasta.class.tsv),1123 26 | 27 | The ``<()`` in bash can be used for process substitution 28 | (http://tldp.org/LDP/abs/html/process-sub.html). Just for your information, 29 | the above command was actually generated with the following commands:: 30 | 31 | cmd=`echo ktImportRDP; for s in ${samplenames[*]}; do echo '<('cat ../rdp/${s}_rrna.silva-arc-16s-database-id95.fasta.class.tsv ../rdp/${s}_rrna.silva-bac-16s-database-id85.fasta.class.tsv')',$s; done` 32 | echo $cmd 33 | 34 | Copy the resulting file rdp.krona.html to your local computer with scp and open it in Firefox. 35 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/annotation.rst: -------------------------------------------------------------------------------- 1 | =============================== 2 | Functional annotation 3 | =============================== 4 | Now that we have extracted the genes/proteins, we want to functionally annotate 5 | them. There are a bunch of ways of doing this. We will use webMGA to do 6 | rpsBLAST searches against the COG database. COGs are clusters of orthologous 7 | genes, i.e. evolutionary counterparts in different species, usually with the 8 | same function (http://www.ncbi.nlm.nih.gov/COG/). Many COGs have known 9 | functions and the COGs are also grouped at a higher level into functional 10 | classes. 11 | 12 | To download the protein sequences that Prodigal generated, open a local 13 | terminal and type:: 14 | 15 | mkdir -p ~/metagenomics/cfa/prodigal 16 | cd ~/metagenomics/cfa/prodigal 17 | scp username@milou.uppmax.uu.se:~/metagenomics/cfa/prodigal/baltic-sea-ray-noscaf-41.1000.aa.fa . 
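Before uploading anything, it can be worth a quick sanity check that the file arrived intact, for example by counting the fasta headers on your own computer and on milou and comparing the two numbers (a sketch; ``grep -c '>'`` simply counts the sequences in a fasta file)::

    grep -c '>' baltic-sea-ray-noscaf-41.1000.aa.fa
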
18 | 19 | To get COG classifications of your proteins, go to webMGA 20 | http://weizhong-lab.ucsd.edu/metagenomic-analysis/ and select Server / 21 | Function annotation / COG. Upload the protein file 22 | (``baltic-sea-ray-noscaf-41.1000.aa.fa``) and use the default -e value cutoff. 23 | rpsBLAST is used, which is a BLAST based on position specific scoring matrices 24 | (pssm). For each COG, one such pssm has been constructed. These are compiled 25 | into a database of profiles that is searched against. 26 | http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/node20.html. rpsBLAST is 27 | more sensitive than a normal BLAST, which is important if genomes in your 28 | metagenome are distant from existing sequences in databases. It is also faster 29 | than searching against all proteins out there. 30 | 31 | When the search is done you get a zipped folder. On milou, create the 32 | directory:: 33 | 34 | mkdir -p ~/metagenomics/cfa/wmga-cog 35 | 36 | Use wget or curl to download the zip file on uppmax or use scp to upload it to 37 | that folder i.e.:: 38 | 39 | scp output.zip username@milou.uppmax.uu.se:~/metagenomics/cfa/wmga-cog 40 | 41 | Then unzip the file on kalkyl:: 42 | 43 | cd ~/metagenomics/cfa/wmga-cog 44 | unzip output.zip 45 | 46 | Have a look at the README.txt to see what all the files represent. The file 47 | output.2 includes detailed information on the classifications for every protein 48 | with a hit below the -e value cutoff. View them with:: 49 | 50 | less README.txt 51 | less -S output.2 52 | 53 | NOTE: If the queueing takes too much time you can also just copy the results 54 | from the project dir:: 55 | 56 | cp -r /proj/g2014113/metagenomics/cfa/wmga-cog/ ~/metagenomics/cfa/ 57 | 58 | **Question: What seem to be the 3 most abundant COG classes in our combined 59 | sample (not taking coverage into account)?** 60 | 61 | .. less output.2.class | tail -n +2 | sort -nk2,2 | tail -3 62 | J 1895 Translation, ribosomal structure and biogenesis 63 | R 2031 General function prediction only 64 | E 2308 Amino acid transport and metabolism 65 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/rrna.rst: -------------------------------------------------------------------------------- 1 | ============================================================== 2 | Extracting rRNA encoding reads and annotating them 3 | ============================================================== 4 | Taxonomic composition of a sample can be based on e.g. BLASTing the contigs 5 | against a database of reference genomes, or by utilising rRNA sequences. 6 | Usually assembly doesn’t work well for rRNA genes due to their highly conserved 7 | regions, therefore extracting rRNA from contigs will miss a lot of the 8 | taxonomic information that can be obtained by analysing the reads directly. 9 | Analysing the reads also has the advantage of being quantitative, i.e. we don’t 10 | need to calculate coverages by the mapping procedure we applied for the 11 | functional genes above. We will extract rRNA encoding reads with the program 12 | sortmeRNA which is one of the fastest software solutions for this. The program 13 | sortmeRNA has built-in multithreading support so this time we use that for 14 | parallelization. 
These are the commands to run:: 15 | 16 | mkdir -p ~/metagenomics/cta/sortmerna 17 | cd ~/metagenomics/cta/sortmerna 18 | samplenames=(0328 0403 0423 0531 0619 0705 0709 1001 1004 1028 1123) 19 | for s in ${samplenames[*]} 20 | do sortmerna -n 2 --db \ 21 | /proj/g2014113/src/sortmerna-1.9/rRNA_databases/silva-arc-16s-database-id95.fasta \ 22 | /proj/g2014113/src/sortmerna-1.9/rRNA_databases/silva-bac-16s-database-id85.fasta \ 23 | --I /proj/g2014113/metagenomics/cta/reads/${s}_pe.fasta \ 24 | --accept ${s}_rrna \ 25 | --other ${s}_nonrrna \ 26 | --bydbs \ 27 | -a 8 \ 28 | --log ${s}_bilan \ 29 | -m 5242880 30 | done 31 | 32 | Again, this command takes rather long to run (~5m per sample) so just copy the results if you don’t feel like waiting:: 33 | 34 | cp /proj/g2014113/metagenomics/cta/sortmerna/* ~/metagenomics/cta/sortmerna 35 | 36 | It outputs the reads or part of reads that encode rRNA in a fasta file. These 37 | rRNA sequences can be classified in many ways, and again blasting them against 38 | a suitable database is one option. Here we use a simple and fast method (unless 39 | you have too many samples), the classifier tool at RDP (ribosomal database 40 | project). This uses a naive bayesian classifier trained on many sequences of 41 | defined taxonomies. It gives bootstrap support values for each taxonomic level; 42 | usually the support gets lower the further down the hierarchy you go. Genus 43 | level is the lowest level provided. You can use the web service if you prefer, 44 | and upload each file individually, or you can use the uppmax installation of 45 | RDP classifier like this (~4m):: 46 | 47 | mkdir -p ~/metagenomics/cta/rdp 48 | cd ~/metagenomics/cta/rdp 49 | for s in ../sortmerna/*_rrna*.fasta 50 | do java -Xmx1g -jar /glob/inod/src/rdp_classifier_2.6/dist/classifier.jar \ 51 | classify \ 52 | -g 16srrna \ 53 | -b `basename ${s}`.bootstrap \ 54 | -h `basename ${s}`.hier.tsv \ 55 | -o `basename ${s}`.class.tsv \ 56 | ${s} 57 | done 58 | -------------------------------------------------------------------------------- /source/annotation/translation.rst: -------------------------------------------------------------------------------- 1 | ========================================================== 2 | Translating nucleotide sequences into amino acid sequences 3 | ========================================================== 4 | The first step before we can annotate the metagenomes with Pfam domains 5 | using HMMER will be to translate the reads into amino acid sequences. This 6 | is necessary because HMMER (still) does not translate nucleotide sequnces 7 | into protein space on the fly (like,for example, BLAST). For completing 8 | this task we will use ``transeq``, part of the `EMBOSS `_ 9 | package. 10 | 11 | Running ``transeq`` on the sequence data sets 12 | ============================================= 13 | To run ``transeq``, take a look at its available options:: 14 | 15 | transeq -h 16 | 17 | If you have trouble getting ``transeq`` to run, try to run:: 18 | 19 | module load emboss 20 | 21 | A few options are important in this context. First of all, we need to 22 | supply an input file, using the (somewhat bulky) option ``-sequence``. 23 | Second, we also need to specify an output file, otherwise transeq will 24 | simply write its output to the screen. This is specified using the 25 | ``-outseq`` option. 26 | 27 | However, if we just run ``transeq`` like this we will 28 | run into two additional problems. 
First, ``transeq`` by default just 29 | translates the reading frame beginning at the first base in the input sequence, 30 | and will ignore any bases in the reading frames beginning with base two 31 | and three, as well as those on the reverse-complementary strand. Second, 32 | the software will add stop characters in the form of asterisks ``*`` whenever 33 | it encounters a stop codon. This will occasionally cause HMMER to choke, so we 34 | want stop codons to instead be translated into X characters that HMMER can handle. 35 | The following excerpt from the HMMER creator's blog 36 | on this subject is one of my personal all-time favorites in terms of computer 37 | software documentation: 38 | 39 | There’s two ways people do six-frame translation. You can translate each 40 | frame into separate ORFs, or you can translate the read into exactly six 41 | “ORFs”, one per frame, with * characters marking stop codons. HMMER 42 | prefers that you do the former. Technically, * characters aren’t legal 43 | amino acid residue codes, and the author of HMMER3 is a pedantic nitpicker, 44 | passive-aggressive, yet also a pragmatist: so while HMMER3 pragmatically 45 | accepts * characters in input “protein” sequences just fine, it pedantically 46 | relegates them to somewhat suboptimal status, and it passively-aggressively 47 | figures that any suboptimal performance on \*-containing ORFs is your own 48 | fault for using \*’s in the first place. 49 | 50 | To avoid making Sean Eddy angry and causing other problems for our HMMER runs, 51 | we will use the ``-frame 6`` option to ``transeq`` in order to get translations 52 | of all six reading frames, and the ``-clean`` option to convert stop codons to X 53 | instead of \*. 54 | 55 | That should give us the command:: 56 | 57 | transeq -sequence <input file> -outseq <output file> -frame 6 -clean 58 | 59 | Now run this command on all four input files that we have created links to. 60 | When the command has finished for all files, we can move on to the actual 61 | annotation. 62 | -------------------------------------------------------------------------------- /source/assembly/qtrim.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Quality trimming Illumina paired-end reads 3 | ========================================== 4 | In this exercise you will learn how to quality trim Illumina paired-end reads, 5 | generated by the most common Next Generation Sequencing (NGS) technology for metagenomics. 6 | 7 | Sickle 8 | ====== 9 | For quality trimming Illumina paired-end reads we use the tool sickle, which 10 | trims reads from the 3' end to the 5' end using a sliding window. If the mean quality 11 | drops below a specified number, the remaining part of the read will be trimmed. 12 | 13 | 14 | Downloading a test set 15 | ====================== 16 | Today we'll be working on a small metagenomic data set from the anterior nares 17 | (http://en.wikipedia.org/wiki/Anterior_nares). 18 | 19 | .. image:: https://raw.github.com/inodb/2013-metagenomics-workshop-gbg/master/images/nostril.jpg 20 | 21 | 22 | So get ready for your first smell of metagenomic assembly - pun intended. 
Run 23 | all these commands in your shell:: 24 | 25 | # Download the reads and extract them 26 | mkdir -p ~/asm-workshop 27 | mkdir -p ~/asm-workshop/data 28 | cd ~/asm-workshop/data 29 | wget http://downloads.hmpdacc.org/data/Illumina/anterior_nares/SRS018585.tar.bz2 30 | tar -xjf SRS018585.tar.bz2 31 | 32 | If successfull you should have the files:: 33 | 34 | $ ls -lh ~/asm-workshop/data/SRS018585/ 35 | -rw-rw-r-- 1 inod inod 36M Apr 18 2011 SRS018585.denovo_duplicates_marked.trimmed.1.fastq 36 | -rw-rw-r-- 1 inod inod 36M Apr 18 2011 SRS018585.denovo_duplicates_marked.trimmed.2.fastq 37 | -rw-rw-r-- 1 inod inod 6.4M Apr 18 2011 SRS018585.denovo_duplicates_marked.trimmed.singleton.fastq 38 | 39 | If not, try to find out if one of the previous commands gave an error. 40 | 41 | Look at the top of the one of the pairs:: 42 | 43 | cat ~/asm-workshop/data/SRS018585/SRS018585.denovo_duplicates_marked.trimmed.1.fastq | head 44 | 45 | **Question: Can you explain what the different parts of this header mean @HWI-EAS324_102408434:5:100:10055:13493/1?** 46 | 47 | 48 | Running sickle on a paired end library 49 | ====================================== 50 | I like to create directories for specific parts I'm working on and creating 51 | symbolic links (shortcuts in windows) to the input files. One can use the 52 | command ``ln`` for creating links. The difference between a symbolic link and a 53 | hard link can be found here: 54 | http://stackoverflow.com/questions/185899/what-is-the-difference-between-a-symbolic-link-and-a-hard-link. 55 | In this case I use symbolic links so I know what path the original reads have, 56 | which can help one remember what those reads were:: 57 | 58 | mkdir -p ~/asm-workshop/sickle 59 | cd ~/asm-workshop/sickle 60 | ln -s ../data/SRS018585/SRS018585.denovo_duplicates_marked.trimmed.1.fastq pair1.fastq 61 | ln -s ../data/SRS018585/SRS018585.denovo_duplicates_marked.trimmed.2.fastq pair2.fastq 62 | 63 | Now run sickle:: 64 | 65 | # check if sickle is in your PATH 66 | which sickle 67 | # Run sickle 68 | sickle pe \ 69 | -f pair1.fastq \ 70 | -r pair2.fastq \ 71 | -t sanger \ 72 | -o qtrim1.fastq \ 73 | -p qtrim2.fastq \ 74 | -s qtrim.unpaired.fastq 75 | # Check what files have been generated 76 | ls 77 | 78 | Sickle states how many reads it trimmed, but it is always good to be 79 | suspicious! Check if the numbers correspond with the amount of reads you count. 80 | Hint: use ``wc -l``. 81 | 82 | **Question: How many paired reads are left after trimming? How many singletons?** 83 | 84 | **Question: What are the different quality scores that sickle can handle? Why do we specify -t sanger here?** 85 | -------------------------------------------------------------------------------- /mako/templates/binning/phylosift.rst: -------------------------------------------------------------------------------- 1 | =========================================== 2 | Phylogenetic Classification using Phylosift 3 | =========================================== 4 | In this workshop we'll extract interesting bins from the concoct runs and investigate which species they consists of. We'll start by using a plain'ol BLASTN search and later we'll try a more sophisticated strategy with the program Phylosift. 
5 | 6 | Extract bins from CONCOCT output 7 | ================================ 8 | The output from concoct is only a list of cluster id and contig ids respectively, so if we'd like to have fasta files for all our bins, we need to run the following script:: 9 | 10 | ${commands['extract_fasta_help']} 11 | 12 | Running it will create a separate fasta file for each bin, so we'd first like to create a output directory where we can store these files:: 13 | 14 | ${'\n '.join(commands['extract_fasta'])} 15 | 16 | Now you can see a number of bins in your output folder:: 17 | 18 | ${commands['list_bins']} 19 | 20 | Using the graph downloaded in the previous part, decide one cluster you'd like to investigate further. We're going to use the web based BLASTN tool at ncbi, so lets first download the fasta file for the cluster you choose. Execute on a terminal not logged in to UPPMAX:: 21 | 22 | ${commands['download_fasta']} 23 | 24 | Before starting to blasting this cluster, lets begin with the next assignment, since the next assignment will include a long waiting time that suits for running the BLASTN search. 25 | 26 | Phylosift 27 | ========= 28 | Phylosift is a software created for the purpose of determining the phylogenetic composition of your metagenomic data. It uses a defined set of genes to predict the taxonomy of each sequence in your dataset. You can read more about how this works here: http://phylosift.wordpress.com 29 | I've yet to discover how to install phylosift into a common bin, so in order to execute phylosift, you'd have to cd into the phylosift directory:: 30 | 31 | ${commands['move_to_phylosift']} 32 | 33 | Running phylosift will take some time (roughly 45 min) and UPPMAX do not want you to run this kind of heavy jobs on the regular login session, so what we'll do is to allocate an interactive node. For this course we have 16 nodes booked and available for our use so you will not need to wait in line. Start your interactive session with 4 cores available:: 34 | 35 | ${commands['allocate_interactive']} 36 | 37 | Now we have more computational resources available so lets start running phylosift on the cluster you choose (excange x in x.fa for your cluster number). You could also choose to use the clusters from the binning results using a single sample, but then you need to redo the fasta extraction above.:: 38 | 39 | ${'\n '.join(commands['run_phylosift'])} 40 | 41 | While this command is running, go to ncbi web blast service: 42 | 43 | http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome 44 | 45 | Upload your fasta file that you downloaded in the previous step and submit a blast search against the nr/nt database. 46 | Browse through the result and try and see if you can do a taxonomic classification from these. 47 | 48 | When the phylosift run is completed, browse the output directory:: 49 | 50 | ${commands['browse_phylosift']} 51 | 52 | All of these files are interesting, but the most fun one is the html file, so lets download this to your own computer and have a look. Again, switch to a terminal where you're not logged in to UPPMAX:: 53 | 54 | ${commands['download_phylosift']} 55 | 56 | Did the phylosift result correspond to any results in the BLAST output? 
57 | 58 | As you hopefully see, this phylosift result file is quite neat, but it doesn't show its full potential using a pure cluster, so to display the results for a more diverse input file we have prepared a run for the complete dataset:: 59 | 60 | ${commands['browse_all_phylosift']} 61 | 62 | And download this (running it on your own terminal again):: 63 | 64 | ${commands['download_phylosift_all']} 65 | 66 | Can you "find your bin" within this result file? 67 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/genefinding.rst: -------------------------------------------------------------------------------- 1 | ================== 2 | Gene finding 3 | ================== 4 | Now that you have assembled the data into contigs next natural step to do is 5 | annotation of the data, i.e. finding the genes and doing functional annotation 6 | of those. For gene finding a range of programs are available (Metagene 7 | Annotator, MetaGeneMark, Orphelia, FragGeneScan), here we will use Prodigal 8 | which is very fast and has recently been enhanced for metagenomics. We will use 9 | the -p flag which instructs Prodigal to use the algorithm suitable for 10 | metagenomic data. We will use a dataset consisting of 11 samples from a time 11 | series sampling of surface water in the Baltic Sea. Sequencing was done with 12 | Illumina MiSeq here generating on average 835,048 2 x 250 bp reads per sample. 13 | The reads can be found here:: 14 | 15 | /proj/g2014113/metagenomics/comparative-functional-analysis/reads 16 | 17 | The first four numbers in the filename represent a date. All samples are from 18 | 2012. R1 and R2 both contain one read of a pair. They are ordered, so the first 19 | four lines in R1 are paired with the read in the first four lines of R2. They 20 | are in CASAVA v1.8 format (http://en.wikipedia.org/wiki/FASTQ_format). 21 | 22 | A coassembly has already been made with Ray using all reads to save you some 23 | time. You can find the contigs from a combined assembly on reads from all 24 | samples here:: 25 | 26 | /proj/g2014113/metagenomics/cfa/assembly/baltic-sea-ray-noscaf-41.1000.fa 27 | 28 | They have been constructed with Ray using a kmer of 41 and no scaffolding. Only 29 | contigs >= 1000 are in this file. The reason a coassembly is used is that we 30 | can get an idea of the entire metagenome over multiple samples. By mapping the 31 | reads back per sample we can compare coverages of contigs between samples. 32 | 33 | **Question: What could be a possible advantage/disadvantage for the assembly 34 | process when assembling multiple samples at one time?** 35 | 36 | .. Advantage: more coverage. Disadvantage: more related strains/species makes 37 | .. graph traversal harder 38 | 39 | **Question: Can you think of other approaches to get a coassembly?** 40 | 41 | .. Maybe map contigs against each other in merge them in that way. Preferably 42 | .. taking coverages into account 43 | 44 | Note that all solutions (i.e. the generated outputs) for the exercises are also in:: 45 | 46 | /proj/g2014113/metagenomics/cfa/ 47 | 48 | In all the following exercises you should again use the virtual environment to 49 | get all the necessary programs (unless you already loaded it ofc):: 50 | 51 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 52 | 53 | It’s time to run Prodigal. 
First create an output directory with a copy of the 54 | contig file:: 55 | 56 | mkdir -p ~/metagenomics/cfa/prodigal 57 | cd ~/metagenomics/cfa/prodigal 58 | cp /proj/g2014113/metagenomics/cfa/assembly/baltic-sea-ray-noscaf-41.1000.fa . 59 | 60 | Then run Prodigal on the contig file (~2m20):: 61 | 62 | prodigal -a baltic-sea-ray-noscaf-41.1000.aa.fa \ 63 | -d baltic-sea-ray-noscaf-41.1000.nuc.fa \ 64 | -i baltic-sea-ray-noscaf-41.1000.fa \ 65 | -f gff -p meta \ 66 | > baltic-sea-ray-noscaf-41.1000.gff 67 | 68 | This will produce 3 files: 69 | 70 | * ``-d`` a fasta file with the gene sequences (nucleotides) 71 | * ``-a`` a fasta file with the protein sequences (aminoacids) 72 | * ``stdout`` a gff file 73 | 74 | The gff format is a standardised file type for showing annotations.It’s a tab 75 | delimited file that can be viewed by e.g. :: 76 | 77 | less baltic-sea-ray-noscaf-41.1000.gff 78 | 79 | Pass the option -S to less if you don’t want lines to wrap 80 | 81 | An explanation of the gff format can be found at 82 | http://genome.ucsc.edu/FAQ/FAQformat.html 83 | 84 | **Question: How many coding regions were found by Prodigal? Hint: use grep -c** 85 | 86 | .. less *.gff | grep -c 'CDS' 87 | .. 23577 88 | 89 | **Question: How many contigs have coding regions? How many do not?** 90 | 91 | .. less *.gff | grep '^contig' | grep 'CDS' | awk '{print $1}' | sort -u | wc -l 92 | .. 8517 93 | .. grep -c '^>cont' baltic-sea-ray-noscaf-41.1000.fa 94 | .. 8533 95 | .. 8533-8517=16 96 | -------------------------------------------------------------------------------- /source/assembly/map.rst: -------------------------------------------------------------------------------- 1 | ============================================ 2 | Mapping reads back to the assembly 3 | ============================================ 4 | 5 | Overview 6 | ====================== 7 | 8 | There are many different mappers available to map your reads back to the 9 | assemblies. Usually they result in a SAM or BAM file 10 | (http://genome.sph.umich.edu/wiki/SAM). Those are formats that contain the 11 | alignment information, where BAM is the binary version of the plain text SAM 12 | format. In this tutorial we will be using bowtie2 13 | (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml). 14 | 15 | 16 | The SAM/BAM file can afterwards be processed with Picard 17 | (http://picard.sourceforge.net/) to remove duplicate reads. Those are likely to 18 | be reads that come from a PCR duplicate (http://www.biostars.org/p/15818/). 19 | 20 | 21 | BEDTools (http://code.google.com/p/bedtools/) can then be used to retrieve 22 | coverage statistics. 23 | 24 | 25 | There is a script available that does it all at once. Read it and try to 26 | understand what happens in each step:: 27 | 28 | less `which map-bowtie2-markduplicates.sh` 29 | map-bowtie2-markduplicates.sh -h 30 | 31 | Bowtie2 has some nice documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml 32 | 33 | **Question: what does bowtie2-build do?** 34 | 35 | Picard's documentation also exists! Two bioinformatics programs in a row with 36 | decent documentation! 
Take a moment to celebrate, then have a look here: 37 | http://sourceforge.net/apps/mediawiki/picard/index.php 38 | 39 | **Question: Why not just remove all identitical pairs instead of mapping them 40 | and then removing them?** 41 | 42 | **Question: What is the difference between samtools rmdup and Picard MarkDuplicates?** 43 | 44 | 45 | 46 | Mapping reads with bowtie2 47 | ========================== 48 | Take an assembly and try to map the reads back using bowtie2. Do this on an 49 | interactive node again, and remember to change the 'out_21' part to the actual output directory that you generated:: 50 | 51 | # Create a new directory and link files 52 | mkdir -p ~/asm-workshop/bowtie2 53 | cd ~/asm-workshop/bowtie2 54 | ln -s ../velvet/out_21/contigs.fa contigs.fa 55 | ln -s ../sickle/pair1.fastq pair1.fastq 56 | ln -s ../sickle/pair2.fastq pair2.fastq 57 | 58 | # Run the everything in one go script. 59 | map-bowtie2-markduplicates.sh -t 1 -c pair1.fastq pair2.fastq pair contigs.fa contigs map > map.log 2> map.err 60 | 61 | Inspect the ``map.log`` output and see if all went well. 62 | 63 | **Question: What is the overall alignment rate of your reads that bowtie2 reports?** 64 | 65 | Add the answer to the doc_. 66 | 67 | 68 | Some general statistics from the SAM/BAM file 69 | ============================================= 70 | You can also determine mapping statistics directly from the bam file. Use for 71 | instance:: 72 | 73 | # Mapped reads only 74 | samtools view -c -F 4 map/contigs_pair-smds.bam 75 | 76 | # Unmapped reads only 77 | samtools view -c -f 4 map/contigs_pair-smds.bam 78 | 79 | From: 80 | http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/. 81 | The number is different from the number that bowtie2 reports, because these are 82 | the numbers after removing duplicates. The ``-smds`` part stands for running 83 | ``samtools sort``, ``MarkDuplicates.jar`` and ``samtools sort`` again on the 84 | bam file. If all went well with the mapping there should also be a 85 | ``map/contigs_pair-smd.metrics`` file where you can see the percentage of 86 | duplication. Add that to the doc_ as well. 87 | 88 | 89 | Coverage information from BEDTools 90 | ============================================= 91 | Look at the output from BEDTools:: 92 | 93 | less map/contigs_pair-smds.coverage 94 | 95 | The format is explained here 96 | http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html. The 97 | ``map-bowtie2-markduplicates.sh`` script also outputs the mean coverage per 98 | contig:: 99 | 100 | less map/contigs_pair-smds.coverage.percontig 101 | 102 | **Question: What is the contig with the highest coverage? Hint: use sort -k** 103 | 104 | .. 
_doc: https://docs.google.com/spreadsheet/ccc?key=0AvduvUOYAB-_dDdDSVhqUi1KQmJkTlZJcHVfMGI3a2c#gid=3 105 | -------------------------------------------------------------------------------- /mako/settings.yaml: -------------------------------------------------------------------------------- 1 | commands: 2 | activate: 'source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate' 3 | check_activate: 4 | concoct: 'concoct -h' 5 | rpsblast: 'rpsblast --help' 6 | copy_dataset: 7 | - 'mkdir -p ~/binning-workshop' 8 | - 'mkdir -p ~/binning-workshop/data' 9 | - 'cd ~/binning-workshop/' 10 | - 'cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000.fa data/' 11 | - 'cp /proj/g2014113/nobackup/concoct-workshop/120322_coverage_nolen.tsv data/' 12 | browse_coverage: 'less -S ~/binning-workshop/data/120322_coverage_nolen.tsv' 13 | cut_coverage: 'cut -f1,3 ~/binning-workshop/data/120322_coverage_nolen.tsv > ~/binning-workshop/data/120322_coverage_one_sample.tsv' 14 | run_concoct: 15 | one_sample: 16 | - 'mkdir -p ~/binning-workshop/concoct_output' 17 | - 'concoct --coverage_file ~/binning-workshop/data/120322_coverage_one_sample.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_one_sample/' 18 | look_clustering: 'less ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv' 19 | all_samples: 20 | 'concoct --coverage_file ~/binning-workshop/data/120322_coverage_nolen.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_all_samples/' 21 | copy_blast: 22 | - 'cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000_blast.out ~/binning-workshop/data/' 23 | - 'cp /proj/g2014113/nobackup/concoct-workshop/scg_cogs_min0.97_max1.03_unique_genera.txt ~/binning-workshop/data/' 24 | - 'cp /proj/g2014113/nobackup/concoct-workshop/cdd_to_cog.tsv ~/binning-workshop/data/' 25 | check_cog_scripts: 26 | - 'COG_table.py -h' 27 | - 'COGPlot.R -h' 28 | cogplot_single: 29 | - 'COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_single_sample.tsv' 30 | - 'COGPlot.R -s ~/binning-workshop/cog_table_3000_single_sample.tsv -o ~/binning-workshop/cog_plot_3000_single_sample.pdf' 31 | download_single_cogplot: 'scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_single_sample.pdf ~/Desktop/' 32 | cogplot_multiple: 33 | - 'COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_all_samples.tsv' 34 | - 'COGPlot.R -s ~/binning-workshop/cog_table_3000_all_samples.tsv -o ~/binning-workshop/cog_plot_3000_all_samples.pdf' 35 | download_multiple_cogplot: 'scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_all_samples.pdf ~/Desktop' 36 | extract_fasta_help: 'extract_fasta_bins.py -h' 37 | extract_fasta: 38 | - 'mkdir -p ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins' 39 | - 'extract_fasta_bins.py ~/binning-workshop/data/Contigs_gt1000.fa ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv --output_path 
~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/' 40 | download_fasta: 'scp username@milou.uppmax.uu.se:~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa ~/Desktop/' 41 | list_bins: 'ls ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins' 42 | allocate_interactive: 'interactive -A g2014113 -p core -n 4 -t 4:00:00' 43 | move_to_phylosift: 'cd /proj/g2014113/src/phylosift_v1.0.1' 44 | run_phylosift: 45 | - 'mkdir -p ~/binning-workshop/phylosift_output/' 46 | - '/proj/g2014113/src/phylosift_v1.0.1/phylosift all -f --output ~/binning-workshop/phylosift_output/ ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa' 47 | browse_phylosift: 'ls ~/binning-workshop/phylosift_output/' 48 | download_phylosift: 'scp username@milou.uppmax.uu.se:~/binning-workshop/phylosift_output/x.fa.html ~/Desktop/' 49 | browse_all_phylosift: 'ls /proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/' 50 | download_phylosift_all: 'scp username@milou.uppmax.uu.se:/proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/Contigs_gt1000.fa.html ~/Desktop/' 51 | -------------------------------------------------------------------------------- /source/annotation/software.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Checking required software 3 | ========================================== 4 | Before we begin, we will quickly go through the required software and datasets 5 | for this workshop. For those who are already command-line-skilled there will 6 | also be a possibility to install the Metaxa2 tool for rRNA finding, but this 7 | is not required to complete the workshop. 8 | 9 | Programs used in this workshop 10 | ============================== 11 | The following programs are used in this workshop: 12 | 13 | - `EMBOSS (transeq)`__ 14 | - HMMER_ 15 | - R_ 16 | - Optionally: Metaxa2_ 17 | 18 | .. __: http://emboss.sourceforge.net 19 | .. _HMMER: http://hmmer.janelia.org 20 | .. _R: http://www.r-project.org 21 | .. _Metaxa2: http://microbiology.se/software/metaxa2/ 22 | 23 | Since we are going to use the plotting functionality of R, we need to login 24 | to Uppmax with X11 forwarding turned on. In the Unix/Linux terminal this is 25 | easily achieved by adding the ``-X`` (captal X) option. All programs but 26 | Metaxa2 are already installed, all you have to do is load the virtual 27 | environment for this workshop. Once you are logged in to the server run:: 28 | 29 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 30 | 31 | You deactivate the virtual environment with:: 32 | 33 | deactivate 34 | 35 | NOTE: This is a python virtual environment. The binary folder of the virtual 36 | environment has symbolic links to all programs used in this workshop so you 37 | should be able to run those without problems. 38 | 39 | 40 | Check all programs in one go with which 41 | ================================================== 42 | To check whether you have all programs installed in one go, you can use ``which`` 43 | to test for the following programs:: 44 | 45 | hmmsearch 46 | transeq 47 | R 48 | blastall 49 | 50 | Data and databases used in this workshop 51 | ======================================== 52 | In this workshop, we are (due to time constraints) going to use a simplified version 53 | of the `Pfam `__ database, including only protein families 54 | related to plasmid replication and maintenance. 
This database is pre-compiled and can 55 | be downloaded from http://microbiology.se/teach/scilife2014/pfam.tar.gz. 56 | Download it using the following commands:: 57 | 58 | mkdir -p ~/Pfam 59 | cd ~/Pfam 60 | wget http://microbiology.se/teach/scilife2014/pfam.tar.gz 61 | tar -xzvf pfam.tar.gz 62 | cd ~ 63 | 64 | In addition, you will need to obtain the following data sets for the workshop:: 65 | 66 | /proj/g2014113/metagenomics/annotation/baltic1.fna 67 | /proj/g2014113/metagenomics/annotation/baltic2.fna 68 | /proj/g2014113/metagenomics/annotation/indian_lake.fna 69 | /proj/g2014113/metagenomics/annotation/swedish_lake.fna 70 | 71 | We are going to use two data sets from the Baltic Sea, one from a Swedish lake and one 72 | from an Indian lake contaminated with wastewater from pharmaceutical production. For 73 | the sake of time, I have reduced the data sets in size dramatically prior to this 74 | workshop. You can create links to the above files using the ``ln -s`` command. 75 | Use it on all four data sets. 76 | 77 | 78 | (Optional exercise) Install Metaxa2 by yourself 79 | ================================================ 80 | Follow these steps only if you want to install ``Metaxa2`` by yourself. 81 | The code for Metaxa2 is available from http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz. 82 | You can install Metaxa2 as follows:: 83 | 84 | # Create a src and a bin directory 85 | mkdir -p ~/src 86 | mkdir -p ~/bin 87 | 88 | # Go to the source directory and download the Metaxa2 tarball 89 | cd ~/src 90 | wget http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz 91 | tar -xzvf Metaxa2_2.0rc3.tar.gz 92 | cd Metaxa2_2.0rc3 93 | 94 | # Run the installation script 95 | ./install_metaxa2 96 | 97 | # Try to run Metaxa2 (this should bring up the main options for the software) 98 | metaxa2 -h 99 | 100 | If this did not work, you can try this manual approach:: 101 | 102 | cd ~/src/Metaxa2_2.0rc3 103 | cp -r metaxa2* ~/bin/ 104 | 105 | # Then try to run Metaxa2 again 106 | metaxa2 -h 107 | 108 | If this brings up the help message, you are all set! 109 | -------------------------------------------------------------------------------- /source/binning/phylosift.rst: -------------------------------------------------------------------------------- 1 | =========================================== 2 | Phylogenetic Classification using Phylosift 3 | =========================================== 4 | In this workshop we'll extract interesting bins from the concoct runs and investigate which species they consist of. We'll start by using a plain ol' BLASTN search and later we'll try a more sophisticated strategy with the program Phylosift. 
5 | 6 | Extract bins from CONCOCT output 7 | ================================ 8 | The output from concoct is only a list of cluster id and contig ids respectively, so if we'd like to have fasta files for all our bins, we need to run the following script:: 9 | 10 | extract_fasta_bins.py -h 11 | 12 | Running it will create a separate fasta file for each bin, so we'd first like to create a output directory where we can store these files:: 13 | 14 | mkdir -p ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins 15 | extract_fasta_bins.py ~/binning-workshop/data/Contigs_gt1000.fa ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv --output_path ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/ 16 | 17 | Now you can see a number of bins in your output folder:: 18 | 19 | ls ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins 20 | 21 | Using the graph downloaded in the previous part, decide one cluster you'd like to investigate further. We're going to use the web based BLASTN tool at ncbi, so lets first download the fasta file for the cluster you choose. Execute on a terminal not logged in to UPPMAX:: 22 | 23 | scp username@milou.uppmax.uu.se:~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa ~/Desktop/ 24 | 25 | Before starting to blasting this cluster, lets begin with the next assignment, since the next assignment will include a long waiting time that suits for running the BLASTN search. 26 | 27 | Phylosift 28 | ========= 29 | Phylosift is a software created for the purpose of determining the phylogenetic composition of your metagenomic data. It uses a defined set of genes to predict the taxonomy of each sequence in your dataset. You can read more about how this works here: http://phylosift.wordpress.com 30 | I've yet to discover how to install phylosift into a common bin, so in order to execute phylosift, you'd have to cd into the phylosift directory:: 31 | 32 | cd /proj/g2014113/src/phylosift_v1.0.1 33 | 34 | Running phylosift will take some time (roughly 45 min) and UPPMAX do not want you to run this kind of heavy jobs on the regular login session, so what we'll do is to allocate an interactive node. For this course we have 16 nodes booked and available for our use so you will not need to wait in line. Start your interactive session with 4 cores available:: 35 | 36 | interactive -A g2014113 -p core -n 4 -t 4:00:00 37 | 38 | Now we have more computational resources available so lets start running phylosift on the cluster you choose (excange x in x.fa for your cluster number). You could also choose to use the clusters from the binning results using a single sample, but then you need to redo the fasta extraction above.:: 39 | 40 | mkdir -p ~/binning-workshop/phylosift_output/ 41 | /proj/g2014113/src/phylosift_v1.0.1/phylosift all -f --output ~/binning-workshop/phylosift_output/ ~/binning-workshop/concoct_output/3000_all_samples/fasta_bins/x.fa 42 | 43 | While this command is running, go to ncbi web blast service: 44 | 45 | http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome 46 | 47 | Upload your fasta file that you downloaded in the previous step and submit a blast search against the nr/nt database. 48 | Browse through the result and try and see if you can do a taxonomic classification from these. 
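If the web form is busy or you prefer to stay on the command line, the same kind of search can in principle be run with BLAST+ and its remote option (a sketch only, not part of the workshop setup: it assumes a local ``blastn`` binary from BLAST+, sends your query to NCBI and can be slow for large bins; replace ``x.fa`` with your chosen cluster as before)::

    blastn -query x.fa -db nt -remote \
        -outfmt '6 qseqid sseqid pident length evalue stitle' \
        -max_target_seqs 5 -out x_blastn.tsv
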
49 | 50 | When the phylosift run is completed, browse the output directory:: 51 | 52 | ls ~/binning-workshop/phylosift_output/ 53 | 54 | All of these files are interesting, but the most fun one is the html file, so let's download this to your own computer and have a look. Again, switch to a terminal where you're not logged in to UPPMAX:: 55 | 56 | scp username@milou.uppmax.uu.se:~/binning-workshop/phylosift_output/x.fa.html ~/Desktop/ 57 | 58 | Did the phylosift result correspond to any results in the BLAST output? 59 | 60 | As you hopefully see, this phylosift result file is quite neat, but it doesn't show its full potential using a pure cluster, so to display the results for a more diverse input file we have prepared a run for the complete dataset:: 61 | 62 | ls /proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/ 63 | 64 | And download this (running it on your own terminal again):: 65 | 66 | scp username@milou.uppmax.uu.se:/proj/g2014113/nobackup/concoct-workshop/phylosift_output/all/Contigs_gt1000.fa.html ~/Desktop/ 67 | 68 | Can you "find your bin" within this result file? 69 | 70 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ================================== 2 | Metagenomics Workshop SciLifeLab 3 | ================================== 4 | 5 | This repository holds the code for the website of the metagenomics workshop 6 | held at SciLifeLab, Stockholm 21-23 May 2014. The website is written using 7 | Sphinx_. The webpage can be found at: 8 | 9 | http://inodb.github.io/2014-5-metagenomics-workshop/ 10 | 11 | and 12 | 13 | http://2014-5-metagenomics-workshop.readthedocs.org/ 14 | 15 | How does it work? 16 | ------------------------- 17 | In short, we use a python package called Sphinx_ to convert a bunch of text 18 | files written in reStructuredText_ (reST) to HTML pages. Instead of editing the 19 | HTML directly you change text files in the reST_ format. Those are the 20 | ``*.rst`` files in the `source directory`_. That's all you need to know to 21 | start `Contributing`_. 22 | 23 | Contributing 24 | ------------- 25 | We follow the Fork_ & pull_ model. It's not necessary to do anything on the 26 | command line. All you have to do is click on fork. Then you can edit the 27 | ``*.rst`` files directly through the GitHub interface if you want. Only the 28 | Sphinx-specific commands will not work, such as the table of contents command 29 | ``toctree``. You can also `add new files`_ by clicking on the plus symbol next 30 | to a directory. After you are satisfied with your changes you click on the pull 31 | request button. Do note that changing the ``*.rst`` files does not change the 32 | actual webpage, but somebody else (e.g. me) can do that for you. If you want 33 | to learn how to compile the ``*.rst`` files to ``*.html``, please read on. 34 | 35 | Compile the reST files to HTML locally 36 | --------------------------------------- 37 | The only thing that is a bit more tricky is actually compiling the ``*.rst`` 38 | files to ``*.html`` files. This is not necessary to contribute since you can 39 | see the results on GitHub (GitHub shows ``*.rst`` files as they would look 40 | in HTML by default). If you want to compile the files locally you would do:: 41 | 42 | pip install sphinx # install sphinx 43 | git clone https://github.com/inodb/2014-5-metagenomics-workshop && cd 2014-5-metagenomics-workshop 44 | make html 45 | 46 | The resulting HTML pages are in the folder ``build/``.
You can open the files 47 | in your browser by typing e.g. 48 | ``file:///home/inodb/path/to/build/html/index.html`` in the address bar. If you 49 | want to make changes you should: 50 | 51 | 1. fork_ this repo 52 | 2. clone your forked repo 53 | 3. Make the changes to the ``*.rst`` files 54 | 4. run ``make html`` 55 | 5. look at the results 56 | 6. add the changes with ``git add files that you changed`` 57 | 7. commit the changes with ``git commit`` 58 | 8. push the changes to your own repo with ``git push`` 59 | 9. do a pull_ request by clicking on the pull request button on the GitHub page 60 | of your repo 61 | 62 | This only changes the ``*.rst`` files in the ``master`` branch, not the actual 63 | webpage, which is in the ``gh-pages`` branch. How that is set up is explained 64 | in the section below. 65 | 66 | Compile the reST files to HTML on milou 67 | --------------------------------------- 68 | The generated docs can be found on bit.ly/metalove. The HTML files are located in 69 | ``/proj/g2014113/webexport/``. To update those files you first clone the repository 70 | somewhere on milou. Then load the virtual environment of the workshop:: 71 | 72 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 73 | 74 | Then from the root dir of the repository run:: 75 | 76 | make milou 77 | 78 | The HTML files will then be updated. Obviously you should be part of the g2014113 project. 79 | 80 | Updating the HTML to GitHub Pages 81 | -------------------------------------- 82 | The website is hosted on `GitHub Pages`_. It works by having a branch called 83 | ``gh-pages`` on this repository, which has all the HTML. I used 84 | brantfaircloth's `sphinx_to_github.sh`_ script to set it up. Basically it sets 85 | up a ``gh-pages`` branch in the ``build/html`` folder of the repository, so 86 | every time you run ``make html`` it changes the files in that branch. You then 87 | ``cd build/html``, commit the new HTML files and push them to the ``gh-pages`` 88 | branch. After that the result can be viewed at: 89 | 90 | http://yourusername.github.io/reponame/ 91 | 92 | I'll update the branch ``gh-pages`` myself after your pull request with the 93 | changed ``*.rst`` files on the ``master`` branch has been accepted. 94 | 95 | 96 | .. _sphinx: http://sphinx-doc.org/ 97 | .. _fork: https://help.github.com/articles/fork-a-repo 98 | .. _pull: https://help.github.com/articles/using-pull-requests 99 | .. _reStructuredText: http://sphinx-doc.org/rest.html 100 | .. _reST: http://sphinx-doc.org/rest.html 101 | .. _source directory: https://github.com/inodb/2014-5-metagenomics-workshop/tree/master/source 102 | .. _GitHub Pages: https://pages.github.com/ 103 | .. _add new files: https://github.com/blog/1327-creating-files-on-github 104 | .. _sphinx_to_github.sh: https://gist.github.com/brantfaircloth/791759 105 | -------------------------------------------------------------------------------- /source/assembly/assembly.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Assembling reads with Velvet 3 | ========================================== 4 | In this exercise we will learn how to perform an assembly with Velvet. Velvet 5 | takes your reads as input and turns them into contigs. It consists of two 6 | steps. In the first step, ``velveth``, the de Bruijn graph is created. 7 | Afterwards the graph is traversed and contigs are created with ``velvetg``. 8 | When constructing the de Bruijn graph, a *kmer* length has to be specified.
Reads are 9 | cut up into pieces of length *k*, each representing a node in the graph, while edges 10 | represent an overlap (some de Bruijn graph assemblers do this differently, but 11 | the idea is the same). The advantage of using kmer overlap instead of read 12 | overlap is that the computational requirements grow with the number of unique 13 | kmers instead of unique reads. A more detailed explanation can be found at 14 | http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html. 15 | 16 | 17 | Pick a kmer 18 | =========== 19 | Please work in pairs for this assignment. Every group can select a kmer of 20 | their liking - pick a random one if you haven't developed a preference yet. 21 | Write your and your partner's names down at a kmer on the 22 | Google doc_ for this workshop. 23 | 24 | .. _doc: https://docs.google.com/spreadsheet/ccc?key=0AvduvUOYAB-_dDdDSVhqUi1KQmJkTlZJcHVfMGI3a2c#gid=3 25 | 26 | velveth 27 | ======= 28 | Create the graph data structure with ``velveth``. Again like we did with 29 | ``sickle``, first create a directory with symbolic links to the pairs that you 30 | want to use:: 31 | 32 | mkdir -p ~/asm-workshop/velvet 33 | cd ~/asm-workshop/velvet 34 | ln -s ../sickle/qtrim1.fastq pair1.fastq 35 | ln -s ../sickle/qtrim2.fastq pair2.fastq 36 | 37 | The reads need to be interleaved for ``velveth``:: 38 | 39 | shuffleSequences_fastq.pl pair1.fastq pair2.fastq pair.fastq 40 | 41 | Run velveth over the kmer you picked (21 in this example):: 42 | 43 | velveth out_21 21 -fastq -shortPaired pair.fastq 44 | 45 | Check what directories have been created:: 46 | 47 | ls 48 | 49 | velvetg 50 | ======= 51 | To get the actual contigs you will have to run ``velvetg`` on the created 52 | graph. You can vary options such as expected coverage and the coverage cut-off if 53 | you want, but we do not do that in this tutorial. We only choose not to do 54 | scaffolding:: 55 | 56 | velvetg out_21 -scaffolding no 57 | 58 | 59 | assemstats 60 | ========== 61 | After the assembly one wants to look at the length distributions of the 62 | resulting assemblies. You can use the ``assemstats`` script for that:: 63 | 64 | assemstats 100 out_*/contigs.fa 65 | 66 | Try to find out what each of the stats represents by varying the cut-off. One of 67 | the most often used statistics in assembly length distribution comparisons is 68 | the *N50 length*, a weighted median, where you weight each contig by its 69 | length. This way you assign more weight to larger contigs. Fifty percent of all 70 | the bases in the assembly are contained in contigs shorter than or equal to the N50 71 | length. Once you have gotten an idea of what all the stats mean, it is time 72 | to compare your results with the other attendees of this workshop. Generate the results and copy them to the doc_:: 73 | 74 | assemstats 100 out_*/contigs.fa 75 | 76 | Do the same for the cut-off at 1000 and add it to the doc_. Compare your kmer 77 | against the others. If there are very few results available yet, this would be an 78 | ideal time to help out some other attendees or do the same exercise for a kmer 79 | that has not been picked by somebody else yet. Please write down your and your 80 | partner's names again at the doc_ in that case. 81 | 82 | 83 | **Question: What are the important length statistics? Do we prefer sum over 84 | length? Should it be a combination?** 85 | 86 | Think of a formula that could indicate the best preferred 87 | length distribution where you express the optimization function in terms of the 88 | column names from the doc_.
For instance only ``n50_len`` or ``sum * 89 | n50_len``. 90 | 91 | 92 | (Optional exercise) Ray 93 | ======================= 94 | Try to create an assembly with Ray over the same kmer. Ray is an assembler that 95 | uses MPI to distribute the assembly over multiple cores and nodes. The latest 96 | version of Ray was made to work well with metagenomics data as well:: 97 | 98 | mkdir -p ~/asm-workshop/ray 99 | cd ~/asm-workshop/ray 100 | ln -s ../sickle/qtrim1.fastq pair1.fastq 101 | ln -s ../sickle/qtrim2.fastq pair2.fastq 102 | mpiexec -n 1 Ray -k 21 -p pair1.fastq pair2.fastq -o out_21 103 | 104 | Add the ``assemstats`` results to the doc_ as you did for Velvet. There is a 105 | separate tab for the Ray assemblies, compare the results with Velvet. 106 | 107 | (Optional exercise) VelvetOptimiser 108 | =================================== 109 | VelvetOptimiser_ is a script that runs Velvet multiple times and follows the 110 | optimization function you give it. Use VelvetOptimiser_ to find the assembly 111 | that gets the best score for the optimization function you designed in 112 | `assemstats`_. It requires ``BioPerl``, which you can get on uppmax with 113 | ``module load BioPerl``. 114 | 115 | .. _VelvetOptimiser: https://github.com/Victorian-Bioinformatics-Consortium/VelvetOptimiser 116 | -------------------------------------------------------------------------------- /source/annotation/normalization.rst: -------------------------------------------------------------------------------- 1 | ========================================================== 2 | Normalization of count data from the metagenomic data sets 3 | ========================================================== 4 | An important aspects of working with metagenomics is to apply proper 5 | normalization procedures to the retrieved counts. There are several 6 | ways to do this, and in part the method of choice is dependent on 7 | the research question investigated, but in part also based on more 8 | philosphical considerations. Let's start with a bit of theory. 9 | 10 | Why is normalization important? 11 | =============================== 12 | Generally, sequencing data sets are not of the same size. In addition, 13 | different genes and genomes come in different sizes, which means that 14 | *at equal coverage, the number of mapped reads to a certain gene or 15 | region will be directly dependent on the length of that region*. 16 | Luckily, the latter scenario is not a huge issue for Pfam families 17 | (although it exists), and we will not care about it more today. We 18 | will however care about the size of the sequencing libraries. To make 19 | relatively fair comparisons between sets, we need to normalize the 20 | gene counts to something. Let's begin with checking how unequal the 21 | librairies are. You can do that by counting the number of sequences 22 | in the FASTA files, by checking for the number of ">" characters in 23 | each file, using ``grep``:: 24 | 25 | grep -c ">" 26 | 27 | As you will see, there are quite substantial differences in the 28 | number of reads in each library. How do we account for that? 29 | 30 | What normalization methods are possible? 31 | ======================================== 32 | 33 | The choice of normalization method will depend on what research 34 | question we want to ask. An easy way of removing the technical 35 | bias related to different sequencing effort in different libraries 36 | is to simply divide each gene count with the total library size. 
37 | That will yield a relative proportion of counts to that gene. To 38 | make that number easier to interpret, we can multiply it by 39 | 1,000,000 to get *the number of reads corresponding to that gene 40 | or feature per million reads*. 41 | 42 | (counts of gene X / total number of reads) * 1000000 43 | 44 | This is a quick way of normalizing, but it does not consider 45 | the composition of the sample. Say that you are interested in 46 | studying bacterial gene content within e.g. different plant hosts. 47 | Then the interesting changes in bacterial composition might be 48 | drowned by genetic material from the host plant. That will then 49 | have a huge impact on the gene abundances of the bacteria, even if 50 | those abundances are actually the same. The same applies to complex 51 | microbial communities with both bacteria, single-cell eukaryotes 52 | and viruses. In such cases, it might be better to consider a 53 | normalization to the number of bacteria in the sample (or eukaryotes 54 | if that is what you want to study). One way of doing that is to 55 | count the number of reads mapping to the 16S rRNA gene in each 56 | sample. You can then divide each gene count with the number of 57 | 16S rRNA counts, to yield a genes per 16S proportion. 58 | 59 | (counts of gene X / counts of 16S rRNA gene) 60 | 61 | There is a few problems with using the 16S rRNA gene in this way. 62 | The most prominient one is that the gene exists in a single copy in 63 | some bacteria, but in multiple (sometimes >10) copies in other 64 | species. That means that this number will not truly be a per-genome 65 | estimate. Other genetic markers, such as the *rpoB* gene has been 66 | suggested for this, but has not yet taken off. 67 | 68 | Finally, we could imagine a scenario in which you are only 69 | interested in the proportion of different annotated features in 70 | your sample. One can then instead divide to the total number of 71 | reads mapped to *something* in the database used. That will give 72 | relative proportions, and will remove a lot of "noise", but will 73 | have the limitation that only the well-defined part of the 74 | microbial community can be studied, and the rest is ignored. 75 | 76 | (counts of gene X / total number of mapped reads) 77 | 78 | 79 | Trying out some normalization methods 80 | ===================================== 81 | We are now ready to try out these methods on our data. Let's begin 82 | generating the numbers we need for normalization. We begin with the 83 | library sizes. As you remember, those numbers can be generated using 84 | ``grep``:: 85 | 86 | grep -c ">" 87 | 88 | To get the number of 16S rRNA sequences, we will use Metaxa2. If you 89 | did not install it, you can "cheat" by getting the numbers from this 90 | file: ``/proj/g2014113/metagenomics/annotation/metaxa2_16S_rRNA_counts.txt``. 91 | If you installed it previously, you can test it out using the following 92 | command:: 93 | 94 | metaxa2 -i -o --cpu 16 --align none 95 | 96 | Metaxa2 will take a few minutes to run. You will then be able to 97 | get the number of bacterial 16S rRNA sequences from the file ending 98 | with .summary.txt. 99 | 100 | Finally, we would like to get the number of reads mapping to *any* 101 | Pfam family in the database. To get that number, we can again use 102 | ``grep``. This time however, we will use it to *remove* the entries 103 | that we are not interested in, and counting the rest. 
This can be 104 | done by:: 105 | 106 | grep -c -v "^#" 107 | 108 | That will remove all lines beginning with a ``#`` character, and 109 | count all remaining lines. Write down all the numbers that you have 110 | got during this exercise; we will use them in the next step! 111 | -------------------------------------------------------------------------------- /source/assembly/reqs.rst: -------------------------------------------------------------------------------- 1 | ========================================== 2 | Checking required software 3 | ========================================== 4 | An often occurring theme in bioinformatics is installing software. Here we will 5 | go over some steps to help you check whether you actually have the right 6 | software installed. There's an optional exercise on how to install ``sickle``. 7 | 8 | Programs used in this workshop 9 | ============================== 10 | The following programs are used in this workshop: 11 | 12 | - Bowtie2_ 13 | - Velvet_ 14 | - samtools_ 15 | - sickle_ 16 | - Picard_ 17 | - Ray_ 18 | 19 | .. _Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml 20 | .. _Velvet: http://www.ebi.ac.uk/~zerbino/velvet/ 21 | .. _xclip: http://sourceforge.net/projects/xclip/ 22 | .. _parallel: https://www.gnu.org/software/parallel/ 23 | .. _samtools: http://samtools.sourceforge.net/ 24 | .. _CD-HIT: https://code.google.com/p/cdhit/ 25 | .. _AMOS: http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS 26 | .. _sickle: https://github.com/najoshi/sickle 27 | .. _Picard: http://picard.sourceforge.net/index.shtml 28 | .. _Ray: http://denovoassembler.sourceforge.net/ 29 | 30 | All programs are already installed; all you have to do is load the virtual 31 | environment for this workshop. Once you are logged in to the server run:: 32 | 33 | source /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/activate 34 | 35 | You deactivate the virtual environment with:: 36 | 37 | deactivate 38 | 39 | NOTE: This is a Python virtual environment. The binary folder of the virtual 40 | environment has symbolic links to all programs used in this workshop so you 41 | should be able to run those without problems. 42 | 43 | 44 | Using which to locate a program 45 | =============================== 46 | An easy way to determine whether you have a certain program installed is 47 | by typing:: 48 | 49 | which programname 50 | 51 | where ``programname`` is the name of the program you want to use. The program 52 | ``which`` searches all directories in ``$PATH`` for the executable file 53 | ``programname`` and returns the path of the first found hit. This is exactly 54 | what happens when you would just type ``programname`` on the command line, but 55 | then ``programname`` is also executed. To see what your ``$PATH`` looks like, 56 | simply ``echo`` it:: 57 | 58 | echo $PATH 59 | 60 | For more information on the ``$PATH`` variable see this link: 61 | http://www.linfo.org/path_env_var.html. 62 | 63 | Check all programs in one go with which 64 | ================================================== 65 | To check whether you have all programs installed in one go, you can use ``which``. 66 | 67 | bowtie2 68 | bowtie2-build 69 | velveth 70 | velvetg 71 | shuffleSequences_fastq.pl 72 | parallel 73 | samtools 74 | Ray 75 | 76 | 77 | We will now iterate over all the programs, calling ``which`` on each of them.
78 | First make a variable containing all programs separated by whitespace:: 79 | 80 | $ req_progs="bowtie2 bowtie2-build velveth velvetg parallel samtools shuffleSequences_fastq.pl Ray" 81 | $ echo $req_progs 82 | bowtie2 bowtie2-build velveth velvetg parallel samtools shuffleSequences_fastq.pl 83 | 84 | Now iterate over the variable ``req_progs`` and call which:: 85 | 86 | $ for p in $req_progs; do which $p || echo $p not in PATH; done 87 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/bowtie2 88 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/bowtie2-build 89 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/velveth 90 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/velvetg 91 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/parallel 92 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/samtools 93 | /proj/g2014113/metagenomics/virt-env/mg-workshop/bin/shuffleSequences_fastq.pl 94 | 95 | In Unix-like systems, a program that successfully completes its tasks should 96 | return a zero exit status. For the program ``which`` that is the case if the 97 | program is found. The ``||`` character does not mean *pipe the output onward* as 98 | you are probably familiar with (otherwise see 99 | http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-4.html), but checks whether the 100 | command before it exits successfully and executes the part behind it if not. 101 | 102 | If any of the required programs are missing, try to install them yourself or 103 | ask. If you are having trouble following these examples, try to find some bash 104 | tutorials online next time you have some time to kill. Educating yourself on 105 | how to use the command line effectively increases your productivity immensely. 106 | 107 | Some bash resources: 108 | 109 | - Excellent bash tutorial http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html 110 | - Blog post on pipes for NGS http://www.vincebuffalo.com/2013/08/08/the-mighty-named-pipe.html 111 | - Using bash and GNU parallel for NGS http://bit.ly/gwbash 112 | 113 | (Optional exercise) Install sickle by yourself 114 | =============================================== 115 | Follow these steps only if you want to install ``sickle`` by yourself. 116 | Installation procedures of research software often follow the same pattern. 117 | Download the code, *compile* it and copy the binary to a location in your 118 | ``$PATH``. The code for sickle is on https://github.com/najoshi/sickle. I 119 | prefer *compiling* my programs in ``~/src`` and then copying the resulting 120 | program to my ``~/bin`` directory, which is in my ``$PATH``. This should get 121 | you a long way:: 122 | 123 | mkdir -p ~/src 124 | 125 | # Go to the source directory and clone the sickle repository 126 | cd ~/src 127 | git clone https://github.com/najoshi/sickle 128 | cd sickle 129 | 130 | # Compile the program 131 | make 132 | 133 | # Create a bin directory 134 | mkdir -p ~/bin 135 | cp sickle ~/bin 136 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/genecoverage.rst: -------------------------------------------------------------------------------- 1 | ============================================== 2 | Determine gene coverage in metagenomic samples 3 | ============================================== 4 | Ok, now that we know what functions are represented in the combined samples (we 5 | could call it the Baltic meta-community, i.e.
a community of communities), we 6 | may want to know how much of the different functions (COG families and classes) 7 | are present in the different samples, since this will likely change between 8 | seasons. To do this we first map the reads from the different samples against 9 | the contigs. We will use the mapping script that we used this morning. First 10 | create a directory and cd there:: 11 | 12 | mkdir -p ~/metagenomics/cfa/map 13 | cd ~/metagenomics/cfa/map 14 | 15 | Copy the contig file and build an index on it for bowtie2:: 16 | 17 | cp /proj/g2014113/metagenomics/cfa/assembly/baltic-sea-ray-noscaf-41.1000.fa . 18 | bowtie2-build baltic-sea-ray-noscaf-41.1000.fa baltic-sea-ray-noscaf-41.1000.fa 19 | 20 | You will end up with various baltic-sea-ray-noscaf-41.1000.fa.*.bt2 files that 21 | represent the index for the assembly. It allows for faster alignment of 22 | sequences, see http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform 23 | for more information. 24 | 25 | Now we will use some crazy bash for loop to map all the reads. This actually 26 | prints the mapping command instead of executing, because it takes too much time 27 | to run it, it does however create the directories:: 28 | 29 | for s in /proj/g2014113/metagenomics/cfa/reads/*_R1.fastq.gz; do 30 | echo mkdir -p $(basename $s _R1.fastq.gz) 31 | echo cd $(basename $s _R1.fastq.gz) 32 | echo map-bowtie2-markduplicates.sh -ct 1 \ 33 | $s ${s/R1/R2} pair \ 34 | ~/metagenomics/cfa/map/baltic-sea-ray-noscaf-41.1000.fa asm \ 35 | bowtie2 36 | echo cd .. 37 | done 38 | 39 | The for loop iterates over all the first mates of the pairs. It then creates a 40 | directory using the basename of the pair with the directory part and the 41 | postfix removed, goes to that dir and runs ``map-bowtie2-markduplicates.sh`` on 42 | both mates of the pair. Try to change the for loop such that it only maps one 43 | sample. 44 | 45 | **Question: Can you think of an easy way to parallelize this?** 46 | .. Add an & after bowtie2 47 | 48 | For more examples on how to parallelize this check: http://bit.ly/gwbash 49 | 50 | If you sort of understand what's going on in the for loop above you are welcome 51 | to copy the data that we have already generated:: 52 | 53 | cp -r /proj/g2014113/metagenomics/cfa/map/* ~/metagenomics/cfa/map/ 54 | 55 | Take a look at the files that have been created. Check 56 | ``map-bowtie2-markduplicates.sh -h`` for an explanation of the different files. 57 | 58 | **Question what is the mean coverage for contig-394 in sample 0328?** 59 | 60 | .. 0 61 | 62 | Next we want to figure out the coverage for every gene in every contig per 63 | sample. We will use the bedtools coverage command within the BEDTools suite 64 | (https://code.google.com/p/bedtools/) that can parse a SAM/BAM file and a gff 65 | file to extract coverage information for every gene:: 66 | 67 | mkdir -p ~/metagenomics/cfa/coverage-hist-per-feature-per-sample 68 | cd ~/metagenomics/cfa/coverage-hist-per-feature-per-sample 69 | 70 | Run bedtools coverage on one sample (~4m):: 71 | 72 | for s in 0328; do 73 | bedtools coverage -hist -abam ~/metagenomics/cfa/map/$s/bowtie2/asm_pair-smds.bam \ 74 | -b ../prodigal/baltic-sea-ray-noscaf-41.1000.gff \ 75 | > $s-baltic-sea-ray-noscaf-41.1000.gff.coverage.hist 76 | done 77 | 78 | Copy the other ones:: 79 | 80 | cp /proj/g2014113/metagenomics/cfa/map/coverage-hist-per-feature-per-sample/* . 81 | 82 | Have a look at which files have been created with less again. 
The final four 83 | columns give you the histogram i.e. coverage, number of bases with that 84 | coverage, length of the contig/feature/gene, bases with that coverage expressed 85 | as a ratio of the length of the contig/feature/gene. 86 | 87 | Now what we want to is do is to extract the mean coverage per COG instead of 88 | per gene. Remember that multiple genes can belong to the same COG so we will 89 | take the sum of the mean coverage from those genes. We will use the script 90 | ``br-sum-mean-cov-per-cog.py`` for that. First make a directory 91 | again and go there:: 92 | 93 | mkdir -p ~/metagenomics/cfa/cog-sum-mean-cov 94 | cd ~/metagenomics/cfa/cog-sum-mean-cov 95 | 96 | The script expects a file with one samplename per line so we will create an 97 | array with those sample names 98 | (http://www.thegeekstuff.com/2010/06/bash-array-tutorial/):: 99 | 100 | samplenames=(0328 0403 0423 0531 0619 0705 0709 1001 1004 1028 1123) 101 | echo ${samplenames[*]} 102 | 103 | Now we can use process substitution to give the script those sample names 104 | without having to store it to a file first. 105 | 106 | **Question: What is the difference between the following statements?**:: 107 | 108 | echo ${samplenames[*]} 109 | cat <(echo ${samplenames[*]}) 110 | cat <(echo ${samplenames[*]} | tr ' ' '\n') 111 | 112 | .. First one just echoes 113 | second one concatenates the contents of the "file" with samplenames to stdout 114 | the last one adds newlines 115 | 116 | Run the the script that computes the sum of mean coverages per COG (~2m47):: 117 | 118 | br-sum-mean-cov-per-cog.py --samplenames <(echo ${samplenames[*]} | tr ' ' '\n') \ 119 | ../prodigal/baltic-sea-ray-noscaf-41.1000.gff ../prodigal/baltic-sea-ray-noscaf-41.1000.aa.fa \ 120 | ../wmga-cog/output.2 ../coverage-hist-per-feature-per-sample/*.gff.coverage.hist \ 121 | > cog-sum-mean-cov.tsv 122 | 123 | Have a look at the table with less -S again. 124 | 125 | **Question: What is the sum of mean coverages for COG0038 in sample 0423?** 126 | 127 | .. 1.8215488 128 | -------------------------------------------------------------------------------- /source/comparative-functional-analysis/compare.rst: -------------------------------------------------------------------------------- 1 | ================================================= 2 | Comparative functional analysis with R 3 | ================================================= 4 | Having this table one can use different statistical and visualisation software 5 | to analyse the results. One option would be to import a simpler version of the 6 | table into the program Fantom, a graphical user interface program developed for 7 | comparative analysis of metagenome data. You can try this in the end of the day 8 | if you have time. 9 | 10 | But here we will use the statistical programming language R to do some simple 11 | analysis. cd to the directory where you have the cog-sum-mean-cov.tsv file. 
12 | Then start R:: 13 | 14 | cd ~/metagenomics/cfa 15 | R 16 | 17 | and import the data:: 18 | 19 | tab_cog <- read.delim("cog-sum-mean-cov/cog-sum-mean-cov.tsv") 20 | 21 | Assign the different columns with descriptors to vectors of logical names:: 22 | 23 | cogf <- tab_cog[,1] # cog family 24 | cogfd <- tab_cog[,2] # cog family descriptor 25 | cogc <- tab_cog[,3] # cog class 26 | cogcd <- tab_cog[,4] # cog class descriptor 27 | 28 | Make a matrix with the coverages of the cog families:: 29 | 30 | cogf_cov <- as.matrix(tab_cog[,5:ncol(tab_cog)]) # coverage in the different samples 31 | 32 | And why not put sample names into a vector as well:: 33 | 34 | sample <- colnames(cogf_cov) 35 | sample 36 | 37 | Let’s clean the sample names a bit:: 38 | 39 | for (i in 1:length(sample)) { 40 | sample[i] <- matrix(unlist(strsplit(sample[i],"_")), 1)[1,4] 41 | } 42 | 43 | Since the coverages will differ depending on how many reads per sample we have 44 | we can normalise by dividing the coverages by the total coverage for the sample 45 | (only considering cog-annotated genes though):: 46 | 47 | for (i in 1:ncol(cogf_cov)) { 48 | cogf_cov[,i] <- cogf_cov[,i]/sum(cogf_cov[,i]) 49 | } 50 | 51 | The cogf_cov gives coverage per cog family. Let’s summarise within cog classes 52 | and make a separate matrix for that:: 53 | 54 | unique_cogc <- levels(cogc) 55 | cogc_cov <- matrix(ncol = length(sample), nrow = length(unique_cogc)) 56 | colnames(cogc_cov) <- sample 57 | rownames(cogc_cov) <- unique_cogc 58 | for (i in 1:length(unique_cogc)) { 59 | these <- grep(paste("^", unique_cogc[i],"$", sep = ""), cogc) 60 | for (j in 1:ncol(cogf_cov)) { 61 | cogc_cov[i,j] <- sum(cogf_cov[these,j]) 62 | } 63 | } 64 | 65 | 66 | OK, now let’s start playing with the data. We can for example do a pairwise 67 | plot of coverage of cog classes in sample1 vs. sample2:: 68 | 69 | plot(cogc_cov[,1], cogc_cov[,2]) 70 | 71 | or make a stacked barplot showing the different classes in the different 72 | samples:: 73 | 74 | barplot(cogf_cov, col = rainbow(100), border=NA) 75 | barplot(cogc_cov, col = rainbow(10), border=NA) 76 | 77 | The vegan package contains many nice functions for doing (microbial) ecology 78 | analysis. Load vegan:: 79 | 80 | install.packages("vegan") # not necessary if already installed 81 | library(vegan) 82 | 83 | If installing doesn't work for you have a look here 84 | http://www.stat.osu.edu/computer-support/mathstatistics-packages/installing-r-libraries-locally-your-home-directory 85 | 86 | We can calculate pairwise distances between the samples based on their 87 | functional composition. In ecology pairwise distance between samples is 88 | referred to as beta-diversity, although typically based on taxonomic 89 | composition rather than functional:: 90 | 91 | cogf_dist <- as.matrix(vegdist(t(cogf_cov), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE)) 92 | cogc_dist <- as.matrix(vegdist(t(cogc_cov), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE)) 93 | 94 | You can visualise the distance matrices as a heatmaps:: 95 | 96 | image(cogf_dist) 97 | image(cogc_dist) 98 | 99 | Are the distances calculated on the different functional levels correlated?:: 100 | 101 | plot(cogc_dist, cogf_dist) 102 | 103 | Now let’s cluster the samples based on the distances with hierarchical 104 | clustering. 
We use the function "agnes" in the "cluster" library and apply 105 | average linkage clustering:: 106 | 107 | install.packages("cluster") # not necessary if already installed 108 | library(cluster) 109 | 110 | cluster <- agnes(cogf_dist, diss = TRUE, method = "average") 111 | plot(cluster, which.plots = 2, hang = -1, label = sample, main = "", axes = FALSE, xlab = "", ylab = "", sub = "") 112 | 113 | Alternatively you can use the function heatmap, which calculates distances both 114 | between samples and between features and clusters in two dimensions:: 115 | 116 | heatmap(cogf_dist, scale = "none") 117 | heatmap(cogc_dist, scale = "none") 118 | 119 | And let’s ordinate the data in two dimensions. This can be done e.g. by PCA 120 | based on the actual coverage values, or by e.g. PCoA or NMDS (non-metric 121 | multidimensional scaling). Let's do NMDS:: 122 | 123 | mds <- metaMDS(cogf_dist) 124 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 2) 125 | 126 | We can color the samples according to date (provided your samples are ordered 127 | according to date). There are some nice color scales to choose from here 128 | http://colorbrewer2.org/:: 129 | 130 | install.packages("RColorBrewer") # not necessary if already installed 131 | library(RColorBrewer) 132 | color = brewer.pal(length(sample), "Reds") # or select another color scale! 133 | 134 | mds <- metaMDS(cogf_dist) 135 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 136 | 137 | Let’s compare with how it looks if we base the clustering on COG class coverage 138 | instead:: 139 | 140 | mds <- metaMDS(cogc_dist) 141 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 142 | 143 | In addition to these examples there are of course infinite ways to analyse the 144 | results in R. One could for instance find COGs that significantly differ in 145 | abundance between samples, do different types of correlations between metadata 146 | (nutrients, temperature, etc) and functions, etc. Leave your R window open, 147 | since we will compare these results with taxonomic data in a bit. 148 | -------------------------------------------------------------------------------- /mako/templates/binning/concoct.rst: -------------------------------------------------------------------------------- 1 | ========================================================== 2 | CONCOCT - Clustering cONtigs with COverage and ComposiTion 3 | ========================================================== 4 | In this exercise you will learn how to use a new software package for automatic and unsupervised binning of metagenomic contigs, called CONCOCT. 5 | CONCOCT uses a statistical model called a Gaussian Mixture Model to cluster sequences based on their tetranucleotide frequencies and their average coverage over multiple samples. 6 | 7 | The theory behind using the coverage pattern is that sequences having a similar coverage pattern over multiple samples are likely to belong to the same species. 8 | Species having a similar abundance pattern in the samples can hopefully be separated by the tetranucleotide frequencies. 9 | 10 | We will be working with an assembly made using only the reads from this single sample, but since CONCOCT is constructed to be run using the coverage profile over multiple samples, we'll be investigating how the performance is affected if we add several other samples.
11 | This is done by mapping the reads from the other samples to the contigs resulting from this single sample assembly. 12 | 13 | 14 | Getting to know the test data set 15 | ================================= 16 | Today we'll be working on a metagenomic data set from the Baltic Sea. 17 | The sample we'll be using is part of a time series study, where the same location has been sampled twice weekly during 2013. This specific sample was taken March 22. 18 | 19 | Start by copying the contigs to your working directory:: 20 | 21 | ${'\n '.join(commands['copy_dataset'])} 22 | 23 | You should now have one fasta file containing all contigs, in this case only contigs longer than 1000 bases are included to save space, and one comma separated file containing the coverage profiles for each contig. 24 | Let's have a look at the coverage profiles:: 25 | 26 | ${commands['browse_coverage']} 27 | 28 | Try to find the column corresponding to March 22 and compare this column to the other ones. Can you draw any conclusions from this comparison? 29 | 30 | We'd like to first run concoct using only one sample, so we remove all other columns in the coverage table to create this new coverage file:: 31 | 32 | ${commands['cut_coverage']} 33 | 34 | Running CONCOCT 35 | =============== 36 | CONCOCT takes a number of parameters that you got a glimpse of earlier by running:: 37 | 38 | ${commands['check_activate']['concoct']} 39 | 40 | The contigs will be input as the composition file and the coverage file obviously as the coverage file. The output path is given as the -b (--basename) parameter, where it is important to include a trailing slash if we want to create an output directory containing all result files. 41 | Last but not least we will set the length threshold to 3000 to speed up the clustering (the fewer contigs we use, the shorter the runtime):: 42 | 43 | ${'\n '.join(commands['run_concoct']['one_sample'])} 44 | 45 | This command will normally take a couple of minutes to finish. When it is done, check the output directory and try to figure out what the different files contain. 46 | Especially, have a look at the main output file:: 47 | 48 | ${commands['run_concoct']['look_clustering']} 49 | 50 | This file gives you the cluster id for each contig that was included in the clustering, in this case all contigs longer than 3000 bases. 51 | 52 | For the comparison we will now run concoct again, using the coverage profile over all samples in the time series:: 53 | 54 | ${commands['run_concoct']['all_samples']} 55 | 56 | Have a look at the output from this clustering as well; do you notice anything different? 57 | 58 | Evaluating Clustering Results 59 | ============================= 60 | One way of evaluating the resulting clusters is to look at the distribution of so-called Single Copy Genes (SCGs), genes that are present in all bacteria and archaea in only one copy. 61 | With this background, a complete and correct bin should have exactly one copy of each gene present, while missing genes indicate an incomplete bin and several copies of the same gene indicate a chimeric cluster. 62 | To predict genes in prokaryotes, we use Prodigal; the predicted genes are then used as query sequences for an RPS-BLAST search against the Clusters of Orthologous Groups (COG) database. 63 | This RPS-BLAST search takes about an hour and a half for our dataset so we're going to use a precomputed result file.
64 | Copy this result file along with two files necessary for the COG counting scripts:: 65 | 66 | ${'\n '.join(commands['copy_blast'])} 67 | 68 | Before moving on, we need to install some R packages. Please run these commands line by line:: 69 | 70 | R 71 | install.packages("ggplot2") 72 | install.packages("reshape") 73 | install.packages("getopt") 74 | q() 75 | 76 | With the CONCOCT distribution come scripts for parsing this output and creating a plot where each COG present in the data is grouped according to the clustering results, namely COG_table.py and COGPlot.R. These scripts are added to the virtual environment; try checking out their usage:: 77 | 78 | ${'\n '.join(commands['check_cog_scripts'])} 79 | 80 | Let's first create a plot for the single sample run:: 81 | 82 | ${'\n '.join(commands['cogplot_single'])} 83 | 84 | This command might not work for some R-related reason. If you've spent more time trying to get it to work than you wish to, just copy the results from the workshop directory:: 85 | 86 | cp /proj/g2014113/nobackup/concoct-workshop/cogplots/* ~/binning-workshop/ 87 | 88 | 89 | This command should have created a pdf file with your plot. In order to look at it, you can download it to your personal computer with scp. Note! You need to run this in a separate terminal window where you are not logged in to Uppmax:: 90 | 91 | ${commands['download_single_cogplot']} 92 | 93 | Have a look at the plot and try to figure out if the clustering was successful or not. Which clusters are good? Which clusters are bad? Are all clusters present in the plot? 94 | Now, let's do the same thing for the multiple samples run:: 95 | 96 | ${'\n '.join(commands['cogplot_multiple'])} 97 | 98 | And download again from your separate terminal window:: 99 | 100 | ${commands['download_multiple_cogplot']} 101 | 102 | What differences can you observe for these plots? Think about how we were able to use samples not included in the assembly in order to create a different clustering result. Can this be done with any samples? 103 | -------------------------------------------------------------------------------- /source/comparative-taxonomic-analysis/compare.rst: -------------------------------------------------------------------------------- 1 | ======================================== 2 | Comparative taxonomic analysis with R 3 | ======================================== 4 | KRONA is not very good for comparing multiple samples though. Instead we will 5 | use R, as we did for the functional data. First combine the data from the different 6 | samples with the script sum_rdp_annot.pl (made by us) into a table. By default 7 | the script uses a bootstrap support of 0.80 to include a taxonomic level (this 8 | can be changed easily by changing the number on row 16 in the script). You may 9 | need to change the input file names in the beginning of the script ($in_file[0] 10 | = ...).
The script will only import the 16S bacterial data:: 11 | 12 | cd ~/metagenomics/cta/rdp 13 | sum_rdp_annot.pl > summary.rrna.silva-bac-16s-database-id85.fasta.class.0.80.tsv 14 | 15 | Let’s import this table into R:: 16 | 17 | tab_tax <- read.delim("summary.rrna.silva-bac-16s-database-id85.fasta.class.0.80.tsv") 18 | 19 | And assign the descriptor column to a vector:: 20 | 21 | tax <- tab_tax[,1] 22 | 23 | And put the counts into a matrix:: 24 | 25 | tax_counts <- tab_tax[,2:ncol(tab_tax)] # counts of taxa in the different samples 26 | 27 | Since you will compare this dataset with the functional dataset you generated 28 | before it's great if the samples come in the same order. Check the previous 29 | order:: 30 | 31 | sample 32 | 33 | And the current order:: 34 | 35 | colnames(tax_counts) 36 | 37 | if they are not in the same order contact an assistant for help. Otherwise:: 38 | 39 | colnames(tax_counts) <- sample 40 | rownames(tax_counts) <- tax 41 | 42 | Make a normalised version of tax_counts:: 43 | 44 | norm_tax_counts <- tax_counts 45 | for (i in 1:ncol(tax_counts)) { 46 | norm_tax_counts[,i] <- tax_counts[,i]/sum(tax_counts[,i]) 47 | } 48 | 49 | What different taxa do we have there?:: 50 | 51 | tax 52 | 53 | From the tax_counts matrix we can create new matrices at defined taxonomic 54 | levels. If you open the text file /proj/g2013206/metagenomics/r_commands.txt 55 | you can copy and paste all of this code into R (or use the source command) and 56 | this will give you the matrices and vectors below (check carefully that you 57 | didn’t get any error messages!):: 58 | 59 | phylum_counts matrix with counts for different phyla 60 | norm_phylum_counts normalised version of phylum_count 61 | phylum vector with phyla (same order as in phyla matrix) 62 | 63 | class_counts matrix with counts for different classes 64 | norm_class_counts normalised version of class_count 65 | class vector with classes 66 | 67 | phylumclass_counts matrix with counts for different phyla and proteobacteria classes 68 | norm_phylumclass_counts normalised version of phylumclass_count 69 | phylumclass vector with phyla and proteobacteria classes 70 | 71 | The “other” clade in each of the above sums reads that were not classified at 72 | the defined level. Having these more well defined matrices we can compare the 73 | taxonomic composition in the samples. We can apply the commands that we did for 74 | the functional analysis:: 75 | 76 | library(vegan) 77 | library(RColorBrewer) 78 | 79 | Barplots:: 80 | 81 | par(mar=c(1,2,1,22)) # Increase the MARgins, to make space for a legend 82 | 83 | barplot(norm_phylum_counts, col = rainbow(11), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+25, y=1, adj=c(0,0))) 84 | barplot(norm_phylumclass_counts, col = rainbow(15), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+30, y=1, adj=c(0,0))) 85 | barplot(norm_class_counts, col = rainbow(18), legend.text=TRUE, args.legend=list(x=ncol(norm_phylum_counts)+32, y=1, adj=c(0,0))) 86 | 87 | If you can't see the legends, they're just in a bad position. Try altering the x and y parameters in the args.legend. 
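For example, here is a minimal sketch of such a tweak that just pushes the legend further to the right and slightly down; it assumes the ``norm_class_counts`` matrix from above is still loaded, and the exact offsets are only guesses that you will want to adjust for your own plot window::

    # shift the legend: a larger x moves it right, a smaller y moves it down
    barplot(norm_class_counts, col = rainbow(18), legend.text = TRUE,
            args.legend = list(x = ncol(norm_class_counts) + 40, y = 0.8, adj = c(0, 0)))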
88 | 89 | Calculate beta-diversity based on class-level taxonomic counts:: 90 | 91 | class_dist <- as.matrix(vegdist(t(norm_class_counts[1:(nrow(norm_class_counts) - 1),]), method="bray", binary=FALSE, diag=TRUE, upper=TRUE, na.rm = FALSE)) 92 | 93 | Note that by "[1:(nrow(norm_class_counts) - 1),]" we exclude the last row in 94 | norm_class_counts when we calculate the distances because this is the "others" 95 | column that contains all kinds of unclassified taxa. 96 | 97 | Hierarchical clustering:: 98 | 99 | library(cluster) 100 | cluster <- agnes(class_dist, diss = TRUE, method = "average") 101 | plot(cluster, which.plots = 2, hang = -1) 102 | 103 | Heatmaps with clusterings:: 104 | 105 | heatmap(norm_class_counts, scale = "none") 106 | heatmap(norm_phylumclass_counts, scale = "none") 107 | 108 | And ordinate the data by NMDS:: 109 | 110 | color = brewer.pal(length(sample), "Reds") 111 | mds <- metaMDS(class_dist) 112 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 113 | 114 | Does the pattern look similar as that obtained by functional data?:: 115 | 116 | mds <- metaMDS(cogc_dist) 117 | plot(mds$points[,1], mds$points[,2], pch = 20, xlab = "NMDS1", ylab = "NMDS2", cex = 5, col = color) 118 | 119 | We can actually check how beta diversity generated by the two approaches is 120 | correlated:: 121 | 122 | plot(cogf_dist, class_dist) 123 | cor.test(cogf_dist, class_dist) 124 | 125 | (For comparing matrices it is common to use a mantel test, but the r-value (but 126 | not the p-value) is in fact the same.) 127 | 128 | Finally, let’s check how alpha-diversity fluctuates over the year and compares 129 | between taxonomic and functional data. Since alpha-diversity is influenced by 130 | sample size it is advisable to subsample the datasets to the same number of 131 | reads. We can make a subsampled table using the vegan function rrarefy:: 132 | 133 | sub_class_counts <- t(rrarefy(t(class_counts), 100)) 134 | 135 | This will be difficult to achieve for the functional data at this point, 136 | however, so let’s skip that for the functional data. 137 | 138 | Let’s use Shannon diversity index since this is pretty insensitive to sample 139 | size. Shannon index combines richness (number of species) and evenness (how 140 | evenly the species are distributed); many, evenly distributed species gives a 141 | high Shannon. There is a vegan function for getting shannon:: 142 | 143 | class_shannon <- diversity(class_counts[1:(nrow(norm_class_counts) - 1),], MARGIN = 2) 144 | sub_class_shannon <- diversity(sub_class_counts[1:(nrow(norm_class_counts) - 1),], MARGIN = 2) 145 | cogf_shannon <- diversity(cogf_cov, MARGIN = 2) 146 | 147 | How does subsampling influence shannon?:: 148 | 149 | plot(class_shannon, sub_class_shannon) 150 | 151 | Is functional and taxonomic shannon correlated?:: 152 | 153 | plot(sub_class_shannon, cogf_shannon) 154 | cor.test(sub_class_shannon, cogf_shannon) 155 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 
5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = build 9 | MILOU_EXPORT = /proj/g2014113/webexport/ 10 | 11 | # User-friendly check for sphinx-build 12 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 13 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 14 | endif 15 | 16 | # Internal variables. 17 | PAPEROPT_a4 = -D latex_paper_size=a4 18 | PAPEROPT_letter = -D latex_paper_size=letter 19 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source 20 | # the i18n builder cannot share the environment and doctrees with the others 21 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source 22 | 23 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 24 | 25 | help: 26 | @echo "Please use \`make ' where is one of" 27 | @echo " html to make standalone HTML files" 28 | @echo " dirhtml to make HTML files named index.html in directories" 29 | @echo " singlehtml to make a single large HTML file" 30 | @echo " pickle to make pickle files" 31 | @echo " json to make JSON files" 32 | @echo " htmlhelp to make HTML files and a HTML help project" 33 | @echo " qthelp to make HTML files and a qthelp project" 34 | @echo " devhelp to make HTML files and a Devhelp project" 35 | @echo " epub to make an epub" 36 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 37 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 38 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 39 | @echo " text to make text files" 40 | @echo " man to make manual pages" 41 | @echo " texinfo to make Texinfo files" 42 | @echo " info to make Texinfo files and run them through makeinfo" 43 | @echo " gettext to make PO message catalogs" 44 | @echo " changes to make an overview of all changed/added/deprecated items" 45 | @echo " xml to make Docutils-native XML files" 46 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 47 | @echo " linkcheck to check all external links for integrity" 48 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 49 | 50 | clean: 51 | rm -rf $(BUILDDIR)/* 52 | 53 | milou: 54 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(MILOU_EXPORT) 55 | @echo 56 | @echo "Build finished. The HTML pages are in $(MILOU_EXPORT)" 57 | 58 | html: 59 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 60 | @echo 61 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 62 | 63 | dirhtml: 64 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 65 | @echo 66 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 67 | 68 | singlehtml: 69 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 70 | @echo 71 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 72 | 73 | pickle: 74 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 75 | @echo 76 | @echo "Build finished; now you can process the pickle files." 77 | 78 | json: 79 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 80 | @echo 81 | @echo "Build finished; now you can process the JSON files." 
82 | 83 | htmlhelp: 84 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 85 | @echo 86 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 87 | ".hhp project file in $(BUILDDIR)/htmlhelp." 88 | 89 | qthelp: 90 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 91 | @echo 92 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 93 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 94 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/MetagenomicsWorkshopSciLifeLab.qhcp" 95 | @echo "To view the help file:" 96 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/MetagenomicsWorkshopSciLifeLab.qhc" 97 | 98 | devhelp: 99 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 100 | @echo 101 | @echo "Build finished." 102 | @echo "To view the help file:" 103 | @echo "# mkdir -p $$HOME/.local/share/devhelp/MetagenomicsWorkshopSciLifeLab" 104 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/MetagenomicsWorkshopSciLifeLab" 105 | @echo "# devhelp" 106 | 107 | epub: 108 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 109 | @echo 110 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 111 | 112 | latex: 113 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 114 | @echo 115 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 116 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 117 | "(use \`make latexpdf' here to do that automatically)." 118 | 119 | latexpdf: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through pdflatex..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | latexpdfja: 126 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 127 | @echo "Running LaTeX files through platex and dvipdfmx..." 128 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 129 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 130 | 131 | text: 132 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 133 | @echo 134 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 135 | 136 | man: 137 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 138 | @echo 139 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 140 | 141 | texinfo: 142 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 143 | @echo 144 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 145 | @echo "Run \`make' in that directory to run these through makeinfo" \ 146 | "(use \`make info' here to do that automatically)." 147 | 148 | info: 149 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 150 | @echo "Running Texinfo files through makeinfo..." 151 | make -C $(BUILDDIR)/texinfo info 152 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 153 | 154 | gettext: 155 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 156 | @echo 157 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 158 | 159 | changes: 160 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 161 | @echo 162 | @echo "The overview file is in $(BUILDDIR)/changes." 163 | 164 | linkcheck: 165 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 166 | @echo 167 | @echo "Link check complete; look for any errors in the above output " \ 168 | "or in $(BUILDDIR)/linkcheck/output.txt." 
169 | 
170 | doctest:
171 | 	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
172 | 	@echo "Testing of doctests in the sources finished, look at the " \
173 | 	      "results in $(BUILDDIR)/doctest/output.txt."
174 | 
175 | xml:
176 | 	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
177 | 	@echo
178 | 	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
179 | 
180 | pseudoxml:
181 | 	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
182 | 	@echo
183 | 	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
184 | 
--------------------------------------------------------------------------------
/source/annotation/metaxa2.rst:
--------------------------------------------------------------------------------
1 | ==================================================================
2 | Bonus exercise: Using Metaxa2 to investigate the taxonomic content
3 | ==================================================================
4 | Now that we are familiar with using R, we can use it to go through
5 | another type of output generated by the Metaxa2 software. Metaxa2
6 | classifies rRNA reads at different taxonomic levels, assigning each read a
7 | taxonomic affiliation only if it can do so reliably (given the conservation
8 | between taxa, etc.). You can read more about Metaxa2 here:
9 | http://microbiology.se/software/metaxa2
10 | 
11 | Install Metaxa2
12 | ===============
13 | For this exercise to work, you need Metaxa2 installed. If you did not do this
14 | earlier, here are the installation instructions again. If you did install
15 | Metaxa2 at the beginning of the workshop, you can skip this step and move
16 | straight to the next heading!
17 | 
18 | The code for Metaxa2 is available from http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz
19 | You can install Metaxa2 as follows::
20 | 
21 |     # Create a src and a bin directory
22 |     mkdir -p ~/src
23 |     mkdir -p ~/bin
24 | 
25 |     # Go to the source directory and download the Metaxa2 tarball
26 |     cd ~/src
27 |     wget http://microbiology.se/sw/Metaxa2_2.0rc3.tar.gz
28 |     tar -xzvf Metaxa2_2.0rc3.tar.gz
29 |     cd Metaxa2_2.0rc3
30 | 
31 |     # Run the installation script
32 |     ./install_metaxa2
33 | 
34 |     # Try to run Metaxa2 (this should bring up the main options for the software)
35 |     metaxa2 -h
36 | 
37 | If this did not work, you can try this manual approach::
38 | 
39 |     cd ~/src/Metaxa2_2.0rc3
40 |     cp -r metaxa2* ~/bin/
41 | 
42 |     # Then try to run Metaxa2 again
43 |     metaxa2 -h
44 | 
45 | If this brings up the help message, you are all set!
46 | 
47 | 
48 | Generating family level taxonomic counts
49 | ========================================
50 | 
51 | If you have already run Metaxa2 to get the number of 16S rRNA sequences,
52 | you can use the output of those runs. Otherwise you need to run the
53 | following command on all the raw read data from all libraries::
54 | 
55 |     metaxa2 -i <input fastq file> -o <output name> --cpu 16 --align none
56 | 
57 | To get counts on the family level from the metaxa2 output, we will use
58 | another tool bundled with the Metaxa2 package: the Metaxa2 Taxonomic
59 | Traversal Tool (``metaxa2_ttt``). Take a look at its options by typing::
60 | 
61 |     metaxa2_ttt -h
62 | 
63 | In this exercise we are interested in bacterial counts only, so we will
64 | use the ``-t b`` option. Since we are only interested in family abundance
65 | (we have too little data to get any good genus or species counts), we will
66 | only output the data for phyla, classes, orders and families, that is, we
67 | will use the ``-m 5`` option.
As input files, you should use the files
68 | ending with ".taxonomy.txt" that Metaxa2 produced as output. That should
69 | give you a command looking like this::
70 | 
71 |     metaxa2_ttt -i <taxonomy.txt file> -o <output name> -m 5 -t b
72 | 
73 | Run this command on the taxonomy.txt files from all input libraries. It
74 | should be really quick. If you type ``ls`` you will notice that ``metaxa2_ttt``
75 | produced a bunch of .level_X.txt files. Those are the files we are going
76 | to work with next.
77 | 
78 | Visualizing family level taxonomic counts
79 | =========================================
80 | To visualize the family level counts, we will once again use R. Fire it
81 | up again and load in the count tables from Metaxa2::
82 | 
83 |     R
84 |     b1_fam = read.table("baltic1.level_5.txt", sep = "\t", row.names = 1)
85 | 
86 | Repeat this procedure for all four data sets. If you saved your workspace,
87 | the merge_four function should still be available. You can try it out on the
88 | taxonomic counts::
89 | 
90 |     all_fam = merge_four(b1_fam,b2_fam,swe_fam,ind_fam,c("Baltic 1","Baltic 2","Sweden", "India"))
91 | 
92 | Let's load in the ``gplots`` library again, and make a heatmap of the raw
93 | data::
94 | 
95 |     library(gplots)
96 |     heatmap.2(all_fam, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
97 | 
98 | As you will notice, we will need to tweak the margins a bit to fit in the taxonomic names::
99 | 
100 |     heatmap.2(all_fam, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,30), cexCol = 1, cexRow = 0.7)
101 | 
102 | 
103 | Apply normalizations
104 | ====================
105 | As you might already have guessed, taxonomic count data suffers from the same
106 | biases (for example, differences in sequencing library size) as other gene
107 | count data. To account for that, we will apply a normalization procedure.
108 | Please note that normalization methods 2 and 3 (number of 16S and number of
109 | total matches to the database) would in this case be the same. In other words,
110 | they both yield the relative abundances of the taxa. We will therefore only
111 | look at two normalization procedures in this part of the lab.
112 | 
113 | First, we will normalize to the number of reads in each sequencing library.
114 | Find the note you have taken on the data set sizes. Then apply a command like
115 | this on the data::
116 | 
117 |     b1_fam_norm1 = b1_fam / 118025 * 1000000
118 | 
119 | That will give you the 16S rRNA counts for the different families per million
120 | reads. Do the same thing for the other data sets, and merge the results with ``merge_four`` into a variable called, for example, ``fam_norm1``.
121 | 
122 | Next, we will do the same for the other type of normalization, the division
123 | by the mapped number of reads/total number of 16S rRNA. This can, once more,
124 | be done by dividing the vector by its sum::
125 | 
126 |     b1_fam_norm2 = b1_fam / sum(b1_fam)
127 | 
128 | Follow the above procedure for all the data sets, and store the final
129 | result from ``merge_four`` into a variable, for example called ``fam_norm2``.
130 | 
131 | Comparing taxonomic distributions
132 | =================================
133 | 
134 | Next we will compare the taxonomic composition of the four environments.
135 | Let's start out by just using a barplot. To get the different taxa on
136 | the x-axis, we will transform the matrix with normalized counts using the
137 | ``t()`` command.
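If you have not used ``t()`` before, here is a quick toy example (a minimal sketch; the matrix and its row and column names are made up for illustration) showing that it simply swaps rows and columns::

    m = matrix(1:6, nrow = 2,
               dimnames = list(c("sample1", "sample2"), c("famA", "famB", "famC")))
    m      # 2 rows (samples) x 3 columns (families)
    t(m)   # 3 rows (families) x 2 columns (samples)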
But first we need to set the margins to fit the taxonomic
138 | names::
139 | 
140 |     par(mar = c(25, 4, 4, 2))
141 |     barplot(t(fam_norm1), main = "Counts per million reads", las = 2, cex.names = 0.6, beside = TRUE)
142 | 
143 | We can then do the same for the relative abundances::
144 | 
145 |     barplot(t(fam_norm2), main = "Relative abundance", las = 2, cex.names = 0.6, beside = TRUE)
146 | 
147 | To only look at families present in at least two samples, we can use the
148 | following command for filtering::
149 | 
150 |     fam_norm1_filter = fam_norm1[rowSums(fam_norm1 > 0) >= 2,]
151 |     barplot(t(fam_norm1_filter), main = "Counts per million reads", las = 2, cex.names = 0.6, beside = TRUE)
152 | 
153 | 
154 | **Question: Which normalization method would be most suitable to use in this case? Why?**
155 | 
156 | 
157 | We can also look at the differences in taxonomic content using a heatmap. As before,
158 | we will use the square root as a variance stabilizing transform::
159 | 
160 |     heatmap.2(sqrt(fam_norm1), trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(30,10), cexCol = 1, cexRow = 0.7)
161 | 
162 | Finally, we can of course also use PCA on the taxonomic abundances. We will turn back to the
163 | ``prcomp`` PCA command::
164 | 
165 |     fam_norm1_pca = prcomp(sqrt(fam_norm1))
166 | 
167 | We can visualize the PCA using the ``biplot`` command::
168 | 
169 |     biplot(fam_norm1_pca, cex = 0.5)
170 | 
171 | To see the proportion of variance explained by the different components, we can use the
172 | normal plot command::
173 | 
174 |     plot(fam_norm1_pca)
175 | 
176 | **Question: Can you think of any other type of problem with the data we are using now?
177 | This problem applies to both kinds of data, but should be particularly problematic for
178 | the taxonomic counts...**
179 | 
180 | 
--------------------------------------------------------------------------------
/source/binning/concoct.rst:
--------------------------------------------------------------------------------
1 | ==========================================================
2 | CONCOCT - Clustering cONtigs with COverage and ComposiTion
3 | ==========================================================
4 | In this exercise you will learn how to use new software for automatic and unsupervised binning of metagenomic contigs, called CONCOCT.
5 | CONCOCT uses a statistical model called a Gaussian Mixture Model to cluster sequences based on their tetranucleotide frequencies and their average coverage over multiple samples.
6 | 
7 | The theory behind using the coverage pattern is that sequences with similar coverage patterns over multiple samples are likely to belong to the same species.
8 | Species with similar abundance patterns in the samples can hopefully be separated by their tetranucleotide frequencies.
9 | 
10 | We will be working with an assembly made using only the reads from this single sample, but since CONCOCT is constructed to be run using the coverage profile over multiple samples, we'll investigate how the performance is affected if we add several other samples.
11 | This is done by mapping the reads from the other samples to the contigs resulting from this single sample assembly.
12 | 
13 | 
14 | Getting to know the test data set
15 | =================================
16 | Today we'll be working on a metagenomic data set from the Baltic Sea.
17 | The sample we'll be using is part of a time series study, where the same location has been sampled twice weekly during 2013.
This specific sample was taken March 22.
18 | 
19 | Start by copying the contigs to your working directory::
20 | 
21 |     mkdir -p ~/binning-workshop
22 |     mkdir -p ~/binning-workshop/data
23 |     cd ~/binning-workshop/
24 |     cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000.fa data/
25 |     cp /proj/g2014113/nobackup/concoct-workshop/120322_coverage_nolen.tsv data/
26 | 
27 | You should now have one fasta file containing all contigs (in this case only contigs longer than 1000 bases are included, to save space) and one tab-separated file containing the coverage profiles for each contig.
28 | Let's have a look at the coverage profiles::
29 | 
30 |     less -S ~/binning-workshop/data/120322_coverage_nolen.tsv
31 | 
32 | Try to find the column corresponding to March 22 and compare this column to the other ones. Can you draw any conclusions from this comparison?
33 | 
34 | We'd like to first run concoct using only one sample, so we remove all other columns in the coverage table to create this new coverage file::
35 | 
36 |     cut -f1,3 ~/binning-workshop/data/120322_coverage_nolen.tsv > ~/binning-workshop/data/120322_coverage_one_sample.tsv
37 | 
38 | Running CONCOCT
39 | ===============
40 | CONCOCT takes a number of parameters that you got a glimpse of earlier by running::
41 | 
42 |     concoct -h
43 | 
44 | The contigs will be input as the composition file and the coverage file, obviously, as the coverage file. The output path is given as the -b (--basename) parameter, where it is important to include a trailing slash if we want to create an output directory containing all result files.
45 | Last but not least we will set the length threshold to 3000 to speed up the clustering (the fewer contigs we use, the shorter the runtime)::
46 | 
47 |     mkdir -p ~/binning-workshop/concoct_output
48 |     concoct --coverage_file ~/binning-workshop/data/120322_coverage_one_sample.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_one_sample/
49 | 
50 | This command will normally take a couple of minutes to finish. When it is done, check the output directory and try to figure out what the different files contain.
51 | Especially, have a look at the main output file::
52 | 
53 |     less ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv
54 | 
55 | This file gives you the cluster id for each contig that was included in the clustering, in this case all contigs longer than 3000 bases.
56 | 
57 | For the comparison we will now run concoct again, using the coverage profile over all samples in the time series::
58 | 
59 |     concoct --coverage_file ~/binning-workshop/data/120322_coverage_nolen.tsv --composition_file ~/binning-workshop/data/Contigs_gt1000.fa -l 3000 -b ~/binning-workshop/concoct_output/3000_all_samples/
60 | 
61 | Have a look at the output from this clustering as well; do you notice anything different?
62 | 
63 | Evaluating Clustering Results
64 | =============================
65 | One way of evaluating the resulting clusters is to look at the distribution of so-called Single Copy Genes (SCGs), genes that are present in all bacteria and archaea in exactly one copy.
66 | With this background, a complete and correct bin should have exactly one copy of each such gene present, while missing genes indicate an incomplete bin and several copies of the same gene indicate a chimeric cluster.
67 | To predict (prokaryotic) genes on the contigs we use Prodigal, and the predicted genes are then used as query sequences for an RPS-BLAST search against the Clusters of Orthologous Groups (COG) database.
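To make this idea concrete, here is a small toy illustration in R (the clusters and counts below are invented for the example, they are not from our data). Given a matrix of SCG counts per cluster, completeness is the fraction of SCGs seen at least once and purity the fraction seen exactly once::

    # rows = clusters, columns = single copy genes (made-up counts)
    scg = matrix(c(1, 1, 1, 1,
                   1, 0, 1, 1,
                   2, 1, 3, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("good_bin", "incomplete_bin", "chimeric_bin"),
                                 paste0("COG", 1:4)))
    rowSums(scg >= 1) / ncol(scg)   # completeness: 1.00, 0.75, 1.00
    rowSums(scg == 1) / ncol(scg)   # purity: highest for the good bin, lowest for the chimeric bin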
68 | This RPS-BLAST search takes about an hour and a half for our dataset, so we're going to use a precomputed result file.
69 | Copy this result file along with two files necessary for the COG counting scripts::
70 | 
71 |     cp /proj/g2014113/nobackup/concoct-workshop/Contigs_gt1000_blast.out ~/binning-workshop/data/
72 |     cp /proj/g2014113/nobackup/concoct-workshop/scg_cogs_min0.97_max1.03_unique_genera.txt ~/binning-workshop/data/
73 |     cp /proj/g2014113/nobackup/concoct-workshop/cdd_to_cog.tsv ~/binning-workshop/data/
74 | 
75 | Before moving on, we need to install some R packages. Please run these commands line by line::
76 | 
77 |     R
78 |     install.packages("ggplot2")
79 |     install.packages("reshape")
80 |     install.packages("getopt")
81 |     q()
82 | 
83 | The CONCOCT distribution comes with two scripts, COG_table.py and COGPlot.R, for parsing this output and creating a plot where each COG present in the data is grouped according to the clustering results. These scripts are added to the virtual environment; check out their usage::
84 | 
85 |     COG_table.py -h
86 |     COGPlot.R -h
87 | 
88 | Let's first create a plot for the single sample run::
89 | 
90 |     COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_single_sample.tsv
91 |     COGPlot.R -s ~/binning-workshop/cog_table_3000_single_sample.tsv -o ~/binning-workshop/cog_plot_3000_single_sample.pdf
92 | 
93 | These commands might not work for some R-related reason. If you have spent more time trying to get them to work than you care to, just copy the results from the workshop directory::
94 | 
95 |     cp /proj/g2014113/nobackup/concoct-workshop/cogplots/* ~/binning-workshop/
96 | 
97 | 
98 | This should have created a pdf file with your plot. In order to look at it, you can download it to your personal computer with scp. Note that you need to run this in a separate terminal window where you are not logged in to Uppmax::
99 | 
100 |     scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_single_sample.pdf ~/Desktop/
101 | 
102 | Have a look at the plot and try to figure out if the clustering was successful or not. Which clusters are good? Which clusters are bad? Are all clusters present in the plot?
103 | Now, let's do the same thing for the multiple samples run::
104 | 
105 |     COG_table.py -b ~/binning-workshop/data/Contigs_gt1000_blast.out -c ~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv -m ~/binning-workshop/data/scg_cogs_min0.97_max1.03_unique_genera.txt --cdd_cog_file ~/binning-workshop/data/cdd_to_cog.tsv > ~/binning-workshop/cog_table_3000_all_samples.tsv
106 |     COGPlot.R -s ~/binning-workshop/cog_table_3000_all_samples.tsv -o ~/binning-workshop/cog_plot_3000_all_samples.pdf
107 | 
108 | And download again from your separate terminal window::
109 | 
110 |     scp username@milou.uppmax.uu.se:~/binning-workshop/cog_plot_3000_all_samples.pdf ~/Desktop
111 | 
112 | What differences can you observe for these plots? Think about how we were able to use samples not included in the assembly in order to create a different clustering result. Can this be done with any samples?
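If you want to dig a little deeper, the two clusterings can also be compared directly in R. This is only a suggested sketch; it assumes the clustering files are plain two-column contig,cluster tables without a header row (adjust the ``header`` argument if your CONCOCT version writes one)::

    clust_one = read.csv("~/binning-workshop/concoct_output/3000_one_sample/clustering_gt3000.csv", header = FALSE)
    clust_all = read.csv("~/binning-workshop/concoct_output/3000_all_samples/clustering_gt3000.csv", header = FALSE)
    # how many clusters were formed in each run?
    length(unique(clust_one[, 2]))
    length(unique(clust_all[, 2]))
    # how many contigs ended up in each cluster?
    sort(table(clust_one[, 2]), decreasing = TRUE)
    sort(table(clust_all[, 2]), decreasing = TRUE)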
113 | 114 | -------------------------------------------------------------------------------- /source/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Metagenomics Workshop SciLifeLab documentation build configuration file, created by 4 | # sphinx-quickstart on Tue May 6 17:38:39 2014. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | 18 | # If extensions (or modules to document with autodoc) are in another directory, 19 | # add these directories to sys.path here. If the directory is relative to the 20 | # documentation root, use os.path.abspath to make it absolute, like shown here. 21 | #sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | #needs_sphinx = '1.0' 27 | 28 | # Add any Sphinx extension module names here, as strings. They can be 29 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 30 | # ones. 31 | extensions = [] 32 | 33 | # Add any paths that contain templates here, relative to this directory. 34 | templates_path = ['_templates'] 35 | 36 | # The suffix of source filenames. 37 | source_suffix = '.rst' 38 | 39 | # The encoding of source files. 40 | #source_encoding = 'utf-8-sig' 41 | 42 | # The master toctree document. 43 | master_doc = 'index' 44 | 45 | # General information about the project. 46 | project = u'Metagenomics Workshop SciLifeLab' 47 | copyright = u'2014, Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson' 48 | 49 | # The version info for the project you're documenting, acts as replacement for 50 | # |version| and |release|, also used in various other places throughout the 51 | # built documents. 52 | # 53 | # The short X.Y version. 54 | version = '1.0' 55 | # The full version, including alpha/beta/rc tags. 56 | release = '1.0' 57 | 58 | # The language for content autogenerated by Sphinx. Refer to documentation 59 | # for a list of supported languages. 60 | #language = None 61 | 62 | # There are two options for replacing |today|: either, you set today to some 63 | # non-false value, then it is used: 64 | #today = '' 65 | # Else, today_fmt is used as the format for a strftime call. 66 | #today_fmt = '%B %d, %Y' 67 | 68 | # List of patterns, relative to source directory, that match files and 69 | # directories to ignore when looking for source files. 70 | exclude_patterns = [] 71 | 72 | # The reST default role (used for this markup: `text`) to use for all 73 | # documents. 74 | #default_role = None 75 | 76 | # If true, '()' will be appended to :func: etc. cross-reference text. 77 | #add_function_parentheses = True 78 | 79 | # If true, the current module name will be prepended to all description 80 | # unit titles (such as .. function::). 81 | #add_module_names = True 82 | 83 | # If true, sectionauthor and moduleauthor directives will be shown in the 84 | # output. They are ignored by default. 85 | #show_authors = False 86 | 87 | # The name of the Pygments (syntax highlighting) style to use. 
88 | pygments_style = 'sphinx' 89 | 90 | # A list of ignored prefixes for module index sorting. 91 | #modindex_common_prefix = [] 92 | 93 | # If true, keep warnings as "system message" paragraphs in the built documents. 94 | #keep_warnings = False 95 | 96 | 97 | # -- Options for HTML output ---------------------------------------------- 98 | 99 | # The theme to use for HTML and HTML Help pages. See the documentation for 100 | # a list of builtin themes. 101 | html_theme = 'default' 102 | 103 | # Theme options are theme-specific and customize the look and feel of a theme 104 | # further. For a list of options available for each theme, see the 105 | # documentation. 106 | #html_theme_options = {} 107 | 108 | # Add any paths that contain custom themes here, relative to this directory. 109 | #html_theme_path = [] 110 | 111 | # The name for this set of Sphinx documents. If None, it defaults to 112 | # " v documentation". 113 | #html_title = None 114 | 115 | # A shorter title for the navigation bar. Default is the same as html_title. 116 | #html_short_title = None 117 | 118 | # The name of an image file (relative to this directory) to place at the top 119 | # of the sidebar. 120 | #html_logo = None 121 | 122 | # The name of an image file (within the static path) to use as favicon of the 123 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 124 | # pixels large. 125 | #html_favicon = None 126 | 127 | # Add any paths that contain custom static files (such as style sheets) here, 128 | # relative to this directory. They are copied after the builtin static files, 129 | # so a file named "default.css" will overwrite the builtin "default.css". 130 | html_static_path = ['_static'] 131 | 132 | # Add any extra paths that contain custom files (such as robots.txt or 133 | # .htaccess) here, relative to this directory. These files are copied 134 | # directly to the root of the documentation. 135 | #html_extra_path = [] 136 | 137 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 138 | # using the given strftime format. 139 | #html_last_updated_fmt = '%b %d, %Y' 140 | 141 | # If true, SmartyPants will be used to convert quotes and dashes to 142 | # typographically correct entities. 143 | #html_use_smartypants = True 144 | 145 | # Custom sidebar templates, maps document names to template names. 146 | #html_sidebars = {} 147 | 148 | # Additional templates that should be rendered to pages, maps page names to 149 | # template names. 150 | #html_additional_pages = {} 151 | 152 | # If false, no module index is generated. 153 | #html_domain_indices = True 154 | 155 | # If false, no index is generated. 156 | #html_use_index = True 157 | 158 | # If true, the index is split into individual pages for each letter. 159 | #html_split_index = False 160 | 161 | # If true, links to the reST sources are added to the pages. 162 | #html_show_sourcelink = True 163 | 164 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 165 | #html_show_sphinx = True 166 | 167 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 168 | #html_show_copyright = True 169 | 170 | # If true, an OpenSearch description file will be output, and all pages will 171 | # contain a tag referring to it. The value of this option must be the 172 | # base URL from which the finished HTML is served. 173 | #html_use_opensearch = '' 174 | 175 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 
176 | #html_file_suffix = None 177 | 178 | # Output file base name for HTML help builder. 179 | htmlhelp_basename = 'MetagenomicsWorkshopSciLifeLabdoc' 180 | 181 | 182 | # -- Options for LaTeX output --------------------------------------------- 183 | 184 | latex_elements = { 185 | # The paper size ('letterpaper' or 'a4paper'). 186 | #'papersize': 'letterpaper', 187 | 188 | # The font size ('10pt', '11pt' or '12pt'). 189 | #'pointsize': '10pt', 190 | 191 | # Additional stuff for the LaTeX preamble. 192 | #'preamble': '', 193 | } 194 | 195 | # Grouping the document tree into LaTeX files. List of tuples 196 | # (source start file, target name, title, 197 | # author, documentclass [howto, manual, or own class]). 198 | latex_documents = [ 199 | ('index', 'MetagenomicsWorkshopSciLifeLab.tex', u'Metagenomics Workshop SciLifeLab Documentation', 200 | u'Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson', 'manual'), 201 | ] 202 | 203 | # The name of an image file (relative to this directory) to place at the top of 204 | # the title page. 205 | #latex_logo = None 206 | 207 | # For "manual" documents, if this is true, then toplevel headings are parts, 208 | # not chapters. 209 | #latex_use_parts = False 210 | 211 | # If true, show page references after internal links. 212 | #latex_show_pagerefs = False 213 | 214 | # If true, show URL addresses after external links. 215 | #latex_show_urls = False 216 | 217 | # Documents to append as an appendix to all manuals. 218 | #latex_appendices = [] 219 | 220 | # If false, no module index is generated. 221 | #latex_domain_indices = True 222 | 223 | 224 | # -- Options for manual page output --------------------------------------- 225 | 226 | # One entry per manual page. List of tuples 227 | # (source start file, name, description, authors, manual section). 228 | man_pages = [ 229 | ('index', 'metagenomicsworkshopscilifelab', u'Metagenomics Workshop SciLifeLab Documentation', 230 | [u'Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson'], 1) 231 | ] 232 | 233 | # If true, show URL addresses after external links. 234 | #man_show_urls = False 235 | 236 | 237 | # -- Options for Texinfo output ------------------------------------------- 238 | 239 | # Grouping the document tree into Texinfo files. List of tuples 240 | # (source start file, target name, title, author, 241 | # dir menu entry, description, category) 242 | texinfo_documents = [ 243 | ('index', 'MetagenomicsWorkshopSciLifeLab', u'Metagenomics Workshop SciLifeLab Documentation', 244 | u'Johannes Alneberg, Johan Bengtsson-Palme, Ino de Bruijn, Luisa Hugerth, Mikael Huss, Thomas Svensson', 'MetagenomicsWorkshopSciLifeLab', 'One line description of project.', 245 | 'Miscellaneous'), 246 | ] 247 | 248 | # Documents to append as an appendix to all manuals. 249 | #texinfo_appendices = [] 250 | 251 | # If false, no module index is generated. 252 | #texinfo_domain_indices = True 253 | 254 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 255 | #texinfo_show_urls = 'footnote' 256 | 257 | # If true, do not generate a @detailmenu in the "Top" node's menu. 
258 | #texinfo_no_detailmenu = False
259 | 
--------------------------------------------------------------------------------
/source/annotation/differential.rst:
--------------------------------------------------------------------------------
1 | ======================================================================
2 | Estimating differentially abundant protein families in the metagenomes
3 | ======================================================================
4 | Finally, we are about to do some real analysis of the data, and look
5 | at the results! To do this, we will use the R statistical program.
6 | You start the program by typing::
7 | 
8 |     R
9 | 
10 | To get out of R, you type ``q()``. You will then be asked if you want
11 | to save your workspace. Typing "y" (yes) might be smart, since that
12 | will remember all your variables until the next time you use R in the
13 | same directory!
14 | 
15 | Loading the count tables
16 | ============================
17 | 
18 | We will begin by loading the count tables from HMMER into R::
19 | 
20 |     b1 = read.table("baltic1.hmmsearch", sep = "")
21 | 
22 | To get the number of entries of each kind, we will use the R command ``rle``.
23 | We want to get the domain list, which is the third column. For ``rle`` to be
24 | able to work with the data, we must also convert it into a proper vector::
25 | 
26 |     raw_counts = rle(as.vector(b1[,3]))
27 |     b1_counts = as.matrix(raw_counts$lengths)
28 |     row.names(b1_counts) = raw_counts$values
29 | 
30 | Repeat this procedure for all four data sets.
31 | 
32 | Apply normalizations
33 | ====================
34 | 
35 | We will now try out the three different normalization methods to see their
36 | effect on the data. First, we will try normalizing to the number of reads
37 | in each sequencing library. Find the note you have taken on the data set sizes.
38 | Then apply a command like this on the data::
39 | 
40 |     b1_norm1 = b1_counts / 118025
41 | 
42 | You will now see very small numbers (on the order of 10^-5 to 10^-6). To make these
43 | numbers more interpretable, let's also multiply them by 1,000,000 to yield the counts
44 | per million reads::
45 | 
46 |     b1_norm1 = b1_counts / 118025 * 1000000
47 | 
48 | Do the same thing for the other data sets.
49 | 
50 | We would then like to compare all four data sets to each other. Since R's
51 | ``merge`` function only handles two data sets at a time, I have provided this
52 | function for merging four data sets. Copy and paste it into the R console::
53 | 
54 |     merge_four = function(a,b,c,d,names) {
55 |       m1 = merge(a,b,by = "row.names", all = TRUE)
56 |       row.names(m1) = m1[,1]
57 |       m1 = m1[,2:3]
58 |       m2 = merge(c, m1, by = "row.names", all = TRUE)
59 |       row.names(m2) = m2[,1]
60 |       m2 = m2[,2:4]
61 |       m3 = merge(d, m2, by = "row.names", all = TRUE)
62 |       row.names(m3) = m3[,1]
63 |       m3 = m3[,2:5]
64 |       m3[is.na(m3)] = 0
65 |       colnames(m3) = c(names[4], names[3], names[1], names[2])
66 |       return(as.matrix(m3))
67 |     }
68 | 
69 | You can then try it by running this command on the raw counts::
70 | 
71 |     norm0 = merge_four(b1_counts,b2_counts,swe_counts,ind_counts,c("Baltic 1","Baltic 2","Sweden", "India"))
72 | 
73 | You should then see a matrix containing all counts from the four data
74 | sets, with each row corresponding to a Pfam family. Next, run the same
75 | command on the normalized data and store the output into a variable, called
76 | for example ``norm1``.
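Completing that step for the other three libraries could look something like the sketch below; the library sizes used here are placeholders only, so substitute the read counts from your own notes::

    # placeholder library sizes, replace them with the numbers from your notes
    b2_reads  = 100000
    swe_reads = 100000
    ind_reads = 100000
    b2_norm1  = b2_counts  / b2_reads  * 1000000
    swe_norm1 = swe_counts / swe_reads * 1000000
    ind_norm1 = ind_counts / ind_reads * 1000000
    norm1 = merge_four(b1_norm1, b2_norm1, swe_norm1, ind_norm1,
                       c("Baltic 1", "Baltic 2", "Sweden", "India"))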
The total abundance of mobility domains can then be
77 | visualized using the following command::
78 | 
79 |     barplot(colSums(norm1))
80 | 
81 | We can then repeat the normalization procedure, by instead normalizing to
82 | the number of 16S rRNA counts in each library. This can be done similarly
83 | to the division by the total number of reads above::
84 | 
85 |     b1_norm2 = b1_counts / 21
86 | 
87 | This time, we won't multiply by a million, as that would make the numbers
88 | much larger (and harder to interpret).
89 | 
90 | Follow the above procedure for all the data sets, and finally store the
91 | end result from ``merge_four`` into a variable, for example called ``norm2``.
92 | 
93 | Finally, we will do the same for the third type of normalization, the
94 | division by the mapped number of reads. This can, once more, be done as
95 | above::
96 | 
97 |     b1_norm3 = b1_counts / 22
98 | 
99 | Follow the above procedure for all the data sets, and store the final
100 | result from ``merge_four`` into a variable, for example called ``norm3``.
101 | 
102 | A note on saving plots
103 | ======================
104 | Note that if you would like to save your plots to a PDF file, you can run
105 | the command::
106 | 
107 |     pdf("output_file_name.pdf", width = 10, height = 10)
108 | 
109 | and then run all the R commands as normal. Instead of getting
110 | plots printed on the screen, all the plots will be written to the specified
111 | PDF file, and can later be viewed in e.g. Acrobat Reader. When you are
112 | finished plotting you can finalize the PDF file using the command::
113 | 
114 |     dev.off()
115 | 
116 | This closes the PDF and enables other software to read it. Please note that
117 | it will be considered a "broken" PDF until the ``dev.off()`` command is run!
118 | 
119 | Comparing normalizations
120 | ========================
121 | 
122 | Let us now quickly compare the normalization methods. As a quick
123 | overview, we can make four colorful barplots next to each other, one for the
124 | raw counts and one for each normalization method::
125 | 
126 |     layout(matrix(c(1,3,2,4),2,2))
127 |     barplot(norm0, col = 1:nrow(norm1), main = "Raw gene counts")
128 |     barplot(norm1, col = 1:nrow(norm1), main = "Counts per million reads")
129 |     barplot(norm2, col = 1:nrow(norm2), main = "Counts per 16S rRNA")
130 |     barplot(norm3, col = 1:nrow(norm3), main = "Relative abundance")
131 | 
132 | As you can see, each of these plots will tell a slightly different story.
133 | Let's take a closer look at how normalization affects the behavior of some
134 | genes. First, we can see if there are any genes that are present in all
135 | samples. This is easily investigated by the following command, which checks
136 | whether each value is larger than zero, counts the number of such occurrences
137 | per row (``rowSums``), and finally outputs all the rows from ``norm1`` where
138 | this sum is exactly four::
139 | 
140 |     norm1[rowSums(norm1 > 0) == 4,]
141 | 
142 | If that didn't give you much luck, you can check whether you can find any genes
143 | that occur in at least three samples::
144 | 
145 |     norm1[rowSums(norm1 > 0) >= 3,]
146 | 
147 | Select one of those and find out its row number in the count table.
148 | Hint: ``row.names(norm1)`` will help you here!
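One way of doing this (the family name below is just a hypothetical example, use one that actually showed up in your own output) could be::

    # list the names and row numbers of the families present in at least three samples
    which(rowSums(norm1 > 0) >= 3)
    # or look up the row number of a particular family by name
    x = which(row.names(norm1) == "ABC_tran")   # hypothetical family name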
Now let's make barplots for
149 | that row only::
150 | 
151 |     x = <the row number you selected>
152 |     layout(matrix(c(1,3,2,4),2,2))
153 |     barplot(norm0[x,], main = paste(row.names(norm1)[x], "- Raw gene counts"))
154 |     barplot(norm1[x,], main = paste(row.names(norm1)[x], "- Counts per million reads"))
155 |     barplot(norm2[x,], main = paste(row.names(norm2)[x], "- Counts per 16S rRNA"))
156 |     barplot(norm3[x,], main = paste(row.names(norm3)[x], "- Relative abundance"))
157 | 
158 | You can now try this for a number of other genes (by changing the value of
159 | ``x``) and see how normalization affects your story.
160 | 
161 | **Question: Which normalization method would be most suitable to use in this case? Why?**
162 | 
163 | 
164 | Visualizing differences in gene abundance
165 | =========================================
166 | 
167 | One neat way of visualizing metagenomic count data is through heatmaps. R has a built-in
168 | heatmap function, which can be called using the (surprise...) ``heatmap`` command.
169 | However, you will quickly notice that this function is rather limited, and we will
170 | therefore install a package containing a better one: the ``gplots`` package. You can do
171 | this by typing the following command::
172 | 
173 |     install.packages("gplots")
174 | 
175 | Just answer "yes" to the questions, and the package will be installed locally for your
176 | user. After installation you load the package by typing::
177 | 
178 |     library(gplots)
179 | 
180 | After this, you will be able to use the more powerful ``heatmap.2`` command. Try,
181 | for example, this command on the data::
182 | 
183 |     heatmap.2(norm1, trace = "none", col = colorpanel(255,"black","red","yellow"), margin = c(5,10), cexCol = 1, cexRow = 0.7)
184 | 
185 | The trace, margin, cexCol and cexRow options are just there to make the plot look better
186 | (play around with them if you wish). The ``col = colorpanel(255,"black","red","yellow")``
187 | option creates a scale from black to yellow, where yellow means highly abundant and black
188 | lowly abundant. To make it clearer which genes are not detected at all, let's add a
189 | grey color for genes with zero counts::
190 | 
191 |     heatmap.2(norm1, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
192 | 
193 | You will now notice that it is hard to see the differences for the lowly abundant genes.
194 | To aid in this, we can apply a variance-stabilizing transform (a fancy name for the square root)
195 | to the data::
196 | 
197 |     norm1_sqrt = sqrt(norm1)
198 | 
199 | You can then re-run the ``heatmap.2`` command on the newly created ``norm1_sqrt``
200 | variable.
201 | 
202 | Sometimes it makes more sense to apply a logarithmic transform to the data instead of
203 | the square root. This, however, is a bit more tricky since we have zeros in the data.
204 | For fun's sake, we can try::
205 | 
206 |     norm1_log10 = log10(norm1)
207 |     heatmap.2(norm1_log10, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
208 | 
209 | This should give you an error message. The easiest way to solve this problem is to add
210 | some small number to the matrix before the ``log10`` command. Since we will display this
211 | number with a grey color anyway, it will not, in this case and for this application, matter
212 | much exactly what number you add.
You can, for example, choose 1::
213 | 
214 |     norm1_log10 = log10(norm1 + 1)
215 |     heatmap.2(norm1_log10, trace = "none", col = c("grey",colorpanel(255,"black","red","yellow")), margin = c(5,10), cexCol = 1, cexRow = 0.7)
216 | 
217 | Before we end, let's also try another kind of commonly used visualization, the PCA plot.
218 | Principal Component Analysis (PCA) essentially works by projecting complex data onto a
219 | 2D (or 3D) surface, while trying to separate the data points as much as possible. This
220 | can be useful for finding groups of observations that fit together. We will use the built-in
221 | PCA command called ``prcomp``::
222 | 
223 |     norm1_pca = prcomp(norm1_sqrt)
224 | 
225 | Note that we used the data created with the variance stabilizing transform. There are more
226 | sophisticated ways of reducing the influence of very large values, but often the
227 | square root is sufficient. We can visualize the PCA using a plotting command called ``biplot``::
228 | 
229 |     layout(1)
230 |     biplot(norm1_pca, cex = 0.5)
231 | 
232 | To see the proportion of variance explained by the different components, we can use the
233 | normal plot command::
234 | 
235 |     plot(norm1_pca)
236 | 
237 | We want the first two bars to be as large as possible, since that means that the dataset
238 | can easily be simplified to two dimensions. If all bars are of roughly equal height, the
239 | projection onto a 2D surface has lost much of the information in the data, and
240 | we cannot trust the patterns in the PCA plot as much.
241 | 
242 | If we do the PCA on the relative abundance data (normalization three), we get a view
243 | of which Pfam domains dominate in these samples::
244 | 
245 |     norm3_pca = prcomp(norm3)
246 |     biplot(norm3_pca, cex = 0.5)
247 | 
248 | And that's the end of the lab. If you have lots of time to spare, you can move on to the
249 | bonus exercise, in which we will analyze the 16S rRNA data generated by Metaxa2 further,
250 | to understand which bacterial species are present in the samples.
251 | 
--------------------------------------------------------------------------------