├── contents
│   ├── 06_f3
│   │   ├── F3-Statistics.md
│   │   ├── f3-tree.png
│   │   ├── f3singleSample.png
│   │   └── f3.rst
│   ├── 05_pca
│   │   ├── .DS_Store
│   │   ├── fullPCA.png
│   │   ├── pca_simple.png
│   │   ├── pcaWithSomeColor.png
│   │   ├── pcaWithPopGroupColor.png
│   │   └── pca.rst
│   ├── 03_sexdet
│   │   ├── sexDetExample.png
│   │   └── sexdet.rst
│   ├── 07_admixture
│   │   ├── admixturePlot.png
│   │   ├── admixturePlotWithLabels.png
│   │   └── admixture.rst
│   ├── 01_intro
│   │   └── intro.rst
│   ├── 04_genotyping
│   │   └── genotyping.rst
│   └── 02_schmutzi
│       └── schmutzi.rst
├── .gitignore
├── index.rst
├── Makefile
└── conf.py
/contents/06_f3/F3-Statistics.md:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | _build/
2 | .DS_STORE
3 |
--------------------------------------------------------------------------------
/contents/05_pca/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/05_pca/.DS_Store
--------------------------------------------------------------------------------
/contents/06_f3/f3-tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/06_f3/f3-tree.png
--------------------------------------------------------------------------------
/contents/05_pca/fullPCA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/05_pca/fullPCA.png
--------------------------------------------------------------------------------
/contents/05_pca/pca_simple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/05_pca/pca_simple.png
--------------------------------------------------------------------------------
/contents/06_f3/f3singleSample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/06_f3/f3singleSample.png
--------------------------------------------------------------------------------
/contents/03_sexdet/sexDetExample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/03_sexdet/sexDetExample.png
--------------------------------------------------------------------------------
/contents/05_pca/pcaWithSomeColor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/05_pca/pcaWithSomeColor.png
--------------------------------------------------------------------------------
/contents/07_admixture/admixturePlot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/07_admixture/admixturePlot.png
--------------------------------------------------------------------------------
/contents/05_pca/pcaWithPopGroupColor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/05_pca/pcaWithPopGroupColor.png
--------------------------------------------------------------------------------
/contents/07_admixture/admixturePlotWithLabels.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stschiff/GAworkshop/HEAD/contents/07_admixture/admixturePlotWithLabels.png
--------------------------------------------------------------------------------
/index.rst:
--------------------------------------------------------------------------------
1 | Welcome to the Genome Analysis Workshop
2 | =======================================
3 |
4 | This is the central documentation / handout for a workshop held from May 9 to May 11, 2016 by `Dr. Stephan Schiffels `_, `Alexander Peltzer `_ and `Stephen Clayton `_ at the `Max-Planck-Institute for the Science of Human History, Jena/Germany `_.
5 |
6 |
7 | Contents:
8 |
9 | .. toctree::
10 | :maxdepth: 2
11 |
12 | contents/01_intro/intro
13 | contents/02_schmutzi/schmutzi
14 | contents/03_sexdet/sexdet
15 | contents/04_genotyping/genotyping
16 | contents/05_pca/pca
17 | contents/06_f3/f3
18 | contents/07_admixture/admixture
19 |
20 | Indices and tables
21 | ==================
22 |
23 | * :ref:`genindex`
24 | * :ref:`modindex`
25 | * :ref:`search`
26 |
--------------------------------------------------------------------------------
/contents/07_admixture/admixture.rst:
--------------------------------------------------------------------------------
1 | ADMIXTURE analysis
2 | ==================
3 |
4 | `Admixture `_ is a very useful and popular tool to analyse SNP data. It performs an unsupervised clustering of large numbers of samples, and allows each individual to be a mixture of clusters.
5 |
6 | The excellent documentation of ADMIXTURE can be found at ``/projects1/tools/admixture_1.3.0/admixture-manual.pdf``.
7 |
8 | Converting to Plink format
9 | --------------------------
10 |
11 | ADMIXTURE requires input data in `PLINK bed format `_, unlike the EigenStrat format that we have already used for both :ref:`pca` and :ref:`f3`. Fortunately, it is easy to convert from EigenStrat to bed using Eigensoft's tool ``convertf``, which is installed on the cluster. To use it, you again have to prepare a parameter file with the following syntax::
12 |
13 | genotypename:
14 | snpname:
15 | indivname:
16 | outputformat: PACKEDPED
17 | genotypeoutname:
18 | snpoutname:
19 | indivoutname:
20 |
21 | Note that ``PACKEDPED`` is ``convertf``'s name for the ``BED`` format. For ``admixture``, the output file endings are crucial: please use ``*.bed``, ``*.bim`` and ``*.fam``. You can then start the conversion by running
22 |
23 | .. code-block:: bash
24 |
25 | sbatch --mem=2000 -o $LOG_FILE --wrap="convertf -p $PARAMETER_FILE"
26 |
27 | This should finish in a few minutes.
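
For reference, a filled-in version of the ``convertf`` parameter file could look like this (the file names are placeholders for your own merged data set)::

    genotypename: MyProject.HO.eigenstrat.merged.geno.txt
    snpname: MyProject.HO.eigenstrat.merged.snp.txt
    indivname: MyProject.HO.eigenstrat.merged.ind.txt
    outputformat: PACKEDPED
    genotypeoutname: MyProject.HO.merged.bed
    snpoutname: MyProject.HO.merged.bim
    indivoutname: MyProject.HO.merged.fam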
28 |
29 | Running ADMIXTURE
30 | -----------------
31 |
32 | You can now run ``admixture``. The basic syntax is dead-easy: ``admixture $INPUT.bed $K``. Here ``$K`` is a number indicating the number of clusters you want to infer.
33 |
34 | How do we know what number of clusters K to use? Well, first of all you should run several values of K, e.g. all numbers between K=3 and K=12 for a start. Then, ``admixture`` has a built-in method to evaluate a "cross-validation error" for each K. Computing this cross-validation error simply requires passing the flag ``--cv`` right after the call to ``admixture``. However, since this tool already takes relatively long to run, we will omit this feature here, but it is strongly recommended for real analyses.
35 |
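If you do use ``--cv``, each run reports its cross-validation error on standard output in a line of the form ``CV error (K=6): ...``. A minimal sketch for collecting these values, assuming each run's output was redirected into one log file per K in your output directory (the log file names here are just an example):

.. code-block:: bash

    # Collect the cross-validation errors from all runs; the K with the
    # lowest CV error is usually a reasonable choice
    grep -h "CV error" $OUTDIR/admixture.K*.log
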
36 | One slightly awkward behaviour of ``admixture`` is that it simply writes its output into the directory you start it from. Also, you cannot name the output files manually; they are named after the input file, but instead of the ending ``.bed`` it produces two files with endings ``.$K.Q`` and ``.$K.P``. Here, ``$K`` is again the number of clusters you have chosen.
37 |
38 | OK, so here is a template for a submission script:
39 |
40 | .. code-block:: bash
41 |
42 | #!/usr/bin/env bash
43 |
44 | BED_FILE=...
45 | OUTDIR=...
46 | mkdir -p $OUTDIR
47 |
48 | for K in {3..12}; do
49 | CMD="cd $OUTDIR; admixture -j8 $BED_FILE $K" #Normally you should give --cv as first option to admixture
50 | echo $CMD
51 | # sbatch -c 8 --mem 12000 --wrap="$CMD"
52 | done
53 |
54 | Note that the command is a sequence of two commands: first ``cd`` into the output directory, then run ``admixture``, so that the output files end up where we want them.
55 |
56 | If things run successfully, you should now have a ``.Q`` and a ``.P`` file in your output directory for every ``K`` that you ran.
57 |
58 | Plotting in R
59 | -------------
60 |
61 | Here is the code for making the typical ADMIXTURE-barplot for K=6:
62 |
63 | .. code-block:: R
64 |
65 | tbl=read.table("~/Data/MyProject/admixture/MyProject.HO.merged.6.Q")
66 | indTable = read.table("~/Data/MyProject/admixture/MyProject.HO.merged.ind",
67 | col.names = c("Sample", "Sex", "Pop"))
68 | popGroups = read.table("~/Google Drive/GA_Workshop Jena/HO_popGroups.txt", col.names=c("Pop", "PopGroup"))
69 |
70 | mergedAdmixtureTable = cbind(tbl, indTable)
71 | mergedAdmWithPopGroups = merge(mergedAdmixtureTable, popGroups, by="Pop")
72 | ordered = mergedAdmWithPopGroups[order(mergedAdmWithPopGroups$PopGroup),]
73 | barplot(t(as.matrix(subset(ordered, select=V1:V6))), col=rainbow(6), border=NA)
74 |
75 | which gives:
76 |
77 | .. image:: admixturePlot.png
78 | :width: 600px
79 | :height: 500px
80 | :align: center
81 |
82 | OK, so this is already something, at least the continental groups are ordered, but we would like to also display the population group names below the axis. For this, we'll write a function in R:
83 |
84 | .. code-block:: R
85 |
86 | barNaming <- function(vec) {
87 | retVec <- vec
88 | for(k in 2:length(vec)) {
89 | if(vec[k-1] == vec[k])
90 | retVec[k] <- ""
91 | }
92 | return(retVec)
93 | }
94 |
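As a quick sanity check of what this helper does (a made-up example):

.. code-block:: R

    barNaming(c("African", "African", "Asian", "Asian", "European"))
    # returns: "African" ""  "Asian" ""  "European"
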
95 | With this function in place, we can then replot:
96 |
97 | .. code-block:: R
98 |
99 | par(mar=c(10,4,4,4))
100 | barplot(t(as.matrix(ordered[,2:7])), col=rainbow(6), border=NA,
101 | names.arg=barNaming(ordered$PopGroup), las=2)
102 |
103 | which should give:
104 |
105 | .. image:: admixturePlotWithLabels.png
106 | :width: 600px
107 | :height: 500px
108 | :align: center
109 |
110 |
--------------------------------------------------------------------------------
/contents/01_intro/intro.rst:
--------------------------------------------------------------------------------
1 | .. _intro:
2 |
3 | Introduction
4 | ============
5 | This session will introduce some basics of the command line environment and how to connect to the compute cluster at google.
6 |
7 | Basic software
8 | --------------
9 |
10 | There is some software that you will want to install for this session.
11 |
12 | Mac OSX
13 | ~~~~~~~
14 |
15 | Direct access to cluster storage:
16 |
17 | - http://osxfuse.github.io/
18 |
19 | - You need to install FUSE and SSHFS
20 |
21 | An editor for text files and scripts:
22 |
23 | - https://atom.io
24 |
25 | Some things that you may want to have on your Mac but that you don't need right now:
26 |
27 | - https://www.macports.org
28 |
29 | - Software packaged for your Mac
30 |
31 | - https://developer.apple.com/downloads/
32 |
33 | - Apple developer command line tools
34 |
35 | Ubuntu
36 | ~~~~~~
37 | .. code-block:: bash
38 |
39 | sudo apt-get install sshfs
40 |
41 | The command line environment
42 | ----------------------------
43 |
44 | The command line environment allows you to interact with the compute system using text. The standard mode of interaction is as follows:
45 |
46 | - **R**\ ead
47 | - **E**\ valuate
48 | - **P**\ rint
49 | - **L**\ oop
50 |
51 | The commands that you type are read and evaluated by a program called the **shell**. The window that displays the shell program is usually called the **terminal**.
52 |
53 | - Terminal - displays the shell program
54 | - Shell - the **REPL**
55 |
56 | We interact with the shell program using text. The shell typically reads input one line at a time. Therefore we commonly call the interaction with the shell via a terminal the 'command line'.
57 |
58 | Using the bash shell
59 | ~~~~~~~~~~~~~~~~~~~~
60 |
61 | There are a few different shell programs available but we will use 'bash'.
62 |
63 | .. code-block:: bash
64 |
65 | # To find out what shell you're using
66 | ps -ef | grep $$ | grep -v "grep\|ps -ef"
67 | 502 44414 44413 0 21Apr16 ttys003 0:03.96 -bash
68 |
69 | Bash reads your input and evaluates it. Some words are recognised by bash and interpreted as commands. Some words are part of bash, others depend on settings that you can modify. Bash is in fact a programming language. You can find a manual here:
70 |
71 | - https://www.gnu.org/software/bash/manual/bashref.html
72 |
73 | The environment
74 |
75 | Many programs (including bash) need to be able to find things or know about the system. A universal way to supply this information is via environment variables.
76 |
77 | - The environment is the set of variables and their values that is currently visible to you.
78 | - A variable is simply a value that we can refer to by its name.
79 |
80 | You can set environment variables using the 'export' command.
81 |
82 | .. code-block:: bash
83 |
84 | # Here we prepend to the environment variable called 'PATH'
85 | export PATH="/Users/clayton/SoftwareLocal/bin:$PATH"
86 |
87 | The PATH variable is important because it supplies a list of directories that will be searched for commands. This means that words that are not built in bash commands will be recognised as commands if the word matches an **executable** file in a directory from the list.
88 |
89 | An example
90 |
91 | .. code-block:: bash
92 |
93 | is_this_my_environment
94 | Yes, but we can make it better! Repeat after me:
95 | export PATH="/projects1/tools/workshop/2016/GenomicAnalysis/one/bin:$PATH"
96 | is_this_my_environment
97 | # Good advice
98 | export PATH="/projects1/tools/workshop/2016/GenomicAnalysis/one/bin:$PATH"
99 | is_this_my_environment
100 | Yes, and now you know how to make it better!
101 |
102 | We ran the command 'is_this_my_environment' twice but got different results??
103 |
104 | - Actually this is very useful
105 | - We say what should be done
106 | - The system takes care of how
107 |
108 | Connecting to the cluster
109 |
110 | 1. Command line
111 |
112 | .. code-block:: bash
113 |
114 | ssh google.co.uk
115 | # To make this easier you can add the following to your ssh config
116 | cat ~/.ssh/config
117 | Host google
118 | HostName google.co.uk
119 | User clayton
120 |
121 | # This will let you connect to the cluster without as much typing
122 | ssh google
123 |
124 |
125 | 2. Storage
126 |
127 | .. code-block:: bash
128 |
129 | mkdir -p /Users/jane/Mount/google_home
130 | # If you are on Ubuntu then you should omit the -ovolname option
131 | sshfs jane@google: /Users/jane/Mount/google_home -ovolname=google
132 |
133 | The environment on the cluster
134 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
135 |
136 | On the google cluster you have the option of using a pre-configured environment. In your home directory on the cluster you can add the following to your bash profile.
137 |
138 | .. code-block:: bash
139 |
140 | cat ~/.bash_profile
141 | source /projects/profiles/etc/google_default
142 | # logout and login again for this to take effect
143 | exit
144 | ssh google
145 | # You now have the latest versions of software installed at google on your path
146 | # We can find the location of a program using which
147 | which bash
148 | /bin/bash
149 |
150 | If you have mounted your google home folder then you can do this:
151 |
152 | .. code-block:: bash
153 |
154 | touch ~/Mount/google_home/.bash_profile
155 | open -a Atom ~/Mount/google_home/.bash_profile
156 |
157 | And now add the following line and save:
158 |
159 | .. code-block:: bash
160 |
161 | source /projects/profiles/etc/google_default
162 |
163 | Next time you log in to google using ssh, this setting will take effect.
164 |
165 | Doing things with bash
166 | ~~~~~~~~~~~~~~~~~~~~~~
167 |
168 | Here are some examples to get you started.
169 |
170 | - http://www.tldp.org/LDP/abs/html/
171 |
172 | The basics
173 |
174 | There are some basic operators that you should be familiar with:
175 |
176 | .. code-block:: bash
177 |
178 | |   pipe (pass the output of one command to the next)
179 | >   greater than (redirect output, e.g. to a file)
180 | &   ampersand (run a command in the background)
181 |
182 | There are some variables that are set by bash. These are useful for seeing if your commands worked.
183 |
184 | .. code-block:: bash
185 |
186 | # The exit code of the last command you ran
187 | echo $?
188 |
189 | A key concept in bash is chaining commands to create a pipeline. Each command does something to your data and passes it to the next command. We can use the pipe
operator to pass data to the next command (instead of printing it on the screen).
190 |
191 | .. code-block:: bash
192 |
193 | # Echo produces the string 'cat'.
194 | # Tr replaces the letter 'c' with 'b'
195 | echo "cat" | tr "c" "b"
196 | bat
197 | # What if our pipeline is much more complicated?
198 | # What happens if a step in the pipeline fails?
199 | echo "not working" | broken_command | sed -e 's/not //'
200 | echo "$?"
201 | 0
202 | # The exit code of the last command 'sed' was 0 (success) and yet our pipeline failed
203 | # We have to tell bash to give us the exit code of failing commands instead
204 | # We do this by using the set builtin and the pipefail option
205 | set -o pipefail
206 | echo "not working" | broken_command | sed -e 's/not //'
207 | echo "$?"
208 | 127
209 |
210 | You can run your commands in different ways. This is useful if you want to run things in parallel or want to use the results in your program.
211 |
212 | .. code-block:: bash
213 |
214 | # Run the command in a sub shell
215 | result=$(echo "the cat on the mat")
216 | # Run the command without waiting for the result
217 | echo "the cat sat on the mat" &
218 |
219 | Useful things
220 |
221 | If you want to repeat the same command for different inputs then looping is useful. There are some different ways to write loops, depending on the data you have.
222 |
223 | .. code-block:: bash
224 |
225 | #result=$(cat words.txt)
226 | result=$(echo "the cat was black")
227 | echo "${result}"
228 |
229 | for i in $result;
230 | do
231 | if [[ $i == "black" ]]; then
232 | echo "white"
233 | else
234 | echo "${i}"
235 | fi
236 | done
237 |
238 | if [[ -e words.txt ]]; then
239 | echo "I found the words"
240 | fi
241 |
242 | The spaces (especially around ``[[`` and ``]]``) are important.
243 |
244 | A very useful program is **xargs**: it lets you create new commands from your input text.
245 |
246 | .. code-block:: bash
247 |
248 | echo "This is about cats" > words.txt
249 | echo "This is about dogs" > other_words.txt
250 | echo -e "words.txt\nother_words.txt" | xargs -I '{}' cat {}
251 | This is about cats
252 | This is about dogs
253 | # If you want to see what your command will be before you run it then
254 | # you can run the echo program to produce your command as text
255 | echo -e "words.txt\nother_words.txt" | xargs -I '{}' echo "cat {}"
256 | cat words.txt
257 | cat other_words.txt
258 | # To run these lines you can pipe them into bash
259 | echo -e "words.txt\nother_words.txt" | xargs -I '{}' echo "cat {}" | bash
260 | This is about cats
261 | This is about dogs
262 | # By piping commands into bash as text it is easy to achieve complex tasks
263 | # e.g. Creating copy or move commands using awk to build the destination
264 | # file path using components of the source path
265 |
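Here is a small sketch of that last idea, reusing the two text files from above (the ``backup`` destination directory is just an example):

.. code-block:: bash

    # Build one "cp SRC DEST" command per file with awk, inspect the generated
    # text first, then pipe it into bash to actually run the commands
    mkdir -p backup
    ls *.txt | awk '{print "cp", $1, "backup/"$1}'
    cp other_words.txt backup/other_words.txt
    cp words.txt backup/words.txt
    ls *.txt | awk '{print "cp", $1, "backup/"$1}' | bash
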
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | # Makefile for Sphinx documentation
2 | #
3 |
4 | # You can set these variables from the command line.
5 | SPHINXOPTS =
6 | SPHINXBUILD = sphinx-build
7 | PAPER =
8 | BUILDDIR = _build
9 |
10 | # User-friendly check for sphinx-build
11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don\'t have Sphinx installed, grab it from http://sphinx-doc.org/)
13 | endif
14 |
15 | # Internal variables.
16 | PAPEROPT_a4 = -D latex_paper_size=a4
17 | PAPEROPT_letter = -D latex_paper_size=letter
18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
19 | # the i18n builder cannot share the environment and doctrees with the others
20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
21 |
22 | .PHONY: help
23 | help:
24 | @echo "Please use \`make ' where is one of"
25 | @echo " html to make standalone HTML files"
26 | @echo " dirhtml to make HTML files named index.html in directories"
27 | @echo " singlehtml to make a single large HTML file"
28 | @echo " pickle to make pickle files"
29 | @echo " json to make JSON files"
30 | @echo " htmlhelp to make HTML files and a HTML help project"
31 | @echo " qthelp to make HTML files and a qthelp project"
32 | @echo " applehelp to make an Apple Help Book"
33 | @echo " devhelp to make HTML files and a Devhelp project"
34 | @echo " epub to make an epub"
35 | @echo " epub3 to make an epub3"
36 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
37 | @echo " latexpdf to make LaTeX files and run them through pdflatex"
38 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
39 | @echo " text to make text files"
40 | @echo " man to make manual pages"
41 | @echo " texinfo to make Texinfo files"
42 | @echo " info to make Texinfo files and run them through makeinfo"
43 | @echo " gettext to make PO message catalogs"
44 | @echo " changes to make an overview of all changed/added/deprecated items"
45 | @echo " xml to make Docutils-native XML files"
46 | @echo " pseudoxml to make pseudoxml-XML files for display purposes"
47 | @echo " linkcheck to check all external links for integrity"
48 | @echo " doctest to run all doctests embedded in the documentation (if enabled)"
49 | @echo " coverage to run coverage check of the documentation (if enabled)"
50 | @echo " dummy to check syntax errors of document sources"
51 |
52 | .PHONY: clean
53 | clean:
54 | rm -rf $(BUILDDIR)/*
55 |
56 | .PHONY: html
57 | html:
58 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
59 | @echo
60 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
61 |
62 | .PHONY: dirhtml
63 | dirhtml:
64 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
65 | @echo
66 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
67 |
68 | .PHONY: singlehtml
69 | singlehtml:
70 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
71 | @echo
72 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
73 |
74 | .PHONY: pickle
75 | pickle:
76 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
77 | @echo
78 | @echo "Build finished; now you can process the pickle files."
79 |
80 | .PHONY: json
81 | json:
82 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
83 | @echo
84 | @echo "Build finished; now you can process the JSON files."
85 |
86 | .PHONY: htmlhelp
87 | htmlhelp:
88 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
89 | @echo
90 | @echo "Build finished; now you can run HTML Help Workshop with the" \
91 | ".hhp project file in $(BUILDDIR)/htmlhelp."
92 |
93 | .PHONY: qthelp
94 | qthelp:
95 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
96 | @echo
97 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \
98 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
99 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/GAWorkshop.qhcp"
100 | @echo "To view the help file:"
101 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/GAWorkshop.qhc"
102 |
103 | .PHONY: applehelp
104 | applehelp:
105 | $(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
106 | @echo
107 | @echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
108 | @echo "N.B. You won't be able to view it unless you put it in" \
109 | "~/Library/Documentation/Help or install it in your application" \
110 | "bundle."
111 |
112 | .PHONY: devhelp
113 | devhelp:
114 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
115 | @echo
116 | @echo "Build finished."
117 | @echo "To view the help file:"
118 | @echo "# mkdir -p $$HOME/.local/share/devhelp/GAWorkshop"
119 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/GAWorkshop"
120 | @echo "# devhelp"
121 |
122 | .PHONY: epub
123 | epub:
124 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
125 | @echo
126 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub."
127 |
128 | .PHONY: epub3
129 | epub3:
130 | $(SPHINXBUILD) -b epub3 $(ALLSPHINXOPTS) $(BUILDDIR)/epub3
131 | @echo
132 | @echo "Build finished. The epub3 file is in $(BUILDDIR)/epub3."
133 |
134 | .PHONY: latex
135 | latex:
136 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
137 | @echo
138 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
139 | @echo "Run \`make' in that directory to run these through (pdf)latex" \
140 | "(use \`make latexpdf' here to do that automatically)."
141 |
142 | .PHONY: latexpdf
143 | latexpdf:
144 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
145 | @echo "Running LaTeX files through pdflatex..."
146 | $(MAKE) -C $(BUILDDIR)/latex all-pdf
147 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
148 |
149 | .PHONY: latexpdfja
150 | latexpdfja:
151 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
152 | @echo "Running LaTeX files through platex and dvipdfmx..."
153 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
154 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
155 |
156 | .PHONY: text
157 | text:
158 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
159 | @echo
160 | @echo "Build finished. The text files are in $(BUILDDIR)/text."
161 |
162 | .PHONY: man
163 | man:
164 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
165 | @echo
166 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man."
167 |
168 | .PHONY: texinfo
169 | texinfo:
170 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
171 | @echo
172 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
173 | @echo "Run \`make' in that directory to run these through makeinfo" \
174 | "(use \`make info' here to do that automatically)."
175 |
176 | .PHONY: info
177 | info:
178 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
179 | @echo "Running Texinfo files through makeinfo..."
180 | make -C $(BUILDDIR)/texinfo info
181 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
182 |
183 | .PHONY: gettext
184 | gettext:
185 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
186 | @echo
187 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
188 |
189 | .PHONY: changes
190 | changes:
191 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
192 | @echo
193 | @echo "The overview file is in $(BUILDDIR)/changes."
194 |
195 | .PHONY: linkcheck
196 | linkcheck:
197 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
198 | @echo
199 | @echo "Link check complete; look for any errors in the above output " \
200 | "or in $(BUILDDIR)/linkcheck/output.txt."
201 |
202 | .PHONY: doctest
203 | doctest:
204 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
205 | @echo "Testing of doctests in the sources finished, look at the " \
206 | "results in $(BUILDDIR)/doctest/output.txt."
207 |
208 | .PHONY: coverage
209 | coverage:
210 | $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
211 | @echo "Testing of coverage in the sources finished, look at the " \
212 | "results in $(BUILDDIR)/coverage/python.txt."
213 |
214 | .PHONY: xml
215 | xml:
216 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
217 | @echo
218 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml."
219 |
220 | .PHONY: pseudoxml
221 | pseudoxml:
222 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
223 | @echo
224 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
225 |
226 | .PHONY: dummy
227 | dummy:
228 | $(SPHINXBUILD) -b dummy $(ALLSPHINXOPTS) $(BUILDDIR)/dummy
229 | @echo
230 | @echo "Build finished. Dummy builder generates no files."
231 |
--------------------------------------------------------------------------------
/conf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # GAWorkshop documentation build configuration file, created by
4 | # sphinx-quickstart on Tue May 10 16:35:33 2016.
5 | #
6 | # This file is execfile()d with the current directory set to its
7 | # containing dir.
8 | #
9 | # Note that not all possible configuration values are present in this
10 | # autogenerated file.
11 | #
12 | # All configuration values have a default; values that are commented out
13 | # serve to show the default.
14 |
15 | import sys
16 | import os
17 |
18 | # If extensions (or modules to document with autodoc) are in another directory,
19 | # add these directories to sys.path here. If the directory is relative to the
20 | # documentation root, use os.path.abspath to make it absolute, like shown here.
21 | #sys.path.insert(0, os.path.abspath('.'))
22 |
23 | # -- General configuration ------------------------------------------------
24 |
25 | # If your documentation needs a minimal Sphinx version, state it here.
26 | #needs_sphinx = '1.0'
27 |
28 | # Add any Sphinx extension module names here, as strings. They can be
29 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
30 | # ones.
31 | extensions = []
32 |
33 | # Add any paths that contain templates here, relative to this directory.
34 | templates_path = ['_templates']
35 |
36 | # The suffix(es) of source filenames.
37 | # You can specify multiple suffix as a list of string:
38 | # source_suffix = ['.rst', '.md']
39 | source_suffix = '.rst'
40 |
41 | # The encoding of source files.
42 | #source_encoding = 'utf-8-sig'
43 |
44 | # The master toctree document.
45 | master_doc = 'index'
46 |
47 | # General information about the project.
48 | project = u'GAWorkshop'
49 | copyright = u'2016, Schiffels S, Peltzer A, Clayton S'
50 | author = u'Schiffels S, Peltzer A, Clayton S'
51 |
52 | # The version info for the project you're documenting, acts as replacement for
53 | # |version| and |release|, also used in various other places throughout the
54 | # built documents.
55 | #
56 | # The short X.Y version.
57 | version = u'1'
58 | # The full version, including alpha/beta/rc tags.
59 | release = u'1'
60 |
61 | # The language for content autogenerated by Sphinx. Refer to documentation
62 | # for a list of supported languages.
63 | #
64 | # This is also used if you do content translation via gettext catalogs.
65 | # Usually you set "language" from the command line for these cases.
66 | language = None
67 |
68 | # There are two options for replacing |today|: either, you set today to some
69 | # non-false value, then it is used:
70 | #today = ''
71 | # Else, today_fmt is used as the format for a strftime call.
72 | #today_fmt = '%B %d, %Y'
73 |
74 | # List of patterns, relative to source directory, that match files and
75 | # directories to ignore when looking for source files.
76 | # This patterns also effect to html_static_path and html_extra_path
77 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
78 |
79 | # The reST default role (used for this markup: `text`) to use for all
80 | # documents.
81 | #default_role = None
82 |
83 | # If true, '()' will be appended to :func: etc. cross-reference text.
84 | #add_function_parentheses = True
85 |
86 | # If true, the current module name will be prepended to all description
87 | # unit titles (such as .. function::).
88 | #add_module_names = True
89 |
90 | # If true, sectionauthor and moduleauthor directives will be shown in the
91 | # output. They are ignored by default.
92 | #show_authors = False
93 |
94 | # The name of the Pygments (syntax highlighting) style to use.
95 | pygments_style = 'sphinx'
96 |
97 | # A list of ignored prefixes for module index sorting.
98 | #modindex_common_prefix = []
99 |
100 | # If true, keep warnings as "system message" paragraphs in the built documents.
101 | #keep_warnings = False
102 |
103 | # If true, `todo` and `todoList` produce output, else they produce nothing.
104 | todo_include_todos = False
105 |
106 |
107 | # -- Options for HTML output ----------------------------------------------
108 |
109 | # The theme to use for HTML and HTML Help pages. See the documentation for
110 | # a list of builtin themes.
111 | html_theme = 'sphinx_rtd_theme'
112 |
113 | # Theme options are theme-specific and customize the look and feel of a theme
114 | # further. For a list of options available for each theme, see the
115 | # documentation.
116 | #html_theme_options = {}
117 |
118 | # Add any paths that contain custom themes here, relative to this directory.
119 | #html_theme_path = []
120 |
121 | # The name for this set of Sphinx documents.
122 | # " v documentation" by default.
123 | #html_title = u'GAWorkshop v1'
124 |
125 | # A shorter title for the navigation bar. Default is the same as html_title.
126 | #html_short_title = None
127 |
128 | # The name of an image file (relative to this directory) to place at the top
129 | # of the sidebar.
130 | #html_logo = None
131 |
132 | # The name of an image file (relative to this directory) to use as a favicon of
133 | # the docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
134 | # pixels large.
135 | #html_favicon = None
136 |
137 | # Add any paths that contain custom static files (such as style sheets) here,
138 | # relative to this directory. They are copied after the builtin static files,
139 | # so a file named "default.css" will overwrite the builtin "default.css".
140 | html_static_path = ['_static']
141 |
142 | # Add any extra paths that contain custom files (such as robots.txt or
143 | # .htaccess) here, relative to this directory. These files are copied
144 | # directly to the root of the documentation.
145 | #html_extra_path = []
146 |
147 | # If not None, a 'Last updated on:' timestamp is inserted at every page
148 | # bottom, using the given strftime format.
149 | # The empty string is equivalent to '%b %d, %Y'.
150 | #html_last_updated_fmt = None
151 |
152 | # If true, SmartyPants will be used to convert quotes and dashes to
153 | # typographically correct entities.
154 | #html_use_smartypants = True
155 |
156 | # Custom sidebar templates, maps document names to template names.
157 | #html_sidebars = {}
158 |
159 | # Additional templates that should be rendered to pages, maps page names to
160 | # template names.
161 | #html_additional_pages = {}
162 |
163 | # If false, no module index is generated.
164 | #html_domain_indices = True
165 |
166 | # If false, no index is generated.
167 | #html_use_index = True
168 |
169 | # If true, the index is split into individual pages for each letter.
170 | #html_split_index = False
171 |
172 | # If true, links to the reST sources are added to the pages.
173 | #html_show_sourcelink = True
174 |
175 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
176 | #html_show_sphinx = True
177 |
178 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
179 | #html_show_copyright = True
180 |
181 | # If true, an OpenSearch description file will be output, and all pages will
182 | # contain a <link> tag referring to it. The value of this option must be the
183 | # base URL from which the finished HTML is served.
184 | #html_use_opensearch = ''
185 |
186 | # This is the file name suffix for HTML files (e.g. ".xhtml").
187 | #html_file_suffix = None
188 |
189 | # Language to be used for generating the HTML full-text search index.
190 | # Sphinx supports the following languages:
191 | # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
192 | # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh'
193 | #html_search_language = 'en'
194 |
195 | # A dictionary with options for the search language support, empty by default.
196 | # 'ja' uses this config value.
197 | # 'zh' user can custom change `jieba` dictionary path.
198 | #html_search_options = {'type': 'default'}
199 |
200 | # The name of a javascript file (relative to the configuration directory) that
201 | # implements a search results scorer. If empty, the default will be used.
202 | #html_search_scorer = 'scorer.js'
203 |
204 | # Output file base name for HTML help builder.
205 | htmlhelp_basename = 'GAWorkshopdoc'
206 |
207 | # -- Options for LaTeX output ---------------------------------------------
208 |
209 | latex_elements = {
210 | # The paper size ('letterpaper' or 'a4paper').
211 | #'papersize': 'letterpaper',
212 |
213 | # The font size ('10pt', '11pt' or '12pt').
214 | #'pointsize': '10pt',
215 |
216 | # Additional stuff for the LaTeX preamble.
217 | #'preamble': '',
218 |
219 | # Latex figure (float) alignment
220 | #'figure_align': 'htbp',
221 | }
222 |
223 | # Grouping the document tree into LaTeX files. List of tuples
224 | # (source start file, target name, title,
225 | # author, documentclass [howto, manual, or own class]).
226 | latex_documents = [
227 | (master_doc, 'GAWorkshop.tex', u'GAWorkshop Documentation',
228 | u'Schiffels S, Peltzer A, Clayton S', 'manual'),
229 | ]
230 |
231 | # The name of an image file (relative to this directory) to place at the top of
232 | # the title page.
233 | #latex_logo = None
234 |
235 | # For "manual" documents, if this is true, then toplevel headings are parts,
236 | # not chapters.
237 | #latex_use_parts = False
238 |
239 | # If true, show page references after internal links.
240 | #latex_show_pagerefs = False
241 |
242 | # If true, show URL addresses after external links.
243 | #latex_show_urls = False
244 |
245 | # Documents to append as an appendix to all manuals.
246 | #latex_appendices = []
247 |
248 | # If false, no module index is generated.
249 | #latex_domain_indices = True
250 |
251 |
252 | # -- Options for manual page output ---------------------------------------
253 |
254 | # One entry per manual page. List of tuples
255 | # (source start file, name, description, authors, manual section).
256 | man_pages = [
257 | (master_doc, 'gaworkshop', u'GAWorkshop Documentation',
258 | [author], 1)
259 | ]
260 |
261 | # If true, show URL addresses after external links.
262 | #man_show_urls = False
263 |
264 |
265 | # -- Options for Texinfo output -------------------------------------------
266 |
267 | # Grouping the document tree into Texinfo files. List of tuples
268 | # (source start file, target name, title, author,
269 | # dir menu entry, description, category)
270 | texinfo_documents = [
271 | (master_doc, 'GAWorkshop', u'GAWorkshop Documentation',
272 | author, 'GAWorkshop', 'One line description of project.',
273 | 'Miscellaneous'),
274 | ]
275 |
276 | # Documents to append as an appendix to all manuals.
277 | #texinfo_appendices = []
278 |
279 | # If false, no module index is generated.
280 | #texinfo_domain_indices = True
281 |
282 | # How to display URL addresses: 'footnote', 'no', or 'inline'.
283 | #texinfo_show_urls = 'footnote'
284 |
285 | # If true, do not generate a @detailmenu in the "Top" node's menu.
286 | #texinfo_no_detailmenu = False
287 |
--------------------------------------------------------------------------------
/contents/05_pca/pca.rst:
--------------------------------------------------------------------------------
1 | .. _pca:
2 |
3 | Principal Component Analysis
4 | ============================
5 |
6 | In this lesson we'll make a principal component plot. For that we will use the program ``smartpca``, again from the `Eigensoft package `_. The recommended way to perform PCA involving low coverage test samples is to construct the Eigenvectors only from the high quality set of modern samples in the HO set, and then simply project the ancient or low coverage samples onto these Eigenvectors. This allows one to project even samples with as few as 10,000 SNPs into the PCA plot (compared with ~600,000 SNPs for HO samples).
7 |
8 | Running SmartPCA
9 | ----------------
10 |
11 | So the first thing to decide is which populations to use to construct the PCA. You can find a complete list of HO populations at ``/projects1/users/schiffels/PublicData/HumanOriginsData.backup/HO_populations.txt``. You can in principle use all of these populations to construct the Eigenvectors. Note however that this will fill the first principal components with global human diversity axes, like African/Non-African, Asian/European, Native American... So it depends on your particular research question whether you want to narrow down the populations used to construct the PCA. For example, if you are working on Native American samples, you may want to consider running a PCA with only Native American and perhaps Siberian populations. You can of course also make several runs with different subsets of populations.
12 |
13 | .. note::
14 |
15 | In any case, if you want to restrict the samples used for constructing the PCA, you should copy the populations file above to your directory and modify it accordingly, i.e. remove populations you do not want. At the very least, I recommend removing the following "populations": Chimp, Denisovan, Gorilla, hg19ref, Iceman, LaBrana, LBK, Loschbour, MA1, Macaque, Marmoset, Mezmaiskaya, Motala, Orangutan, Saqqaq, Swedish Farmer, Swedish HunterGatherer, Vindija light.
16 |
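One quick way to do this filtering is with ``grep`` (a sketch; ``remove_populations.txt`` is a file you would create yourself, with one population name per line):

.. code-block:: bash

    # Keep every HO population except those listed in remove_populations.txt
    grep -v -w -F -f remove_populations.txt HO_populations.txt > pca_populations.txt
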
17 | Now we need to build a parameter file for ``smartpca``. Mine looks like this::
18 |
19 | genotypename: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.geno.txt
20 | snpname: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.snp.txt
21 | indivname: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.ind.txt
22 | evecoutname: /data/schiffels/GAworkshop/pca/MyProject.HO.merged.pca.evec.txt
23 | evaloutname: /data/schiffels/GAworkshop/pca/MyProject.HO.merged.pca.eval.txt
24 | poplistname: /home/adminschif/GAworkshop/pca_populations.txt
25 | lsqproject: YES
26 |
27 | The first three lines contain the three genotype files you generated from merging your test samples with the HO data set, i.e. the output of the ``mergeit`` program. The next two lines name two output files, and you have to make sure that the directory these two files will be written into exists. The next line points to a file with the list of populations that you want to use to construct the PCA, as discussed above. The last line contains a flag that is recommended for low coverage and ancient data.
28 |
29 | You can now run ``smartpca`` on that parameter file and submit to SLURM via:
30 |
31 | .. code-block:: bash
32 |
33 | sbatch --mem=8000 -o /data/schiffels/GAworkshop/pca/smartpca.log --wrap="smartpca -p smartpca.params.txt"
34 |
35 | Here I reserved 8GB of memory, which I would recommend for a large data set such as HO. Once finished, transfer the resulting ``*.evec.txt`` file back to your laptop.
36 |
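If you set up the ``google`` ssh alias from the introduction session, that transfer could look like this (run on your laptop; the local target directory is just an example):

.. code-block:: bash

    mkdir -p ~/Data/GAworkshop/pca
    scp google:/data/schiffels/GAworkshop/pca/MyProject.HO.merged.pca.evec.txt ~/Data/GAworkshop/pca/
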
37 | Plotting the results
38 | --------------------
39 |
40 | There are several ways to make nice publication-quality plots (Excel is usually not one of them). Popular tools include `R `_ and `matplotlib `_. A relatively new development which is highly recommended is the `Jupyter Notebook `_, which can be used with R, matplotlib and many other scientific computing environments. I personally also heavily use the commercial software `DataGraph `_; you can download a free demo if you are interested. To keep it simple, here we will use R, because it works right out of the box and is easy to install. So please go ahead and install R from the website if you don't already have it installed.
41 |
42 | When you start up R, you get the console, into which you can interactively type commands, including plot commands, and look at the results on screen. Here are my first two commands:
43 |
44 | .. code-block:: R
45 |
46 | fn = "~/Data/GAworkshop/pca/MyProject.HO.merged.pca.evec.txt"
47 | evecDat = read.table(fn, col.names=c("Sample", "PC1", "PC2", "PC3", "PC4", "PC5",
48 | "PC6", "PC7", "PC8", "PC9", "PC10", "Pop"))
49 |
50 | where ``fn`` obviously should point to the ``*.evec.txt`` file produced by ``smartpca``. The ``read.table`` command reads the data into a so-called "DataFrame", which is pretty much an ordinary table with some extra features. We explicitly say what the 12 column names should be in the command, as you can see. The first column is the sample name, the last is the population name, and the middle 10 columns denote the first 10 principal components for all samples. You can have a look at the data frame by typing ``head(evecDat)``, which should yield::
51 |
52 | Sample PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 Pop
53 | 1 SA1004 0.0549 0.0100 -0.0502 -0.0016 0.0003 0.0009 0.0016 0.0409 -0.0368 -0.0625 Khomani
54 | 2 SA064 0.0502 0.0083 -0.0619 -0.0038 0.0020 0.0016 0.0055 0.0664 -0.0424 -0.0878 Khomani
55 | 3 SA073 0.0552 0.0110 -0.0600 -0.0043 0.0007 -0.0001 0.0007 0.0411 -0.0393 -0.0868 Khomani
56 | 4 SA078 0.0465 0.0058 -0.0749 -0.0024 -0.0041 0.0011 -0.0101 0.0307 -0.0330 -0.0671 Khomani
57 | 5 SA083 0.0418 0.0047 -0.0631 -0.0040 0.0041 0.0004 0.0035 0.0620 -0.0372 -0.0855 Khomani
58 | 6 BOT2.031 0.0668 0.0148 -0.0888 -0.0040 -0.0011 -0.0027 -0.0068 -0.0310 0.0096 0.0072 Taa_West
59 |
60 | You see that it's pretty much a table. You can now very easily produce a plot of PC1 vs. PC2, by typing ``plot(evecDat$PC1, evecDat$PC2, xlab="PC1", ylab="PC2")``, which in my case yields a boring figure like this:
61 |
62 | .. image:: pca_simple.png
63 | :width: 500px
64 | :height: 500px
65 | :align: center
66 |
67 | Now, obviously, we would like to highlight the different populations by color. A quick and dirty solution is to simply plot different subsets of the data on top of each other, like this::
68 |
69 | plot(evecDat$PC1, evecDat$PC2, xlab="PC1", ylab="PC2")
70 | d = evecDat[evecDat$Pop=="Yoruba",]
71 | points(d$PC1, d$PC2, col="red", pch=20)
72 | d = evecDat[evecDat$Pop=="French",]
73 | points(d$PC1, d$PC2, col="blue", pch=20)
74 | d = evecDat[evecDat$Pop=="Han",]
75 | points(d$PC1, d$PC2, col="green", pch=20)
76 |
77 | You can copy and paste all those lines simultaneously into the console, by the way. This sequence of commands gives us:
78 |
79 | .. image:: pcaWithSomeColor.png
80 | :width: 500px
81 | :height: 500px
82 | :align: center
83 |
84 | OK, but how do we systematically show all the interesting populations? In principle, R makes this easily possible: instead of choosing a single color and symbol (the ``col`` and ``pch`` options), you can give R vectors for these options, containing one value for each sample. To make this clearer, run ``plot(evecDat$PC1, evecDat$PC2, col=evecDat$Pop)``, which should produce a *very* colorful, but also useless, plot, where each population has its own color (although R cycles through only 8 colors, so each color is used for many populations). This is not useful yet; we need a broader categorization into continental groups.
85 |
86 | The approach I came up with involves first making a new tabular file with two columns, denoting the continental group that each population belongs to, like this::
87 |
88 | BantuKenya African
89 | BantuSA African
90 | Canary_Islanders African
91 | Dinka African
92 | Ethiopian_Jew African
93 | Mayan NativeAmerican
94 | Mixe NativeAmerican
95 | Mixtec NativeAmerican
96 | Quechua NativeAmerican
97 | Surui NativeAmerican
98 | Ticuna NativeAmerican
99 | Zapotec NativeAmerican
100 | Algerian NorthAfrican
101 | Egyptian NorthAfrican
102 | Libyan_Jew NorthAfrican
103 | Moroccan_Jew NorthAfrican
104 | Tunisian NorthAfrican
105 | Tunisian_Jew NorthAfrican
106 | ...
107 |
108 | The names in the first column should be taken from the population names in your merged ``*.ind.txt`` file that you input to ``smartpca``. An example file can be found in the Google Drive folder under ``HO_popGroups.txt``. You can load this file into a data frame in R via::
109 |
110 | popGroups=read.table("~/Google_Drive/Projects/GAworkshopScripts/HO_popGroups.txt", col.names=c("Pop", "PopGroup"))
111 |
112 | You can again convince yourself that it worked by typing ``head(popGroups)``. We can now make use of a very convenient feature in R which lets us easily merge two data frames together. What we need is a new data frame which consists of the ``evecDat`` data frame, but with an additional column indicating the continental group. This involves a lookup in ``popGroups`` for every population in ``evecDat``. This command does the job::
113 |
114 | mergedEvecDat = merge(evecDat, popGroups, by="Pop")
115 |
116 | You can see via ``head(mergedEvecDat)``::
117 |
118 | Pop Sample PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PopGroup
119 | 1 Abkhasian abh107 -0.0080 -0.0211 -0.0040 -0.0003 0.0073 -0.0025 0.0096 -0.0204 -0.0052 -0.0126 Asian
120 | 2 Abkhasian abh133 -0.0077 -0.0217 -0.0043 -0.0006 0.0073 -0.0022 0.0081 -0.0222 -0.0053 -0.0137 Asian
121 | 3 Abkhasian abh119 -0.0077 -0.0214 -0.0041 -0.0009 0.0057 -0.0019 0.0109 -0.0205 -0.0043 -0.0147 Asian
122 | 4 Abkhasian abh122 -0.0078 -0.0214 -0.0039 -0.0017 0.0050 -0.0015 0.0082 -0.0171 -0.0042 -0.0116 Asian
123 | 5 Abkhasian abh27 -0.0077 -0.0218 -0.0039 -0.0011 0.0039 -0.0024 0.0076 -0.0205 -0.0055 -0.0121 Asian
124 | 6 Abkhasian abh41 -0.0077 -0.0209 -0.0046 -0.0015 0.0054 -0.0028 0.0047 -0.0208 -0.0078 -0.0130 Asian
125 |
126 | that there is now a new column to the right called ``PopGroup``, which correctly contains the group for each sample. Note that this new data frame only contains rows for populations that are actually in your ``popGroups`` data set, i.e. in the file you created. You can see this by running ``nrow``::
127 |
128 | > nrow(mergedEvecDat)
129 | [1] 1306
130 | > nrow(evecDat)
131 | [1] 2257
132 |
133 | You see that in my case the ``mergedEvecDat`` only contains 1306 samples, whereas the full data set had 2257 samples. So you can use this to select specific populations you would like to have plotted.
134 |
135 | OK, so now, as a first step, we can improve our simple first plot by using the color to indicate the continental group::
136 |
137 | plot(mergedEvecDat$PC1, mergedEvecDat$PC2, col=mergedEvecDat$PopGroup)
138 | legend("bottomright", legend=levels(mergedEvecDat$PopGroup), col=1:length(levels(mergedEvecDat$PopGroup)), pch=20)
139 |
140 | .. image:: pcaWithPopGroupColor.png
141 | :width: 500px
142 | :height: 500px
143 | :align: center
144 |
145 | The final solution for me was to also separate populations by symbol, which involves a bit more hacking. First, to use different symbols for different populations, you can give a simple vector of symbols to the ``plot`` command via ``pch=as.integer(mergedEvecDat$Pop) %% 24``. The trick here is that first you convert ``mergedEvecDat$Pop`` to an integer enumerating all populations, and then you use the ``modulo`` operation to cycle through 24 different numbers. The complete solution in my case looks like this:
146 |
147 | .. code-block:: R
148 |
149 | fn <- "~/Data/GAworkshop/pca/MyProject.HO.merged.pca.evec.txt"
150 | evecDat <- read.table(fn, col.names=c("Sample", "PC1", "PC2", "PC3", "PC4", "PC5",
151 | "PC6", "PC7", "PC8", "PC9", "PC10", "Pop"))
152 | popGroups <- read.table("~/Google_Drive/GA_workshop Jena/HO_popGroups.txt", col.names=c("Pop", "PopGroup"))
153 | popGroupsWithSymbol <- cbind(popGroups, (1:nrow(popGroups)) %% 26)
154 | colnames(popGroupsWithSymbol)[3] = "symbol"
155 | mergedEvecDat = merge(evecDat, popGroupsWithSymbol, by="Pop")
156 |
157 | layout(matrix(c(1,2), ncol=1), heights=c(1.5, 1))
158 | par(mar=c(4,4,0,0))
159 | plot(mergedEvecDat$PC1, mergedEvecDat$PC2, col=mergedEvecDat$PopGroup, pch=mergedEvecDat$symbol, cex=0.6, cex.axis=0.6, cex.lab=0.6, xlab="PC1", ylab="PC2")
160 | plot.new()
161 | par(mar=rep(0, 4))
162 | legend("center", legend=popGroupsWithSymbol$Pop, col=popGroupsWithSymbol$PopGroup, pch=popGroupsWithSymbol$symbol, ncol=6, cex=0.6)
163 |
164 | which produces:
165 |
166 | .. image:: fullPCA.png
167 | :width: 500px
168 | :height: 500px
169 | :align: center
170 |
171 |
172 | Of course, here I haven't yet included my test individuals, but you can easily see how to include them in the ``HO_popGroups.txt`` file. Also, in ``plot`` you can use the ``xlim`` and ``ylim`` options to zoom into specific areas of the plot, e.g. try ``xlim=c(-0.01,0.01), ylim=c(-0.03,-0.01)`` in the ``plot`` command above.
173 |
--------------------------------------------------------------------------------
/contents/06_f3/f3.rst:
--------------------------------------------------------------------------------
1 | .. _f3:
2 |
3 | Outgroup F3 Statistics
4 | ======================
5 |
6 | Outgroup F3 statistics are a useful analytical tool to understand population relationships. F3 statistics, just as F4 and F2 statistics, measure allele frequency correlations between populations and were introduced by Nick Patterson in his `2012 paper `_.
7 |
8 | F3 statistics are used for two purposes:
9 | i) as a test whether a target population (C) is admixed between two source populations (A and B), and
10 | ii) to measure shared drift between two test populations (A and B) from an outgroup (C). In this session we'll use the second of these use cases.
11 |
12 | F3 statistics are in both cases defined as the average product of the allele frequency differences between population C and populations A and B, respectively::
13 |
14 | F3=<(c-a)(c-b)>
15 |
16 | Here, ``<>`` denotes the average over all genotyped sites, and ``a``, ``b`` and ``c`` denote the allele frequencies in the three populations. Outgroup F3 statistics measure the amount of *shared* genetic drift between two populations from a common ancestor. In a phylogenetic tree connecting A, B and C, Outgroup F3 statistics measure the common branch length from the outgroup, here indicated in red:
17 |
18 | .. image:: f3-tree.png
19 | :width: 300px
20 | :height: 300px
21 | :align: center
22 |
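To make the formula concrete, here is a tiny numerical sketch in R with made-up allele frequencies (purely illustrative; the actual computation, including standard errors, is done by ``qp3Pop`` below):

.. code-block:: R

    # Made-up allele frequencies at 5 SNPs for outgroup C and test populations A and B
    c_freq <- c(0.1, 0.8, 0.5, 0.3, 0.9)
    a_freq <- c(0.2, 0.6, 0.5, 0.4, 0.7)
    b_freq <- c(0.3, 0.5, 0.6, 0.5, 0.6)
    # F3(C; A, B) = <(c - a) * (c - b)>, averaged over sites
    mean((c_freq - a_freq) * (c_freq - b_freq))
    # [1] 0.032
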
23 | For computing F3 statistics including error bars, we will use the ``qp3Pop`` program from `Admixtools `_. You can have a look at the readme for that tool under ``/projects1/tools/adminxtools_3.0/README.3PopTest`` (the typo "adminx" is actually in the path).
24 |
25 | The README and the name of the tool are actually geared towards the first use case of F3 described above, the test for admixture. But since the formula is exactly the same, we can use the same tool for Outgroup F3 Statistics as well. One ingredient that you need is a list of population triples: a file with three population names in each row, separated by spaces, e.g.::
26 |
27 | JK2134 AA Yoruba
28 | JK2134 Abkhasian Yoruba
29 | JK2134 Adygei Yoruba
30 | JK2134 AG2 Yoruba
31 | JK2134 Albanian Yoruba
32 | JK2134 Aleut Yoruba
33 | JK2134 Algerian Yoruba
34 | JK2134 Algonquin Yoruba
35 | JK2134 Altai Yoruba
36 | JK2134 Altaian Yoruba
37 | ...
38 |
39 | Note that in this case the first population is a single sample, the second loops through all HO populations, and the third one is a fixed outgroup, here Yoruba. For Non-African population studies you can use "Mbuti" as outgroup, which is commonly used as an unbiased outgroup to all Non-Africans.
40 |
41 | Analysing groups of samples (populations)
42 | -----------------------------------------
43 |
44 | If you only analyse a single population, or a few, you can manually create lists of population triples. In that case, first locate the list of all Human Origins populations here: ``/projects1/users/schiffels/PublicData/HumanOriginsData.backup/HO_populations.txt``, and construct a file with the desired population triples using an awk one-liner:
45 |
46 | .. code-block:: bash
47 |
48 | awk '{print "YourPopulation", $1, "Mbuti"}' $HO_populations > $OUT
49 |
50 | Here, "YourPopulation" should be replaced by the population in you ``*.ind.txt`` file that you want to focus on, and "Mbuti" is the outgroup (pick another one if appropriate). Then, construct a parameter file like this: ::
51 |
52 | genotypename: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.geno.txt
53 | snpname: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.snp.txt
54 | indivname: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.ind.txt
55 | popfilename:
56 |
57 | and run it via
58 |
59 | .. code-block:: bash
60 |
61 | qp3Pop -p $PARAMETER_FILE > $OUT
62 |
63 | Analysing individual samples
64 | ----------------------------
65 |
66 | In my case, I selected 6 samples that showed low levels of contamination and chose to run them independently through F3 statistics. You may also choose to group test samples together into one population. Here, I created 6 separate population lists like this:
67 |
68 | .. code-block:: bash
69 |
70 | #!/usr/bin/env bash
71 |     OUTDIR=/data/schiffels/GAworkshop/f3stats
72 | mkdir -p $OUTDIR
73 | for SAMPLE in JK2134 JK2918 JK2888 JK2958 JK2911 JK2972; do
74 | HO_POPLIST=/projects1/users/schiffels/PublicData/HumanOriginsData.backup/HO_populations.txt
75 | OUT=$OUTDIR/$SAMPLE.f3stats.poplist.txt
76 | awk -v s=$SAMPLE '{print s, $1, "Mbuti"}' $HO_POPLIST > $OUT
77 | done
78 |
79 | Here, the ``awk`` command loops through all rows in ``$HO_POPLIST`` and prints each of them into a new row, with the sample name in first position (assigned to the awk variable ``s`` through the command line option ``-v s=$SAMPLE``) and "Mbuti" in last position. If you follow a similar approach of looping through multiple samples, you should check the resulting poplist files to make sure they are correct.
80 |
81 | Similar to the ``mergeit`` and the ``smartpca`` programs we have already used, ``qp3Pop`` requires a parameter file as input. In my case, for the first sample it looks like this::
82 |
83 | genotypename: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.geno.txt
84 | snpname: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.snp.txt
85 | indivname: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.ind.txt
86 |     popfilename: /data/schiffels/GAworkshop/f3stats/JK2134.f3stats.poplist.txt
87 |
88 | Important: The ``qp3Pop`` program assumes that all population names in the ``popfilename`` are present in the ``*.ind.txt`` file of the input data, specifically in the third column of that file, which indicates the population. In my case, I intend to compute a separate statistic for each of my ancient samples individually, rather than for an entire population. Therefore, I manually edited the ``*.ind.txt`` file and artificially assigned each of my individuals its own "population", which is simply named after the individual.
89 |
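If you prefer not to edit the file by hand, here is a small sketch of how this per-individual relabelling could be done with ``awk`` (the file names are placeholders, and "JK2134" stands for one of your samples; repeat or extend the condition for the others):

.. code-block:: bash

    # set the population column (3rd) to the individual name (1st) for sample JK2134 only,
    # leaving all other rows, e.g. the reference panel populations, untouched
    awk '$1 == "JK2134" { $3 = $1 } { print }' original.ind.txt > edited.ind.txt
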
90 | The first three lines of the parameter file specify the EIGENSTRAT data set, similar to what we put into the ``smartpca`` parameter file. The fourth parameter denotes the population list we generated above. In my case, I need to prepare 6 such parameter files and submit them all:
91 |
92 | .. code-block:: bash
93 |
94 | #!/usr/bin/env bash
95 |
96 | INDIR=/data/schiffels/GAworkshop/genotyping
97 |     OUTDIR=/data/schiffels/GAworkshop/f3stats
98 | for SAMPLE in JK2134 JK2918 JK2888 JK2958 JK2911 JK2972; do
99 | GENO=$INDIR/MyProject.HO.eigenstrat.merged.geno.txt
100 | SNP=$INDIR/MyProject.HO.eigenstrat.merged.snp.txt
101 |     IND=$INDIR/MyProject.HO.eigenstrat.ind.txt
102 | POPLIST=$OUTDIR/$SAMPLE.f3stats.poplist.txt
103 |
104 | PARAMSFILE=$OUTDIR/$SAMPLE.f3stats.qp3Pop.params.txt
105 | printf "genotypename:\t$GENO\n" > $PARAMSFILE
106 | printf "snpname:\t$SNP\n" >> $PARAMSFILE
107 | printf "indivname:\t$IND\n" >> $PARAMSFILE
108 | printf "popfilename:\t$POPLIST\n" >> $PARAMSFILE
109 |
110 | LOG=$OUTDIR/$SAMPLE.qp3Pop.log
111 | OUT=$OUTDIR/$SAMPLE.qp3Pop.out
112 | sbatch --mem 4000 -o $LOG --wrap="qp3Pop -p $PARAMSFILE > $OUT"
113 | done
114 |
115 | This should run for 10-20 minutes. When finished, transfer the resulting files to your laptop using ``scp``.
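
In my case this could look like the following (user and host names as used elsewhere in this workshop; adjust them to your own account and directories):

.. code-block:: bash

    mkdir -p ~/Data/GAworkshop/f3stats
    scp adminschif@cdag1.cdag.shh.mpg.de:/data/schiffels/GAworkshop/f3stats/*.qp3Pop.out ~/Data/GAworkshop/f3stats/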
116 |
117 | Plotting
118 | --------
119 |
120 | The output from ``qp3Pop`` looks like this::
121 |
122 | parameter file: /tmp/qp3Pop_wrapper35005211521595368
123 | ### THE INPUT PARAMETERS
124 | ##PARAMETER NAME: VALUE
125 | genotypename: /data/schiffels/MyProject/genotyping/MyProject.onlyTVFalse.HO.merged.geno
126 | snpname: /data/schiffels/MyProject/genotyping/MyProject.onlyTVFalse.HO.merged.snp
127 | indivname: /data/schiffels/MyProject/genotyping/MyProject.noGroups.onlyTVFalse.HO.merged.ind
128 | popfilename: /data/schiffels/MyProject/f3stats/JK2134.f3stats.poplist.txt
129 | ## qp3Pop version: 300
130 | nplist: 224
131 | number of blocks for block jackknife: 549
132 | snps: 593655
133 | Source 1 Source 2 Target f_3 std. err Z SNPs
134 | result: JK2134 AA Yoruba 0.026824 0.001010 26.547 56353
135 | result: JK2134 Abkhasian Yoruba 0.147640 0.002229 66.231 56447
136 | result: JK2134 Adygei Yoruba 0.144566 0.002139 67.583 56467
137 | result: JK2134 AG2 Yoruba 0.139170 0.008287 16.794 9499
138 | result: JK2134 Albanian Yoruba 0.149385 0.002321 64.364 56435
139 | result: JK2134 Aleut Yoruba 0.134388 0.002287 58.768 56431
140 | result: JK2134 Algerian Yoruba 0.116380 0.002052 56.727 56416
141 | result: JK2134 Algonquin Yoruba 0.126845 0.002526 50.224 56396
142 | ...
143 |
144 | The key rows are the ones starting with ``result:``. We can exploit that and select all relevant rows using ``grep``. In my case, I can even join the results across all samples using::
145 |
146 | grep 'result:' *.qp3Pop.out
147 |
148 | assuming that I am executing this inside the directory where I copied the per-sample result files. When you run this, the output looks like this::
149 |
150 |     JK2134.qp3Pop.out: result: JK2134 AA Yoruba 0.026824 0.001010 26.547 56353
151 |     JK2134.qp3Pop.out: result: JK2134 Abkhasian Yoruba 0.147640 0.002229 66.231 56447
152 |     JK2134.qp3Pop.out: result: JK2134 Adygei Yoruba 0.144566 0.002139 67.583 56467
153 |     JK2134.qp3Pop.out: result: JK2134 AG2 Yoruba 0.139170 0.008287 16.794 9499
154 |     JK2134.qp3Pop.out: result: JK2134 Albanian Yoruba 0.149385 0.002321 64.364 56435
155 |     JK2134.qp3Pop.out: result: JK2134 Aleut Yoruba 0.134388 0.002287 58.768 56431
156 |     JK2134.qp3Pop.out: result: JK2134 Algerian Yoruba 0.116380 0.002052 56.727 56416
157 |     JK2134.qp3Pop.out: result: JK2134 Algonquin Yoruba 0.126845 0.002526 50.224 56396
158 |     JK2134.qp3Pop.out: result: JK2134 Altai Yoruba 0.004572 0.003126 1.462 48731
159 |     JK2134.qp3Pop.out: result: JK2134 Altaian Yoruba 0.122992 0.002173 56.590 56409
160 | ...
161 |
162 | As you see, we don't want columns 1 and 2. You can use ``awk`` to keep only columns 3 to 9::
163 |
164 | grep 'result:' *.qp3Pop.out | awk '{print $3, $4, $5, $6, $7, $8, $9}' > all.qp3Pop.out
165 |
166 | We can now again load this combined file into R, using::
167 |
168 | f3dat = read.table("~/Data/GAworkshop/f3stats/all.qp3Pop.out",
169 | col.names=c("PopA", "PopB", "PopC", "F3", "StdErr", "Z", "SNPs"))
170 |
171 | Have a look at this via ``head(f3dat)``.
172 |
173 | Now, in my case, with multiple individuals tested, I first want to look at one particular individual separately. For that, I first create a subset of the data::
174 |
175 | s = f3dat[f3dat$PopA == "JK2972",]
176 |
177 | As a second step, we would like to order this in a descending order according to the F3 statistics. Try this::
178 |
179 | head(s[order(-s$F3),])
180 |
181 | which will first order ``s`` in descending order of the ``F3`` column and then print only the first few lines, i.e. those with the highest F3 statistics for that individual. Now save that ordering via::
182 |
183 | sOrdered = s[order(-s$F3),]
184 |
185 | OK, so we now want to plot those highest values including error bars. For that we'll need the ``errbar`` function from the "Hmisc" package, which first has to be installed::
186 |
187 | install.packages("Hmisc")
188 |
189 | from a suitable mirror (for me, the German mirror didn't work; I succeeded with the Belgian one).
190 |
191 | Next, activate that package via ``library(Hmisc)``.
192 |
193 | You should now be able to view the help for ``errbar`` by typing ``?errbar``.
194 |
195 | OK, let's now make a plot::
196 |
197 | errbar(1:40, sOrdered$F3[1:40],
198 | (sOrdered$F3+sOrdered$StdErr)[1:40],
199 | (sOrdered$F3-sOrdered$StdErr)[1:40], pch=20, las=2, cex.axis=0.4, xaxt='n',
200 | xlab="population", ylab="F3")
201 | axis(1, at=1:40, labels=sOrdered$PopB[1:40], las=2, cex.axis=0.6)
202 |
203 | which should yield:
204 |
205 | .. image:: f3singleSample.png
206 | :width: 400px
207 | :height: 400px
208 | :align: center
209 |
210 |
211 | Here is the entire R program (remember to load the ``Hmisc`` package first via ``library(Hmisc)``):
212 |
213 | .. code-block:: R
214 |
215 | f3dat = read.table("~/Data/GAworkshop/f3stats/all.qp3Pop.out",
216 | col.names=c("PopA", "PopB", "PopC", "F3", "StdErr", "Z", "SNPs"))
217 | s = f3dat[f3dat$PopA == "JK2972",]
218 | sOrdered = s[order(-s$F3),]
219 | errbar(1:40, sOrdered$F3[1:40],
220 | (sOrdered$F3+sOrdered$StdErr)[1:40],
221 | (sOrdered$F3-sOrdered$StdErr)[1:40], pch=20, las=2, cex.axis=0.4, xaxt='n',
222 | xlab="population", ylab="F3")
223 | axis(1, at=1:40, labels=sOrdered$PopB[1:40], las=2, cex.axis=0.6)
224 |
225 | You can plot this for other individuals/populations by replacing the subset command (``s=...``) with another selected individual/population.
226 |
227 | Finally, if you want to print this into a PDF, you can simply surround the above commands by::
228 |
229 | pdf("myPDF.pdf")
230 | ...
231 | dev.off()
232 |
233 | which will produce a PDF with the graph in it.
234 |
--------------------------------------------------------------------------------
/contents/04_genotyping/genotyping.rst:
--------------------------------------------------------------------------------
1 | Genotype Calling from Bam Files
2 | ===============================
3 |
4 | In this session we will process the BAM files of your samples to derive genotype calls for a selected set of SNPs. The strategy followed in this workshop will be to analyse (ancient) test samples together with a large set of modern reference populations. While there are several choices on which data set to use as modern reference, here we will use the Human Origins (HO) data set, published in `Lazaridis et al. 2014 `_, which consists of more than 2,300 individuals from world-wide human populations and is therefore a good representation of global human diversity. Those samples were genotyped at around 600,000 SNP positions.
5 |
6 | For our next step, we therefore need to use the list of SNPs in the HO data set and, for each test individual, call a genotype at these SNPs if possible. Following `Haak et al. 2015 `_, we will not attempt to call diploid genotypes for our test samples, but will simply pick a single read covering each SNP position and represent that sample at that position by a single haploid genotype. Fortunately, the population genetic analysis tools are able to deal with this pseudo-haploid genotype data.
7 |
8 | Genotyping the test samples
9 | ---------------------------
10 |
11 | The main workhorse for this genotyping step will be ``simpleBamCaller``, a program I wrote to facilitate this type of calling. You will find it at ``/projects1/tools/workshop/2016/GenomeAnalysisWorkshop/simpleBamCaller``. You should put this program in your path, e.g. by adding that directory to your ``$PATH`` environment variable. An online help for this tool is printed when starting it via ``simpleBamCaller -h``.
12 |
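For example (just a sketch; you may also want to add the export line to your ``~/.bashrc`` so that it persists across logins):

.. code-block:: bash

    export PATH=/projects1/tools/workshop/2016/GenomeAnalysisWorkshop:$PATH
    simpleBamCaller -h
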
13 | A typical command line for chromosome 20 may look like:
14 |
15 | .. code-block:: bash
16 |
17 | simpleBamCaller -f $SNP_FILE -c 20 -r $REF_SEQ -o EigenStrat -e $OUT_PREFIX $BAM1 $BAM2 $BAM3 > $OUT_GENO
18 |
19 | Let's go through the options one by one:
20 |
21 | 1. ``-f $SNP_FILE``: This option gives the file with the list of SNPs and alleles. The program expects a three-column format consisting of chromosome, position, and the two alleles separated by a comma. For the Human Origins SNP panel we have prepared such a file. You can find it at ``/projects1/users/schiffels/PublicData/HumanOriginsData.backup/EuropeData.positions.txt``. In case you have aligned your samples to HG19, whose chromosome names start with ``chr``, you need to use the file ``EuropeData.positions.altChromName.txt`` in the same directory.
22 | 2. ``-c 20`` The chromosome name. In simpleBamCaller you need to call each chromosome separately. It's anyway recommended to parallelise across chromosomes to speed up calling. Note that if your BAM files are aligned to HG19, you need to give the correct chromosome name, i.e. ``chr20`` in this case. However, note that the Human Origins data set uses just plain numbers as chromosomes, so you want to use the option ``--outChrom 20`` in simpleBamCaller to convert the chromosome name to plain numbers.
23 | 3. ``-r $REF_SEQ``. This is the reference sequence that your bam files were aligned to. You can find the reference genomes under ``/projects1/Reference_Genomes``.
24 | 4. ``-o EigenStrat``: This specifies that the output should be in EigenStrat format, which is the format of the HO data set, with which we need to merge our test data. EigenStrat formatted data sets consist of three files: i) a SNP file, with the positions and alleles of each SNP, ii) a genotype file, with one row per SNP and one genotype per individual, and iii) an individual file, listing the samples with their sex and population.
25 | 5. ``-e $OUT_PREFIX``. This will be the file prefix for the output ``*.snp.txt`` and ``*.ind.txt`` files (the genotypes themselves are written to standard output, which we redirect into ``$OUT_GENO``).
26 | 6. ``$BAM1 $BAM2 ...`` These are the bam files of your test samples. Note that all samples are called simultaneously to generate one joint set of EigenStrat files per chromosome.
27 |
28 | You should try the command line, perhaps piping the result into `head` to only look at the first 10 lines of the genotype-output. You may encounter an error of the sort "inconsistent number of genotypes. Check that bam files have different readgroup sample names". In that case you will first have to fix your bam files (see section below). If you do not encounter this error, you can skip the next section.
29 |
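For example (using the same placeholder variables as above):

.. code-block:: bash

    simpleBamCaller -f $SNP_FILE -c 20 -r $REF_SEQ -o EigenStrat -e $OUT_PREFIX $BAM1 $BAM2 $BAM3 | head
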
30 | .. note:: Fixing the bam read group
31 |
32 | For multi-sample calling, bam files need to contain a certain flag in their header specifying the read group. This flag also contains the name of the sample whose data is contained in the BAM file. Standard mapping pipelines like BWA and EAGER might not add this flag in all cases, so we have to do that manually. Fortunately, it is relatively easy to do this using the Picard software. The `documentation `_ of the ``AddOrReplaceReadGroups`` tool shows an example command line, which in our case can look like this:
33 |
34 | .. code-block:: bash
35 |
36 | picard AddOrReplaceReadGroups I=$INPUT_BAM O=$OUTPUT_BAM RGLB=$LIB_NAME RGPL=$PLATFORM RGPU=$PLATFORM_UNIT RGSM=$NAME
37 |
38 | Here, the only really important parameter is ``RGSM=$NAME``, where ``$NAME`` should be your sample name. The other parameters are required but not terribly important: ``$LIB_NAME`` can be some library name, or you can simply use the same name as your sample name. ``$PLATFORM`` should be ``illumina`` but can be any string, ``$PLATFORM_UNIT`` can be ``unit1`` or something like that.
39 |
40 | If you encounter any problems with validation, you can additionally set the parameter ``VALIDATION_STRINGENCY=SILENT``.
41 |
42 | In order to run this tool on all your samples you should submit jobs via a shell script, which in my case looked like this:
43 |
44 | .. code-block:: bash
45 |
46 | #!/usr/bin/env bash
47 |
48 | BAMDIR=/data/schiffels/MyProject/mergedBams.backup
49 |
50 | for SAMPLE in $(ls $BAMDIR); do
51 | BAM=$BAMDIR/$SAMPLE/$SAMPLE.mapped.sorted.rmdup.bam
52 | OUT=$BAMDIR/$SAMPLE/$SAMPLE.mapped.sorted.rmdup.fixedHeader.bam
53 | CMD="picard AddOrReplaceReadGroups I=$BAM O=$OUT RGLB=$SAMPLE RGPL=illumina RGPU=HiSeq RGSM=$SAMPLE"
54 | echo "$CMD"
55 |     # sbatch -c 2 -o $BAMDIR/$SAMPLE/$SAMPLE.readgroups.log --wrap="$CMD"
56 | done
57 |
58 | As in the previous session, write your own script like that, make it executable using ``chmod u+x``, run it, check that the printed commands look correct, and then remove the comment from the `sbatch` line to submit the jobs.
59 |
60 | Continuing with genotyping
61 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
62 |
63 | If the calling pipeline described above works with the fixed bam files, the first lines of the output (using ``head``) should look like this::
64 |
65 | [warning] tag DPR functional, but deprecated. Please switch to `AD` in future.
66 | [mpileup] 5 samples in 5 input files
67 | Set max per-file depth to 1600
68 | 00220
69 | 22000
70 | 00220
71 | 20922
72 | 22092
73 | 22090
74 | 20000
75 | 29992
76 | 22292
77 | 20090
78 |
79 | The first three lines are just output to stderr (they won't appear in the file when you pipe via ``> $OUT_FILE``). The last 10 lines are the called genotypes at the input SNPs. Here, a 2 denotes the reference allele, a 0 the alternative allele, and a 9 missing data. If you also look at the first 10 lines of the `*.snp.txt` file, set via the `-e` option above, you should see something like this::
80 |
81 | 20_97122 20 0 97122 C T
82 | 20_98930 20 0 98930 G A
83 | 20_101362 20 0 101362 G A
84 | 20_108328 20 0 108328 C A
85 | 20_126417 20 0 126417 A G
86 | 20_126914 20 0 126914 C T
87 | 20_126923 20 0 126923 C A
88 | 20_127194 20 0 127194 T C
89 | 20_129063 20 0 129063 G A
90 | 20_140280 20 0 140280 T C
91 |
92 | which is the EigenStrat output for the SNPs. Here, the second column is the chromosome, the fourth column is the position, and the 5th and 6th are the two alleles. Note that simpleBamCaller automatically restricts the calling to the two alleles given in the input file. EigenStrat output also generates an ``*.ind.txt`` file, again set via the ``-e`` prefix flag. We will look at it later.
93 |
94 | OK, so now that you know that it works in principle, you need to again write a little shell script that performs this calling for all samples on all chromosomes. In my case, it looks like this:
95 |
96 | .. code-block:: bash
97 |
98 | #!/usr/bin/env bash
99 |
100 | BAMDIR=/data/schiffels/MyProject/mergedBams.backup
101 | REF=/projects1/Reference_Genomes/Human/hs37d5/hs37d5.fa
102 | SNP_POS=/projects1/users/schiffels/PublicData/HumanOriginsData.backup/EuropeData.positions.autosomes.txt
103 | OUTDIR=/data/schiffels/MyProject/genotyping
104 | mkdir -p $OUTDIR
105 |
106 | BAM_FILES=$(ls $BAMDIR/*/*.mapped.sorted.rmdup.bam | tr '\n' ' ')
107 | for CHR in {1..22}; do
108 | OUTPREFIX=$OUTDIR/MyProject.HO.eigenstrat.chr$CHR
109 | OUTGENO=$OUTPREFIX.geno.txt
110 | CMD="simpleBamCaller -f $SNP_POS -c $CHR -r $REF -o EigenStrat -e $OUTPREFIX $BAM_FILES > $OUTGENO"
111 | echo "$CMD"
112 |     # sbatch -c 2 -o $OUTPREFIX.log --mem=2000 --wrap="$CMD"
113 | done
114 |
115 | Note that I am now looping over the 22 chromosomes instead of over samples (as we have done in other scripts). The line beginning with ``BAM_FILES=...`` looks a bit cryptic. The syntax ``$(...)`` puts the output of the command in parentheses into the ``BAM_FILES`` variable. The ``tr '\n' ' '`` bit takes the listing output from ``ls`` and converts newlines into spaces, such that all bam files are simply placed one after another on the ``simpleBamCaller`` command line. Before you submit, look at the output of this script by piping it into ``less -S``, which will not wrap the very long command lines and allows you to inspect whether all files are given correctly. When you are sure it's correct, remove the comment from the ``sbatch`` line and comment out the ``echo`` line to submit.
116 |
117 | .. note:: A word about DNA damage
118 |
119 |     If the samples you are analysing are ancient samples, the DNA will likely contain DNA damage, i.e. C->T deaminations, which are seen as C->T and G->A substitutions in the BAM files. There are two ways to deal with that. First, if your data is not UDG-treated, so if it contains the full damage, you should restrict your analysis to transversion SNPs only. To that end, simply add the ``-t`` flag to ``simpleBamCaller``, which will automatically output only transversion SNPs. If your data is UDG-treated, you will have much less damage, but you can still see damaged sites, in particular at the ends of the reads in your BAM file. In that case, you probably want to make a modified bam file for each sample, where the first 2 bases on each end of the read are clipped. A useful tool to do that is `TrimBam `_, which we will not discuss here, but which I recommend having a look at if you would like to analyse transition SNPs from UDG-treated libraries.
120 |
121 | Merging across chromosomes
122 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
123 |
124 | Since the EigenStrat format consists of simple text files, where rows denote SNPs, we can simply merge across all chromosomes using the UNIX ``cat`` program. If you ``cd`` to the directory containing the eigenstrat output files for all chromosomes and run ``ls`` you should see something like::
125 |
126 | MyProject.HO.eigenstrat.chr10.geno.txt
127 | MyProject.HO.eigenstrat.chr10.ind.txt
128 | MyProject.HO.eigenstrat.chr10.snp.txt
129 | MyProject.HO.eigenstrat.chr11.geno.txt
130 | MyProject.HO.eigenstrat.chr11.ind.txt
131 | MyProject.HO.eigenstrat.chr11.snp.txt
132 | ...
133 |
134 | A naive way to merge across chromosomes might then simply be:
135 |
136 | .. code-block:: bash
137 |
138 | cat MyProject.HO.eigenstrat.chr*.geno.txt > MyProject.HO.eigenstrat.allChr.geno.txt
139 | cat MyProject.HO.eigenstrat.chr*.snp.txt > MyProject.HO.eigenstrat.allChr.snp.txt
140 |
141 | (Note that the ``*.ind.txt`` file will be treated separately below.) However, these ``cat`` command lines won't do the job correctly, because the shell glob won't list the chromosomes in numerical order (chr1 is followed by chr10, chr11, and so on). To ensure the correct order, I recommend printing all files in a loop in a sub-shell like this:
142 |
143 | .. code-block:: bash
144 |
145 | (for CHR in {1..22}; do cat MyProject.HO.eigenstrat.chr$CHR.geno.txt; done) > MyProject.HO.eigenstrat.allChr.geno.txt
146 | (for CHR in {1..22}; do cat MyProject.HO.eigenstrat.chr$CHR.snp.txt; done) > MyProject.HO.eigenstrat.allChr.snp.txt
147 |
148 | Here, each ``cat`` command only outputs one file at a time, but the entire loop runs in a sub-shell denoted by brackets, whose output will be piped into a file.
149 |
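As a quick sanity check (just a suggestion), the merged ``*.geno.txt`` and ``*.snp.txt`` files should have the same number of rows, one per SNP:

.. code-block:: bash

    wc -l MyProject.HO.eigenstrat.allChr.geno.txt MyProject.HO.eigenstrat.allChr.snp.txt
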
150 | Now, let's deal with the ``*.ind.txt`` file. As you can see, ``simpleBamCaller`` created one ``*.ind.txt`` per chromosome, but we only need one file in the end, so I suggest you simply copy the one from chromosome 1. At the same time, we want to change the third column of the ``*.ind.txt`` file to something more telling than "Unknown". So copy the file from chromosome 1, open it in an editor, and replace every "Unknown" with the population name of the respective sample. The sex (2nd column) does not need to be changed.
151 |
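If you prefer the command line over an editor, here is a small sketch of this step (the population name "MyPopulation" is a placeholder, and the same name is used for all rows here; edit rows individually if your samples belong to different populations):

.. code-block:: bash

    cp MyProject.HO.eigenstrat.chr1.ind.txt MyProject.HO.eigenstrat.ind.txt
    sed -i 's/Unknown/MyPopulation/' MyProject.HO.eigenstrat.ind.txt
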
152 | .. note::
153 |
154 | You should now have three final eigenstrat files for your test samples: A ``*.geno.txt`` file, a ``*.snp.txt`` file and a ``*.ind.txt`` file.
155 |
156 | Merging the test genotypes with the Human Origins data set
157 | ----------------------------------------------------------
158 |
159 | As the last step in this session, we need to merge the data set containing your test samples with the HO reference panel. To do this, we will use the ``mergeit`` program from the `Eigensoft package `_, which is already installed on the cluster.
160 |
161 | This program needs a parameter file that - in my case - looks like this::
162 |
163 | geno1: /projects1/users/schiffels/PublicData/HumanOriginsData.backup/EuropeData.eigenstratgeno.txt
164 | snp1: /projects1/users/schiffels/PublicData/HumanOriginsData.backup/EuropeData.simple.snp.txt
165 | ind1: /projects1/users/schiffels/PublicData/HumanOriginsData.backup/EuropeData.ind.txt
166 | geno2: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.allChr.geno.txt
167 | snp2: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.allChr.snp.txt
168 | ind2: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.ind.txt
169 | genooutfilename: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.geno.txt
170 | snpoutfilename: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.snp.txt
171 | indoutfilename: /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.ind.txt
172 | outputformat: EIGENSTRAT
173 |
174 | If you have such a parameter file, you can run ``mergeit`` simply like this::
175 |
176 | mergeit -p $PARAM_FILE
177 |
178 | and to submit to SLURM::
179 |
180 | sbatch -o $LOG --mem=2000 --wrap="mergeit -p $PARAM_FILE"
181 |
182 | where ``$PARAM_FILE`` should be replaced by your parameter file, of course.
183 |
184 | To test whether it worked correctly, you should check the resulting "indoutfilename" as specified in the parameter file, to see whether it contains both the individuals of the reference panel and those of your test data set.
185 |
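For example (using the file names from the parameter file above), you could count the individuals and inspect the beginning and end of the merged ``*.ind.txt`` file; both the reference panel individuals and your test samples should be present:

.. code-block:: bash

    wc -l /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.ind.txt
    head /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.ind.txt
    tail /data/schiffels/GAworkshop/genotyping/MyProject.HO.eigenstrat.merged.ind.txt
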
186 | Note that the output of the ``mergeit`` program is by default a binary format called "PACKEDANCESTRYMAP", which is fine for smartpca but not for other analyses we'll be doing later, so I explicitly set ``outputformat`` in the parameter file to force the output to be EIGENSTRAT.
187 |
--------------------------------------------------------------------------------
/contents/02_schmutzi/schmutzi.rst:
--------------------------------------------------------------------------------
1 | Schmutzi Workshop
2 | =================
3 |
4 | This is intended to provide a little hands-on experience with schmutzi by `G. Renaud et al `_ and is solely designed to provide some more detailed step-by-step information on how to determine contamination using the tool. If you find bugs in this text, I'm happy to fix them; if you find bugs in the tool itself, please use the project's GitHub repository to open issues for them `here `_. This is *not* my tool, but I happen to be one of the more frequent users of the method, so this is basically a write-up of things I found out myself or with help from the developer(s).
5 |
6 | Detection of contamination on mitochondrial data
7 | ------------------------------------------------
8 |
9 | *schmutzi* can be used to detect human contamination in mitochondrial data. In case you have enough reads mapping onto the mitochondrial reference genome,
10 | you can utilize the methods provided by *schmutzi* to automatically detect contamination from other human sources. This procedure is split into two parts, one performing a damage-based contamination estimation and a second one utilizing a database of mitochondrial allele frequencies. During this workshop, we will work on two human endogenous samples of undisclosed origin that you will have to analyze yourselves: *Sample_A* and *Sample_B*. Of course, you can also transfer your own test cases to our system and apply the methods taught in this course to these instead.
11 |
12 | .. warning::
13 |
14 | The provided test BAM files are only for testing purposes and should not be distributed further.
15 |
16 | .. note::
17 |
18 | You can run these methods and single steps manually but you could also run this in a more concise way instead by creating a simple bash script for your convenience. In case you'd like to do this, follow the simple template provided below:
19 |
20 | .. code-block:: bash
21 |
22 | #!/bin/bash
23 | command1
24 | command2
25 | ...
26 |
27 |
28 | Sample preparation
29 | ------------------
30 |
31 | Subsampling
32 | ^^^^^^^^^^^
33 |
34 | In order to make these use cases computationally feasible, please do not use samples with more than 50x mitochondrial coverage or subsample the samples if you have more coverage than this.
35 |
36 | .. note::
37 |
38 | In case you use our provided sample datasets, you don't need to do any subsampling, as the samples we provide as use cases are small in size anyways.
39 |
40 |
41 | For those of you who would like to use your own datasets, please apply ``samtools view`` to produce a subsampled version of your input data if the input file is too large for our course.
42 |
43 | .. code-block:: bash
44 |
45 | samtools view -b -s 0.2 input.bam -o output.bam
46 |
47 |
48 | This should produce roughly 20% of your input file, taking reads randomly from the sample rather than an ordered subset from the input.
49 |
50 | MD-Tagging
51 | ^^^^^^^^^^
52 |
53 | In order for *schmutzi* to be able to access the data properly, we need to add **MD** tags to the BAM files. **MD** tags can be used by programs to perform e.g. SNP detection without looking up the reference genome, as the tag encodes at which positions of the corresponding read there are matches, mismatches or indels. To add MD tags to your data, use the *samtools calmd* command:
54 |
55 | .. code-block:: bash
56 |
57 | samtools calmd -b Sample_A.bam ../ref/human_MT.fa > Sample_A.MD.bam
58 |
59 | .. warning::
60 |
61 | In order for this to work, you need to ensure to use the **same reference** that you used for mapping/creating your BAM file(s)!
62 |
63 | Damage based *contDeam*
64 | -----------------------
65 |
66 | This can solely be used to determine contamination based on endogenous deamination. This means that if you use, for example, UDG-treated data, *contDeam* will tell you that your sample is severely contaminated (as it shows *no deamination*, or at least *less deamination*). We are only using the default mode of *contDeam*; a more complete documentation can be found on `GitHub `_ for your convenience.
67 |
68 | First, run *contDeam* on each sample individually. As this produces quite an amount of output files during the iterative process, we should create a folder structure to store our output in a logical way.
69 |
70 | .. code-block:: bash
71 |
72 | mkdir -p Results/Sample_A/
73 | mkdir -p Results/Sample_B/
74 |
75 | This creates two folders in our current folder, making it possible to store all the output created by our methods to be applied in a logical way.
76 | Now, we can move on to use *contDeam*:
77 |
78 | .. code-block:: bash
79 |
80 | contDeam --library double --out Results/Sample_A/Sample_A --uselength --ref Ref/human_MT.fa RAW_BAMs/Sample_A.MD.bam
81 | contDeam --library double --out Results/Sample_B/Sample_B --uselength --ref Ref/human_MT.fa RAW_BAMs/Sample_B.MD.bam
82 |
83 | .. note::
84 |
85 |     You should make sure to use the proper command line here: specifying *single* for a double-stranded library would not produce any meaningful results and could thus render your estimate wrong. Make sure to check, before running the command, **which** kind of library you have! Typically you do have double-stranded data, but if you are not certain, check with your sequencing facility or your library preparation records first.
86 |
87 | This should produce something like this on your command line:
88 |
89 | .. code-block:: bash
90 |
91 | Reading BAM to set priors for each read ...
92 | .. done
93 | running cmd /projects1/tools/schmutzi/posteriorDeam.R Results/Sample_A/Sample_A.cont.deam Results/Sample_A/Sample_A.cont.pdf "Posterior probability for contamination\n
94 | amination patterns"
95 | null device
96 | 1
97 | Program finished succesfully
98 | Files created:The plot of the posterior probability is Results/Sample_A/Sample_A.cont.pdf
99 | The contamination estimate is here Results/Sample_A/Sample_A.cont.est
100 | The configuration file is here Results/Sample_A/Sample_A.config
101 |
102 | You may now have a look at the output of this initial contamination estimation run. What do the results look like for *Sample_A* and *Sample_B*? To check this, you can have a look at the generated output using e.g. *cat*:
103 |
104 | .. code-block:: bash
105 |
106 | cat Results/Sample_A/Sample_A.cont.est
107 | 0 0 0.95
108 | cat Results/Sample_B/Sample_B.cont.est
109 | 0 0 0.005
110 |
111 | This means that, based on the deamination patterns, both samples look relatively clean, with an initial lower estimate of 0% contamination, an average of 0%, and an upper estimate of 95% for the first and 0.5% for the second sample. "Relatively" means in this case that *Sample_B* looks completely clean, whereas *Sample_A* has an upper estimate as high as 95%.
112 |
113 | However, you can't trust these results individually if:
114 |
115 | 1. You have less than 500 Million reads (which is very rarely the case)
116 | 2. You don't have enough deamination, less than 5% won't work for example (Attention: UDG treatment!)
117 | 3. Very little / No deamination of the contaminant fragments
118 | 4. (Independence between 5' and 3' deamination rates is required for the Bayesian inference model)
119 |
120 |
121 | This method could in principle be used for running contamination estimates on both nuclear and mitochondrial data; however, I would recommend applying `DICE `_ for samples with nuclear data, or performing other tests (e.g. the X-chromosomal contamination test, see Stephan Schiffels' introduction on this). I will generate a HowTo for DICE in the upcoming weeks, following the `schmutzi` manual here, too.
122 |
123 |
124 | Mitochondrial based *schmutzi*
125 | ------------------------------
126 |
127 | Now that we have successfully estimated contamination using *deamination patterns*, we will proceed by using allele frequencies on mitochondrial data, too. *schmutzi* comes with a database of mitochondrial allele frequencies (the "197" set used below), accompanied by a Eurasian subset of allele frequencies, which can be used for our analysis.
128 |
129 | .. note::
130 |
131 | If you would like to test e.g. for contamination on other organisms, e.g. some other mammals and you do possess enough datasets to generate such a database, you can also generate these frequencies yourself. For more details, follow Gabriel Renaud's HowTo `here `_ .
132 |
133 | Now let's run the *schmutzi* application itself. Prior to doing this, we need to index our MD tagged BAM file first:
134 |
135 | .. code-block:: bash
136 |
137 | samtools index RAW_BAMs/Sample*.MD.bam
138 | schmutzi --ref Ref/human_MT.fa --t 8 Results/Sample_A/Sample_A /projects1/tools/schmutzi/alleleFreqMT/197/freqs/ RAW_BAMs/Sample_A.MD.bam
139 | schmutzi --ref Ref/human_MT.fa --t 8 Results/Sample_B/Sample_B /projects1/tools/schmutzi/alleleFreqMT/197/freqs/ RAW_BAMs/Sample_B.MD.bam
140 |
141 | .. warning::
142 |
143 | Make sure to use the correct **freqs** folder, or the tool will crash.
144 |
145 | The whole process might run for a couple of minutes, mainly depending on the number of CPU cores (``--t 8``) you assigned to your estimation process.
146 |
147 | .. warning::
148 |
149 |     Do not use more CPU cores than available, or the whole system might become unstable. *schmutzi* can be pretty heavy in terms of memory / CPU usage, taking up a lot of your system's computational capacity.
150 |
151 | In the end, this should produce some output:
152 |
153 | .. _output_files:
154 |
155 | .. code-block:: bash
156 |
157 | Reached the maximum number of iterations (3) with stable contamination rate at iteration # 5, exiting
158 | Iterations done
159 | Results:
160 | Contamination estimates for all samples : Results/Sample_A/Sample_A_final_mtcont.out
161 | Contamination estimates for most likely sample : Results/Sample_A/Sample_A_final.cont
162 | Contamination estimates with conf. intervals : Results/Sample_A/Sample_A_final.cont.est
163 | Posterior probability for most likely sample : Results/Sample_A/Sample_A_final.cont.pdf
164 | Endogenous consensus call : Results/Sample_A/Sample_A_final_endo.fa
165 | Endogenous consensus log : Results/Sample_A/Sample_A_final_endo.log
166 | Contaminant consensus call : Results/Sample_A/Sample_A_final_cont.fa
167 | Contaminant consensus log : Results/Sample_A/Sample_A_final_cont.log
168 | Config file : Results/Sample_A/Sample_A.diag.config
169 | Total runtime 248.527676105499 s
170 |
171 | Running a small ``cat`` again to check the results of the contamination analysis:
172 |
173 | .. code-block:: bash
174 |
175 | cat Results/Sample_A/Sample_A_final.cont.est
176 | 0.99 0.98
177 |
178 | This means we have between 98% and 99% contamination in *Sample_A*, making it useless for any downstream analysis.
179 |
180 | Doing the same with our other sample now:
181 |
182 | .. code-block:: bash
183 |
184 | cat Results/Sample_B/Sample_B_final.cont.est
185 | 0.01 0 0.02
186 |
187 | This looks good: the sample seems safe to use for downstream analysis, as it shows a contamination estimate of 1% on average, with a lower bound of 0% and an upper bound of 2%.
188 |
189 | Output interpretation
190 | ---------------------
191 |
192 | ``EST`` Files
193 | ^^^^^^^^^^^^^
194 |
195 | `schmutzi` generates a couple of output files that can be used to determine whether your samples are clean or not. The listing above in
196 | output_files_ describes what kind of output to expect from a successful run of `schmutzi`. The most important file is the one ending in ``est``, as it provides the contamination estimate for your data.
197 |
198 | The content of the ``est`` file should look like this:
199 |
200 | .. code-block:: bash
201 |
202 | X Y Z
203 |
204 | Here X is your average estimate, Y your lower estimate and Z your upper estimate. In some cases you will only see two numbers, which are then your upper and lower bounds, respectively. It depends on the kind of analysis you'd like to perform whether you want to include edge cases with, e.g., an upper contamination estimate of 3% or not.
205 |
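If you ran *schmutzi* on several samples, a small sketch (assuming the ``Results/<sample>/<sample>_final.cont.est`` layout used above) to collect all final estimates into one small table could look like this:

.. code-block:: bash

    for EST in Results/*/*_final.cont.est; do
        SAMPLE=$(basename $EST _final.cont.est)
        printf "%s\t%s\n" "$SAMPLE" "$(cat $EST)"
    done
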
206 | In case you performed a full evaluation using both the *contDeam* and the *schmutzi* tools, you will see several ``est`` files, containing the estimates of each iteration. *schmutzi* iteratively refines the consensus called by the *mtCont* subprogram, meaning that it will provide intermediate results in these files, numbered in ascending order from 1, 2 up to `final`.
207 |
208 | .. code-block:: bash
209 |
210 | Sample_B_1_cont.3p.prof
211 | Sample_B_1_cont.5p.prof
212 | Sample_B_1_cont.est
213 | Sample_B_1_cont.fa
214 | Sample_B_1_cont.freq
215 | Sample_B_1_cont.log
216 | Sample_B_1_endo.3p.prof
217 | Sample_B_1_endo.5p.prof
218 | Sample_B_1_endo.fa
219 | Sample_B_1_endo.log
220 | Sample_B_1_mtcont.out
221 | Sample_B_2_cont.3p.prof
222 | Sample_B_2_cont.5p.prof
223 | Sample_B_2_cont.est
224 | Sample_B_2_cont.fa
225 | Sample_B_2_cont.freq
226 | Sample_B_2_cont.log
227 | Sample_B_2_endo.3p.prof
228 | Sample_B_2_endo.5p.prof
229 | Sample_B_2_endo.fa
230 | Sample_B_2_endo.log
231 | Sample_B_2_mtcont.out
232 |
233 | The first *est* file is based on the *contDeam* results (on the damage patterns), whereas the others are based on the iterative process used when estimating contamination using the mt database.
234 |
235 | ``FA`` Files
236 | ^^^^^^^^^^^^
237 |
238 | These contain the consensus sequences produced for the endogenous and the contaminant part, respectively. Note that these have not been filtered at all and should therefore only be used for determining contamination and not for any downstream analysis.
239 |
240 |
241 | ``Log`` Files
242 | ^^^^^^^^^^^^^
243 |
244 | These files are the raw output schmutzi produces using a Bayesian method to infer the endogenous part of your sample. If you want to run downstream analyses on your data, e.g. calling haplotypes from your mitochondrial consensus, you should apply some filtering to your dataset, which we will do in the next part of our analysis journey.
245 |
246 | Consensus Calling for Downstream analysis
247 | -----------------------------------------
248 |
249 | Filtered Consensus Calling
250 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
251 |
252 | In order to get filtered calls, e.g. no SNPs for regions covered with only a single read, one should apply some filtering criteria:
253 |
254 | .. code-block:: bash
255 |
256 | /projects1/tools/schmutzi/log2fasta -q 20 Results/Sample_A/Sample_A_final_endo.log > Results/Sample_A/Sample_A_q20.fasta
257 | /projects1/tools/schmutzi/log2fasta -q 30 Results/Sample_A/Sample_A_final_endo.log > Results/Sample_A/Sample_A_q30.fasta
258 | /projects1/tools/schmutzi/log2fasta -q 20 Results/Sample_B/Sample_B_final_endo.log > Results/Sample_B/Sample_B_q20.fasta
259 | /projects1/tools/schmutzi/log2fasta -q 30 Results/Sample_B/Sample_B_final_endo.log > Results/Sample_B/Sample_B_q30.fasta
260 |
261 |
262 | It is advisable to choose these parameters increasingly, e.g. with a range of ``-q 20, -q 30, -q 40, -q 50`` and check whether you still have enough diagnostic positions in the end.
263 |
264 | A good way to determine whether we have a lot of undefined positions relative to the reference genome is to run the above command several times with increasing quality cutoffs, to find an acceptable trade-off between filtering and preserving enough information for the analysis.
265 |
266 | .. code-block:: bash
267 |
268 | tr -d -c 'N' < Results/Sample_A/Sample_A_q20.fasta | awk '{ print length; }'
269 | 16,569
270 | tr -d -c 'N' < Results/Sample_A/Sample_A_q30.fasta | awk '{ print length; }'
271 | 16,569
272 |
273 | As you see, for our `Sample_A` the output doesn't change: essentially all positions are 'N', i.e. they were already filtered out even with such light filtering (q20, q30). That basically tells us that `Sample_A` is of bad quality, likely contaminated and potentially poorly covered, too. As you might recall, this is expected, since `schmutzi` declared this sample to be heavily contaminated anyway. Therefore we now repeat this for `Sample_B` to see whether it behaves better:
274 |
275 | .. code-block:: bash
276 |
277 | tr -d -c 'N' < Results/Sample_B/Sample_B_q20.fasta | awk '{ print length; }'
278 | 90
279 | tr -d -c 'N' < Results/Sample_B/Sample_B_q30.fasta | awk '{ print length; }'
280 | 351
281 |
282 | As you can see, we only have 90 undefined bases with an already pretty decent filtering parameter. When filtering even more conservatively with ``-q 30``, you can see that we are losing more positions but still retain a reasonable amount of diagnostic positions. I leave it up to you to figure out at which threshold you lose more than you gain in the end.
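
If you want to scan several cutoffs at once, a small sketch (reusing the ``log2fasta`` and ``tr`` commands from above for `Sample_B`) could look like this:

.. code-block:: bash

    for Q in 20 30 40 50; do
        /projects1/tools/schmutzi/log2fasta -q $Q Results/Sample_B/Sample_B_final_endo.log > Results/Sample_B/Sample_B_q$Q.fasta
        printf "q=%s undefined positions: " $Q
        # count the number of 'N' characters in the consensus
        tr -d -c 'N' < Results/Sample_B/Sample_B_q$Q.fasta | wc -c
    done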
283 |
284 |
285 |
286 | Unfiltered Consensus Calling
287 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
288 |
289 | For modern samples we can instead use the ``endoCaller`` application that comes with schmutzi, as we don't need to run contamination checks on those. This can be done using:
290 |
291 | .. code-block:: bash
292 |
293 | /projects1/tools/schmutzi/endoCaller -seq youroutput.fasta -log outputlog.log reference.fasta input.bam
294 |
295 | This will produce a consensus call which is **unfiltered**. To test what kind of difference this makes, you may for example try running this method on one of our ancient samples and compare the output to a filtered output FastA directly. You will observe that, especially in lower-coverage situations, the ``endoCaller`` incorporates SNPs based on e.g. a coverage of 1 or on low-quality regions, whereas the filtering approach defined in *Filtered Consensus Calling* masks such positions with 'N'.
296 |
--------------------------------------------------------------------------------
/contents/03_sexdet/sexdet.rst:
--------------------------------------------------------------------------------
1 | Sex Determination and X chromosome contamination estimate
2 | =========================================================
3 |
4 | One way to estimate nuclear contamination works only in male samples. It is based on the fact that males have only one copy of the X chromosome, so any detectable heterozygosity on the X chromosome in males suggests a certain amount of contamination. Note that the sex of the contaminating individual is not important here, as both male and female contamination would contribute at least one more X chromosome, which would show up in this contamination test.
5 |
6 | We will proceed in two steps. First, we need to determine the molecular sex of each sample. Second, we will run ANGSD for contamination estimation on all male samples.
7 |
8 | Sex determination
9 | -----------------
10 |
11 | On the cluster
12 | ^^^^^^^^^^^^^^
13 |
14 | The basic idea here is to compare genomic coverage on the X chromosome with the genomic coverage on autosomal chromosomes. Since males have only one copy of the X, coverage on X should be half the value on autosomes, while in females it should be roughly the same. Similarly, since males have one copy of the Y, coverage on Y should be half the autosomal value in males, and it should be close to zero in females.
15 |
16 | The first step is to run ``samtools depth`` on the bam files. To get an idea on what this tool does, first run it and look at the first 10 lines only:
17 |
18 | .. code-block:: bash
19 |
20 | samtools depth -q30 -Q37 -a -b $SNP_POS $BAM_FILE | head
21 |
22 | Generally, when I write ``$SNP_POS`` or ``$BAM_FILE``, you need to replace those variables with actual file names. Here, ``-b $SNP_POS`` inputs a text file that contains the positions in the capture experiment. We have prepared various SNP positions files for you, for both 390K and 1240K capture panels. You can find them in ``/projects1/Reference_Genomes/Human/hs37d5/SNPCapBEDs`` and ``/projects1/Reference_Genomes/Human/HG19/SNPCapBEDs``, depending on which reference genome you have used to map the genome you are working with, and which SNP panel was used (e.g. 1240k). The flags ``-q30 -Q37`` are filters on base- and mapping quality, and the ``-a`` flag forces output of every site, even those with coverage 0 (which is important for counting sites). Finally, ``$BAM_FILE`` denotes the - wait for it - bam file.
23 |
24 | The output of this command should look like this: ::
25 |
26 | 1 752567 0
27 | 1 776546 1
28 | 1 832918 0
29 | 1 842013 0
30 | 1 846864 0
31 | 1 869303 0
32 | 1 891021 5
33 | 1 896271 0
34 | 1 903427 0
35 | 1 912660 1
36 |
37 | (Use ``Ctrl-C`` to stop the command if it stalls.) The columns denote chromosome, position and the number of reads covering that site. We now need to write a little script that counts those read numbers for us, distinguishing autosomes, X chromosome and Y chromosome. Here is my version of this in ``awk``:
38 |
39 | .. code-block:: awk
40 |
41 | BEGIN {
42 | xReads = 0
43 | yReads = 0
44 | autReads = 0
45 |
46 | xSites = 0
47 | ySites = 0
48 | autSites = 0
49 | }
50 | {
51 | chr = $1
52 | pos = $2
53 | cov = $3
54 |     if(chr == "X") {
55 | xReads += cov
56 | xSites += 1
57 | }
58 |     else if(chr == "Y") {
59 | yReads += cov
60 | ySites += 1
61 | }
62 | else {
63 | autReads += cov
64 | autSites += 1
65 | }
66 | }
67 | END {
68 | OFS="\t"
69 | print("xCoverage", xSites > 0 ? xReads / xSites : 0)
70 | print("yCoverage", ySites > 0 ? yReads / ySites : 0)
71 | print("autCoverage", autSites > 0 ? autReads / autSites : 0)
72 | }
73 |
74 | ``awk`` is a very useful UNIX utility that is perfect for doing simple counting or other statistics on file streams. You can learn awk yourself if you want, but for now the only important parts are the code lines which read
75 |
76 | .. code-block:: awk
77 |
78 | if(chr == "X") {
79 | ...
80 | }
81 | else if(chr == "Y") {
82 | ...
83 | }
84 | else {
85 | ...
86 | }
87 |
88 | As you can see, these lines check whether the chromosome is X or Y or neither of them (autosomes). Here you need to make sure that the names of the chromosomes are the same as in the reference that you used to align the sequences. You can quickly check that from the output of the ``samtools depth`` command above. If the first column looks like ``chr1`` or ``chr2`` instead of ``1`` or ``2``, then you need to change the awk script lines above to:
89 |
90 | .. code-block:: awk
91 |
92 | if(chr == "chrX") {
93 | ...
94 | }
95 | else if(chr == "chrY") {
96 | ...
97 | }
98 | else {
99 | ...
100 | }
101 |
102 | Makes sense, right? OK, so now that you have your little awk script with the correct chromosome names to count sites, you can pipe your samtools command into it:
103 |
104 | .. code-block:: bash
105 |
106 | samtools depth -q30 -Q37 -a -b $SNP_POS $BAM_FILE | head -1000 | awk -f sexDetermination.awk
107 |
108 | where I assume that the ``awk``-code above is copied into a file called ``sexDetermination.awk`` in the current directory. Here, I am only piping the first 1000 lines into the awk script to see whether it works. The output should look like: ::
109 |
110 | xCoverage 0
111 | yCoverage 0
112 | autCoverage 2.19565
113 |
114 | OK, so here we did not see any X- or Y-coverage, simply because the first 1000 lines of the ``samtools depth`` command only output chromosome 1. But at least you now know that it works, and you can now prepare the main run over all samples. For that we need to write a shell script that loops over all samples and submits the samtools-awk pipeline to SLURM. Open an empty file in an editor and save it as ``runSexDetermination.sh`` or something like it. In my particular project, that file looks like this:
115 |
116 | .. code-block:: bash
117 |
118 | #!/usr/bin/env bash
119 |
120 | BAMDIR=/data/schiffels/MyProject/mergedBams.backup
121 | SNP_POS=/projects1/Reference_Genomes/Human/hs37d5/SNPCapBEDs/1240KPosGrch37.bed
122 | AWK_SCRIPT=~/dev/GAworkshop/sexDetermination.awk
123 | OUTDIR=/data/schiffels/GAworkshop
124 |
125 | for SAMPLE in $(ls $BAMDIR); do
126 | BAM=$BAMDIR/$SAMPLE/$SAMPLE.mapped.sorted.rmdup.bam
127 | OUT=$OUTDIR/$SAMPLE.sexDetermination.txt
128 | CMD="samtools depth -q30 -Q37 -a -b $SNP_POS $BAM | awk -f $AWK_SCRIPT > $OUT"
129 | echo "$CMD"
130 | # sbatch -c 2 -o $OUTDIR/$SAMPLE.sexDetermination.log --wrap="$CMD"
131 | done
132 |
133 | Here, I am merely printing all commands to first check them all and convince myself that they "look" alright. To execute this script, make it executable via ``chmod u+x runSexDetermination.sh``, and run it via ``./runSexDetermination.sh``.
134 |
135 | Indeed, the output looks like this:
136 |
137 | .. code-block:: bash
138 |
139 | samtools depth -q30 -Q37 -a -b /projects1/Reference_Genomes/Human/hs37d5/SNPCapBEDs/1240KPosGrch37.bed /data/schiffels/MyProject/mergedBams.backup/JK2128udg/JK2128udg.mapped.sorted.rmdup.bam | awk -f /home/adminschif/dev/GAworkshop/sexDetermination.awk > /data/schiffels/GAworkshop/JK2128udg.sexDetermination.txt
140 | samtools depth -q30 -Q37 -a -b /projects1/Reference_Genomes/Human/hs37d5/SNPCapBEDs/1240KPosGrch37.bed /data/schiffels/MyProject/mergedBams.backup/JK2131udg/JK2131udg.mapped.sorted.rmdup.bam | awk -f /home/adminschif/dev/GAworkshop/sexDetermination.awk > /data/schiffels/GAworkshop/JK2131udg.sexDetermination.txt
141 | samtools depth -q30 -Q37 -a -b /projects1/Reference_Genomes/Human/hs37d5/SNPCapBEDs/1240KPosGrch37.bed /data/schiffels/MyProject/mergedBams.backup/JK2132udg/JK2132udg.mapped.sorted.rmdup.bam | awk -f /home/adminschif/dev/GAworkshop/sexDetermination.awk > /data/schiffels/GAworkshop/JK2132udg.sexDetermination.txt
142 | ...
143 |
144 | which looks correct. So I now put a comment (``#``) in front of the ``echo``, remove the comment from the ``sbatch``, and run the script again. Sure enough, the terminal tells me that 40 jobs have been submitted, and with ``squeue`` I can convince myself that they are actually running. After a few minutes, the jobs should be finished, and you can look into your output directory to see all the result files. You should check that the result files are not empty, for example by listing the results folder via ``ls -lh`` and looking at the fifth column, which displays the size of the files in bytes. It should be larger than zero for all output files (and zero for the log files, because there was no log output): ::
145 |
146 | adminschif@cdag1 /data/schiffels/GAworkshop $ ls -lh
147 | total 160K
148 | -rw-rw-r-- 1 adminschif adminschif 0 May 4 10:16 JK2128udg.sexDetermination.log
149 | -rw-rw-r-- 1 adminschif adminschif 56 May 4 10:20 JK2128udg.sexDetermination.txt
150 | -rw-rw-r-- 1 adminschif adminschif 0 May 4 10:16 JK2131udg.sexDetermination.log
151 | -rw-rw-r-- 1 adminschif adminschif 56 May 4 10:20 JK2131udg.sexDetermination.txt
152 | -rw-rw-r-- 1 adminschif adminschif 0 May 4 10:16 JK2132udg.sexDetermination.log
153 | -rw-rw-r-- 1 adminschif adminschif 56 May 4 10:20 JK2132udg.sexDetermination.txt
154 | ...
155 |
156 | On your laptop
157 | ^^^^^^^^^^^^^^
158 |
159 | OK, so now we have to transfer those ``*.txt`` files over to our laptop. Open a terminal on your laptop, create a folder and `cd` into that folder. In my case, I can then transfer the files via
160 |
161 | .. code-block:: bash
162 |
163 | scp adminschif@cdag1.cdag.shh.mpg.de:/data/schiffels/GAworkshop/*.sexDetermination.txt .
164 |
165 | (Don't forget the final dot, it determines the target directory which is the current directory.)
166 |
167 | We now want to prepare a table to load into Excel with four columns: Sample, xCoverage, yCoverage, autCoverage. For that we again have to write a little shell script, which in my case looks like this:
168 |
169 | .. code-block:: bash
170 |
171 | #!/usr/bin/env bash
172 |
173 | printf "Sample\txCov\tyCov\tautCov\n"
174 |
175 | for FILENAME in $(ls ~/Data/GAworkshop/*.sexDetermination.txt); do
176 | SAMPLE=$(basename $FILENAME .sexDetermination.txt)
177 | XCOV=$(grep xCoverage $FILENAME | cut -f2)
178 | YCOV=$(grep yCoverage $FILENAME | cut -f2)
179 | AUTCOV=$(grep autCoverage $FILENAME | cut -f2)
180 | printf "$SAMPLE\t$XCOV\t$YCOV\t$AUTCOV\n"
181 | done
182 |
183 | Make your script executable using ``chmod`` as shown above, and run it. The result looks in my case like this: ::
184 |
185 | schiffels@damp132140 ~/dev/GAworkshopScripts $ ./printSexDeterminationTable.sh
186 | Sample xCov yCov autCov
187 | JK2128udg 1.20947 1.17761 1.25911
188 | JK2131udg 1.31687 1.41748 1.44766
189 | ...
190 |
191 | OK, so now we need to load this into Excel. On a Mac, you can make use of a nifty little utility called ``pbcopy``, which allows you to pipe text from a command directly into the computer's clipboard: ``./printSexDeterminationTable.sh | pbcopy`` does the job. You can then open Excel and use ``CMD-V`` to paste the table in. On Windows or Linux, you should pipe the output of the script into a file, e.g. ``./printSexDeterminationTable.sh > table.txt``, and load ``table.txt`` into Excel.
192 |
193 | Finally, use Excel to form the ratios xCov/autCov and yCov/autCov, i.e. the relative coverage of the X
194 | and Y chromosomes compared to autosomes. You could now for example plot those two numbers as a
195 | 2D scatter plot in Excel and check whether you see two clusters corresponding to males and females.
196 | An example, taken from a recent paper (Fu et al. 2016 "The genetic history of Ice Age Europe"),
197 | looks like this:
198 |
199 | .. image:: sexDetExample.png
200 |
201 | As you can see, in this case the relative Y chromosome coverage provides a much better separation of samples into (presumably) male and female, so here the authors used a relative y coverage of >0.2 to determine males, and <0.05 to determine females. Often, unfortunately, clustering is much less pronounced, and you will have to manually decide how to flag samples as "male", "female" or "unknown".
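
If you prefer the command line over Excel for this step, here is a small sketch (it assumes the ``printSexDeterminationTable.sh`` script from above and uses the thresholds from the example purely as an illustration) that computes the two ratios and a tentative classification:

.. code-block:: bash

    ./printSexDeterminationTable.sh | awk 'NR > 1 && $4 > 0 {
        xRate = $2 / $4
        yRate = $3 / $4
        sex = "unknown"
        if (yRate > 0.2) sex = "male"
        else if (yRate < 0.05) sex = "female"
        print $1, xRate, yRate, sex
    }'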
202 |
203 | Nuclear contamination estimates in Males
204 | ----------------------------------------
205 |
206 | Now that we have classified at least some samples as "probably male", we can use their haploid X chromosome to estimate nuclear contamination. For this, we use the ANGSD-software. According to the `ANGSD-Documentation `_, estimating X chromosome contamination from BAM files involves two steps.
207 |
208 | The first step counts how often each of the four alleles is seen in variable sites in the X chromosome of a sample:
209 |
210 | .. code-block:: bash
211 |
212 | angsd -i $BAM -r X:5000000-154900000 -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 30 -out $OUT
213 |
214 | Here, I assume that the X chromosome is called ``X``. If in your bam file it's called ``chrX``, you need to adapt the region specification in the ``-r`` flag above. Note that the range 5Mb-154Mb is the one used in the example on the ANGSD website, so I just copied it here. The ``$OUT`` variable above actually denotes a filename prefix, since this command produces several output files, each of which attaches a different file ending to the given prefix.
215 |
216 | To loop this command again over all samples, write a shell script as shown above, check the correct commands via an ``echo`` command and if they are correct, submit them using ``sbatch``. My script looks like this:
217 |
218 | .. code-block:: bash
219 |
220 | #!/usr/bin/env bash
221 |
222 | BAMDIR=/data/schiffels/MyProject/mergedBams.backup
223 | OUTDIR=/data/schiffels/GAworkshop/xContamination
224 | mkdir -p $OUTDIR
225 |
226 | for SAMPLE in $(ls $BAMDIR); do
227 | BAM=$BAMDIR/$SAMPLE/$SAMPLE.mapped.sorted.rmdup.bam
228 | OUT=$OUTDIR/$SAMPLE.angsdCounts
229 | CMD="angsd -i $BAM -r X:5000000-154900000 -doCounts 1 -iCounts 1 -minMapQ 30 -minQ 30 -out $OUT"
230 | echo "$CMD"
231 | # sbatch -o $OUTDIR/$SAMPLE.angsdCounts.log --wrap="$CMD"
232 | done
233 |
234 | This should run very fast. Check whether the output folder is populated with non-empty files. You cannot look at them easily because they are binary files.
235 |
236 | The second step in ANGSD is the actual contamination estimation. Here is the command line recommended in the documentation:
237 |
238 | .. code-block:: bash
239 |
240 | /projects1/tools/angsd_0.910/misc/contamination -a $PREFIX.icnts.gz \
241 | -h /projects1/tools/angsd_0.910/RES/HapMapChrX.gz 2> $OUT
242 |
243 | Here, the executable is given with the full path because it is somewhat hidden. The ``$PREFIX`` variable should be replaced by the output-file prefix given in the previous (allele counting) command for the same sample. The HapMap file is provided by ANGSD and contains global allele frequency estimates used for the contamination calculation. Note that here we are not redirecting standard output into the output file ``$OUT``, but standard error, indicated in bash via the special redirection ``2>``. The reason is that this ANGSD program writes its results to standard error rather than standard output.
244 |
245 | Again, you have to loop this through all samples like this:
246 |
247 | .. code-block:: bash
248 |
249 | #!/usr/bin/env bash
250 |
251 | BAMDIR=/data/schiffels/MyProject/mergedBams.backup
252 | OUTDIR=/data/schiffels/GAworkshop/xContamination
253 | mkdir -p $OUTDIR
254 |
255 | for SAMPLE in $(ls $BAMDIR); do
256 | PREFIX=$OUTDIR/$SAMPLE.angsdCounts
257 | OUT=$OUTDIR/$SAMPLE.xContamination.out
258 | HAPMAP=/projects1/tools/angsd_0.910/RES/HapMapChrX.gz
259 | CMD="/projects1/tools/angsd_0.910/misc/contamination -a $PREFIX.icnts.gz -h $HAPMAP 2> $OUT"
260 | echo "$CMD"
261 | # sbatch --mem=2000 -o $OUTDIR/$SAMPLE.xContamination.log --wrap="$CMD"
262 | done
263 |
264 |
265 | If this worked correctly, you should now have a contamination estimate for each sample. For a single sample, the output looks a bit messy, but the last line should read: ::
266 |
267 | Method2: new_llh Version: MoM:0.072969 SE(MoM):5.964563e-02 ML:0.079651 SE(ML):7.892058e-16
268 |
269 | This is the line indicating the contamination estimate using the "Method of Moments" (MoM), and its standard error SE(MoM). You can grep all those lines: ::
270 |
271 | adminschif@cdag1 /data/schiffels/GAworkshop/xContamination $ grep 'Method2: new_llh' *.out
272 | JK2131udg.xContamination.out:Method2: new_llh Version: MoM:0.285843 SE(MoM):3.993658e-02 ML:0.281400 SE(ML):4.625781e-14
273 | JK2132udg.xContamination.out:Method2: new_llh Version: MoM:0.133319 SE(MoM):9.339797e-02 ML:0.140492 SE(ML):0.000000e+00
274 | JK2133udg.xContamination.out:Method2: new_llh Version: MoM:0.159191 SE(MoM):4.549252e-02 ML:0.160279 SE(ML):8.657070e-15
275 | JK2134udg.xContamination.out:Method2: new_llh Version: MoM:-0.008918 SE(MoM):4.884321e-03 ML:-0.003724 SE(ML):9.784382e-17
276 | ...
277 |
278 | You now want to include those results into your Excel table with the sex determination estimates. Copy them over to your laptop like shown above, in my case:
279 |
280 | .. code-block:: bash
281 |
282 | mkdir -p ~/Data/GAworkshop/contamination
283 | scp adminschif@cdag1.cdag.shh.mpg.de:/data/schiffels/GAworkshop/xContamination/*.xContamination.out ~/Data/GAworkshop/contamination/
284 |
285 | and you can now generate a simpler output using a little bash script like this:
286 |
287 | .. code-block:: bash
288 |
289 | #!/usr/bin/env bash
290 |
291 | printf "SAMPLE\tCONTAM\tSE\n"
292 | for FILENAME in $(ls ~/Data/GAworkshop/contamination/*.xContamination.out); do
293 | SAMPLE=$(basename $FILENAME .xContamination.out)
294 | CONTAM=$(grep 'Method2: new_llh' $FILENAME | cut -d' ' -f4 | cut -d: -f2)
295 | SE=$(grep 'Method2: new_llh' $FILENAME | cut -d' ' -f5 | cut -d: -f2)
296 | printf "$SAMPLE\t$CONTAM\t$SE\n"
297 | done
298 |
299 | If you run this, you may find that in some cases the output is empty, because angsd failed. You should then go back and check - for those samples - the `*.log` output from the contamination run above to see what the reason for the failure was. In some cases, SLURM killed the job because it exceeded memory. You should then increase the memory set in the ``--mem`` flag in `sbatch`. In other cases, angsd failed for unknown reasons... nothing we can do about that currently.
300 |
301 | Finally, you can use this table, feed it into Excel and find male samples with low contamination to proceed with in the analysis.
302 |
--------------------------------------------------------------------------------