├── chapter7 ├── cache │ ├── __packages │ ├── entropy_0768bb85c32cc280dca5323ae92ed8b1.rdb │ ├── entropy_0768bb85c32cc280dca5323ae92ed8b1.rdx │ ├── entropy_0768bb85c32cc280dca5323ae92ed8b1.RData │ ├── binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdb │ ├── binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdx │ └── binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.RData ├── chapter7.pdf ├── figure │ ├── entropy-1.pdf │ └── binaryEntropy-1.pdf ├── chapter7.Rnw ├── chapter7_forInclude.Rnw ├── chapter7_forInclude.tex └── chapter7.tex ├── chapter6 ├── cache │ ├── __packages │ ├── binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdb │ ├── binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdx │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdb │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdx │ ├── binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.RData │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.RData │ ├── binomPosteriors_caabe6c3a8215386b680b66a783c3a55.RData │ ├── binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdb │ ├── binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdx │ ├── mixturePosterior_7b51293945933662ab25491259906a02.rdb │ ├── mixturePosterior_7b51293945933662ab25491259906a02.rdx │ └── mixturePosterior_7b51293945933662ab25491259906a02.RData ├── chapter6.pdf ├── makePlots.R ├── makePlots.R~ ├── chapter6.Rnw └── tikzlibrarybayesnet.code.tex ├── fullscript ├── cache │ ├── __packages │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdb │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdx │ └── computation_08c23b2f0871f6eae4e9010b10244f2a.RData ├── BasicProbabilityAndStatistics.pdf ├── BasicProbabilityAndStatistics.Rnw └── tikzlibrarybayesnet.code.tex ├── multivariateGaussian ├── cache │ ├── __packages │ ├── 3dgauss_9c4da507c5196242e80b280153e7c995.rdb │ ├── 3dgauss_9c4da507c5196242e80b280153e7c995.rdx │ ├── 3dgauss_9c4da507c5196242e80b280153e7c995.RData │ ├── multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdb │ ├── multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdx │ └── multiGauss_cc3916828b40ec191d5b5bdee9808c87.RData ├── figures │ ├── 3dgauss-1.pdf │ ├── uniGauss-1.pdf │ └── multiGauss-1.pdf ├── multivariateGaussian.pdf ├── multivariateGaussian.Rnw └── multivariateGaussian_forInclude.tex ├── chapter3 ├── cdf.png ├── chapter3.pdf ├── scaledRV.png ├── histogram.png ├── distribution.png ├── chapter3.tex └── makePlots.R ├── chapter1 ├── chapter1.pdf ├── chapter1.tex └── chapter1_forInclude.tex ├── chapter2 ├── chapter2.pdf ├── chapter2.tex └── chapter2_forInclude.tex ├── chapter4 ├── chapter4.pdf ├── chapter4.tex └── chapter4_forInclude.tex ├── chapter5 ├── chapter5.pdf ├── dense_likelihood.png ├── sparse_likelihood.png ├── chapter5.tex ├── makePlots.R └── BernoulliData.txt ├── additionalMaterial ├── sufficient-statistics.pdf └── sufficient-statistics.tex ├── README.md ├── contributors └── contributors.tex └── .gitignore /chapter7/cache/__packages: -------------------------------------------------------------------------------- 1 | base 2 | -------------------------------------------------------------------------------- /chapter6/cache/__packages: -------------------------------------------------------------------------------- 1 | base 2 | knitr 3 | -------------------------------------------------------------------------------- /fullscript/cache/__packages: -------------------------------------------------------------------------------- 1 | base 2 | knitr 3 | -------------------------------------------------------------------------------- /multivariateGaussian/cache/__packages: 
-------------------------------------------------------------------------------- 1 | base 2 | knitr 3 | mvtnorm 4 | -------------------------------------------------------------------------------- /chapter3/cdf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/cdf.png -------------------------------------------------------------------------------- /chapter1/chapter1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter1/chapter1.pdf -------------------------------------------------------------------------------- /chapter2/chapter2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter2/chapter2.pdf -------------------------------------------------------------------------------- /chapter3/chapter3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/chapter3.pdf -------------------------------------------------------------------------------- /chapter3/scaledRV.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/scaledRV.png -------------------------------------------------------------------------------- /chapter4/chapter4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter4/chapter4.pdf -------------------------------------------------------------------------------- /chapter5/chapter5.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter5/chapter5.pdf -------------------------------------------------------------------------------- /chapter6/chapter6.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/chapter6.pdf -------------------------------------------------------------------------------- /chapter7/chapter7.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/chapter7.pdf -------------------------------------------------------------------------------- /chapter3/histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/histogram.png -------------------------------------------------------------------------------- /chapter3/distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/distribution.png -------------------------------------------------------------------------------- /chapter5/dense_likelihood.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter5/dense_likelihood.png -------------------------------------------------------------------------------- 
/chapter7/figure/entropy-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/figure/entropy-1.pdf -------------------------------------------------------------------------------- /chapter5/sparse_likelihood.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter5/sparse_likelihood.png -------------------------------------------------------------------------------- /chapter7/figure/binaryEntropy-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/figure/binaryEntropy-1.pdf -------------------------------------------------------------------------------- /additionalMaterial/sufficient-statistics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/additionalMaterial/sufficient-statistics.pdf -------------------------------------------------------------------------------- /fullscript/BasicProbabilityAndStatistics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/BasicProbabilityAndStatistics.pdf -------------------------------------------------------------------------------- /multivariateGaussian/figures/3dgauss-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/figures/3dgauss-1.pdf -------------------------------------------------------------------------------- /multivariateGaussian/figures/uniGauss-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/figures/uniGauss-1.pdf -------------------------------------------------------------------------------- /multivariateGaussian/figures/multiGauss-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/figures/multiGauss-1.pdf -------------------------------------------------------------------------------- /multivariateGaussian/multivariateGaussian.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/multivariateGaussian.pdf -------------------------------------------------------------------------------- /chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdb -------------------------------------------------------------------------------- /chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdx -------------------------------------------------------------------------------- 
/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.RData -------------------------------------------------------------------------------- /chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdb -------------------------------------------------------------------------------- /chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdx -------------------------------------------------------------------------------- /chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb -------------------------------------------------------------------------------- /chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx -------------------------------------------------------------------------------- /chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.RData -------------------------------------------------------------------------------- /chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData -------------------------------------------------------------------------------- /chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdb -------------------------------------------------------------------------------- /chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdx -------------------------------------------------------------------------------- /fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb 
-------------------------------------------------------------------------------- /fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx -------------------------------------------------------------------------------- /chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.RData -------------------------------------------------------------------------------- /chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdb -------------------------------------------------------------------------------- /chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdx -------------------------------------------------------------------------------- /chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdb -------------------------------------------------------------------------------- /chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdx -------------------------------------------------------------------------------- /chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.RData -------------------------------------------------------------------------------- /fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData -------------------------------------------------------------------------------- /chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.RData -------------------------------------------------------------------------------- /multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdb: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdb -------------------------------------------------------------------------------- /multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdx -------------------------------------------------------------------------------- /multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.RData -------------------------------------------------------------------------------- /multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdb -------------------------------------------------------------------------------- /multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdx -------------------------------------------------------------------------------- /multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.RData -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LectureNotes 2 | Lecture Notes (with exercises) for the Basic Probability course at the University of Amsterdam 3 | 4 | written in Aug 2015 by Philip Schulz, ILLC, UvA 5 | minor editing by Christian Schaffner, ILLC, UvA 6 | 7 | -------------------------------------------------------------------------------- /chapter6/makePlots.R: -------------------------------------------------------------------------------- 1 | # Script for creating the plots of chapter 6 2 | # Author: Philip Schulz 3 | 4 | x = seq(0,1,0.001) 5 | entropy = -log2(x)*x-log2(1-x)*(1-x) 6 | 7 | png("binaryEntropy.png", width=8, height=8, units="in", res=300) 8 | plot(x,entropy,type="l", xlab=expression(Theta), ylab = "H(X)") 9 | dev.off() 10 | -------------------------------------------------------------------------------- /chapter6/makePlots.R~: -------------------------------------------------------------------------------- 1 | # Script for creating the plots of chapter 6 2 | # Author: Philip Schulz 3 | 4 | x = seq(0,1,0.001) 5 | entropy = -log2(x)*x-log2(1-x)*(1-x) 6 | 7 | png("binaryEntropy.png", width=8, height=8, units="in", res=300) 8 | plot(x,entropy,type="l", xlab=expression("Theta"), ylab = "H(X)") 9 | dev.off() 10 |
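A note on the entropy computation in the two chapter6 scripts above: in R, the expression -log2(x)*x - log2(1-x)*(1-x) evaluates to NaN at the endpoints x = 0 and x = 1 (log2(0) is -Inf, and multiplying by 0 gives NaN), so the plotted curve silently drops those two points. A minimal guarded sketch, adopting the usual convention 0 * log2(0) = 0; this is an illustration only, not a file in the repository, and the helper name binary_entropy is made up:

binary_entropy <- function(p) {
  # return 0 at p = 0 and p = 1 instead of NaN
  ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
binary_entropy(0.5)      # 1: the maximum of one bit is attained at p = 0.5
binary_entropy(c(0, 1))  # 0 0: deterministic outcomes carry no entropy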
-------------------------------------------------------------------------------- /chapter1/chapter1.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed} 4 | \usepackage{hyperref} 5 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 6 | 7 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 8 | \newmdtheoremenv{Definition}[Theorem]{Definition} 9 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 10 | 11 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 12 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 13 | 14 | 15 | \title{Basic Probability} 16 | \date{} 17 | 18 | \begin{document} 19 | 20 | \include{chapter1_forInclude} 21 | 22 | \end{document} -------------------------------------------------------------------------------- /chapter4/chapter4.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 10 | \newmdtheoremenv{Definition}[Theorem]{Definition} 11 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 12 | 13 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 14 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 15 | 16 | \newcommand{\supp}{\operatorname{supp}} 17 | \newcommand{\E}{\mathbb{E}} 18 | 19 | \title{Basic Probability} 20 | \date{} 21 | 22 | \begin{document} 23 | 24 | \include{chapter4_forInclude} 25 | 26 | \end{document} -------------------------------------------------------------------------------- /contributors/contributors.tex: -------------------------------------------------------------------------------- 1 | \section*{Contributors} 2 | While we strive to continuously update this script and keep it at an acceptable level of grammaticality and mathematical correctness, it is unavoidable that some 3 | mistakes creep in. We are therefore utterly grateful to our contributors who have helped improve the script and would like to acknowledge their contributions here. 4 | \begin{itemize} 5 | \item Philip Michgelsen has corrected a mistake in the definition of event spaces in chapter 1. 6 | \item Bas Cornelissen has spotted a mistake in the statement of Markov's inequality. 7 | \item Jonathan Sippel has spotted a mistake in our example calculation 8 | of binary entropy. 9 | \item Thijs Baaijen has spotted a typo in Formula~\eqref{weatherRV}.
10 | \item Julia Turska has spotted various typos throughout the lecture notes (see issue 33 in the GitHub repository). 11 | \end{itemize} -------------------------------------------------------------------------------- /chapter5/chapter5.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Definition}{Definition}[chapter] 10 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 11 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 12 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 13 | 14 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 15 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 16 | 17 | \newcommand{\supp}{\operatorname{supp}} 18 | \newcommand{\E}{\mathbb{E}} 19 | \newcommand{\eps}{\varepsilon} 20 | 21 | 22 | \title{Basic Probability} 23 | \date{} 24 | 25 | \begin{document} 26 | 27 | \setcounter{chapter}{4} 28 | \input{chapter5_forInclude} 29 | 30 | \end{document} -------------------------------------------------------------------------------- /chapter6/chapter6.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem, tikz, bbm} 4 | \usepackage{nicefrac} 5 | \usetikzlibrary{bayesnet} 6 | 7 | \usepackage{hyperref} 8 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 9 | 10 | \newmdtheoremenv{Definition}{Definition}[chapter] 11 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 12 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 13 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 14 | 15 | \newcommand{\supp}{\operatorname{supp}} 16 | \newcommand{\E}{\mathbb{E}} 17 | \newcommand{\eps}{\varepsilon} 18 | 19 | \newcommand{\id}[1]{\mathbbm{1}\left(#1\right)} 20 | 21 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 22 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 23 | 24 | \title{Basic Probability} 25 | \date{} 26 | 27 | <>= 28 | library(knitr) 29 | @ 30 | 31 | \begin{document} 32 | 33 | \setcounter{chapter}{5} 34 | <>= 35 | @ 36 | 37 | \end{document} -------------------------------------------------------------------------------- /chapter7/chapter7.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Definition}{Definition}[chapter] 10 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 11 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 12 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 13 | 14 | \newcommand{\supp}{\operatorname{supp}} 15 | \newcommand{\E}{\mathbb{E}} 16 | \newcommand{\eps}{\varepsilon} 17 | 18 | \DeclareSymbolFont{extraup}{U}{zavm}{m}{n} 19 | \DeclareMathSymbol{\varheart}{\mathalpha}{extraup}{86} 20 | \DeclareMathSymbol{\vardiamond}{\mathalpha}{extraup}{87} 21 | 22 | 23 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 24 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 25 | 26 | \title{Basic Probability} 27
| \date{} 28 | 29 | \begin{document} 30 | 31 | \setcounter{chapter}{6} 32 | <>= 33 | @ 34 | 35 | \end{document} 36 | -------------------------------------------------------------------------------- /multivariateGaussian/multivariateGaussian.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Definition}{Definition}[chapter] 10 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 11 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 12 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 13 | 14 | \newcommand{\supp}{\operatorname{supp}} 15 | \newcommand{\E}{\mathbb{E}} 16 | \newcommand{\eps}{\varepsilon} 17 | 18 | \newcommand{\N}[2]{\mathcal{N}\left( #1, #2 \right)} 19 | 20 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 21 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 22 | 23 | \title{Basic Probability} 24 | \date{} 25 | 26 | %% Load R packages 27 | <>= 28 | library(knitr) 29 | # for multivariate Gaussian 30 | library(mvtnorm) 31 | @ 32 | 33 | \begin{document} 34 | 35 | \setcounter{chapter}{5} 36 | <>= 37 | @ 38 | 39 | \end{document} -------------------------------------------------------------------------------- /chapter2/chapter2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption} 4 | % for Python code 5 | \usepackage[procnames]{listings} 6 | \definecolor{keywords}{RGB}{255,0,90} 7 | \definecolor{comments}{RGB}{0,0,113} 8 | \definecolor{red}{RGB}{160,0,0} 9 | \definecolor{green}{RGB}{0,150,0} 10 | 11 | \lstset{language=Python, 12 | basicstyle=\tt\small, 13 | keywordstyle=\color{keywords}, 14 | commentstyle=\color{comments}, 15 | stringstyle=\color{red}, 16 | showstringspaces=false, 17 | identifierstyle=\color{green}, 18 | procnamekeys={def,class}} 19 | % 20 | \usepackage{venndiagram} 21 | \usepackage{hyperref} 22 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 23 | 24 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 25 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 26 | 27 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 28 | \newmdtheoremenv{Definition}[Theorem]{Definition} 29 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 30 | 31 | 32 | \title{Basic Probability} 33 | \date{} 34 | 35 | \begin{document} 36 | 37 | \setcounter{chapter}{1} 38 | \include{chapter2_forInclude} 39 | 40 | \end{document} -------------------------------------------------------------------------------- /chapter3/chapter3.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption} 4 | \usepackage{nicefrac} 5 | \usepackage{graphicx} 6 | % for Python code 7 | \usepackage[procnames]{listings} 8 | \definecolor{keywords}{RGB}{255,0,90} 9 | \definecolor{comments}{RGB}{0,0,113} 10 | \definecolor{red}{RGB}{160,0,0} 11 | \definecolor{green}{RGB}{0,150,0} 12 | 13 | \lstset{language=Python, 14 | basicstyle=\tt\small, 15 | keywordstyle=\color{keywords}, 16 | commentstyle=\color{comments}, 17 | stringstyle=\color{red}, 18 | showstringspaces=false, 19 | 
identifierstyle=\color{green}, 20 | procnamekeys={def,class}} 21 | % 22 | \usepackage{venndiagram} 23 | \usepackage{hyperref} 24 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 25 | 26 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 27 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 28 | 29 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 30 | \newmdtheoremenv{Definition}[Theorem]{Definition} 31 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 32 | 33 | 34 | % \DeclareMathOperator{\supp}{supp} 35 | \newcommand{\supp}{\operatorname{supp}} 36 | \newcommand{\E}{\mathbb{E}} 37 | \newcommand{\var}{\operatorname{var}} 38 | 39 | 40 | \title{Basic Probability} 41 | \date{} 42 | 43 | \begin{document} 44 | 45 | \setcounter{chapter}{2} 46 | \include{chapter3_forInclude} 47 | 48 | \end{document} -------------------------------------------------------------------------------- /chapter3/makePlots.R: -------------------------------------------------------------------------------- 1 | # R script for creating plots in chapter 3 2 | # Run as "Rscript makePlots.R" 3 | # Author: Philip Schulz 4 | 5 | # create vectors and compute mean 6 | x = 1:8 7 | y = c(0.09, .21, .28, .23, .12, .04, .02, .01) 8 | mu = sum(x*y) 9 | 10 | # open stream to file 11 | png("distribution.png", width=8, height=8, units="in", res=300) 12 | 13 | # plot y against x 14 | plot(x,y,yaxp=c(0,0.35,7),xlab="Z",ylab="P(Z=z)", cex=1.5) 15 | # connect points and x-axis 16 | segments(x0=x, y0=rep(0,8), y1=y, lwd=5) 17 | # insert red lines 18 | abline(v=2,col="red",lwd=2) 19 | abline(v=5,col="red",lwd=2) 20 | # put arrow underneath x-axis to indicate mean 21 | arrows(mu,-0.03,mu,-.001,xpd=T) 22 | # close stream and save to file 23 | dev.off() 24 | 25 | # compute cdf 26 | z = cumsum(y) 27 | 28 | # open stream to file 29 | png("cdf.png", width=8, height=8, units="in", res=300) 30 | 31 | # plot z against x 32 | plot(x,z,ylab="F(z)",xlab="Z") 33 | # draw the horizontal steps of the cdf 34 | for (i in 1:(length(x)-1)) { lines(c(x[i],x[i]+1),c(z[i],z[i])) } 35 | # close stream and save to file 36 | dev.off() 37 | 38 | # add constant and scale X 39 | additive_constant = 3 40 | scale_factor = 2 41 | 42 | # calculate new expectation 43 | new_x = x*scale_factor+additive_constant 44 | new_mu = sum(new_x*y) 45 | 46 | all_x = c(x,new_x) 47 | all_y = c(y,y) 48 | 49 | png("scaledRV.png", width=8, height=8, units="in", res=300) 50 | 51 | plot(all_x, all_y, xlab="Z/X", ylab="P(Z=z)/P(X=x)") 52 | segments(x0=x, y0=rep(0,8), y1=y, col="blue") 53 | segments(x0=new_x, y0=rep(0,8), y1=y, col="red") 54 | arrows(mu,-0.03,mu,-.001,xpd=T, col="blue") 55 | arrows(new_mu,-0.03,new_mu,-.001,xpd=T, col="red") 56 | dev.off() -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Core latex/pdflatex auxiliary files: 2 | *.aux 3 | *.lof 4 | *.log 5 | *.lot 6 | *.fls 7 | *.out 8 | *.toc 9 | 10 | ## Intermediate documents: 11 | *.dvi 12 | *-converted-to.* 13 | # these rules might exclude image files for figures etc.
14 | # *.ps 15 | # *.eps 16 | # *.pdf 17 | 18 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 19 | *.bbl 20 | *.bcf 21 | *.blg 22 | *-blx.aux 23 | *-blx.bib 24 | *.brf 25 | *.run.xml 26 | 27 | ## Build tool auxiliary files: 28 | *.fdb_latexmk 29 | *.synctex 30 | *.synctex.gz 31 | *.synctex.gz(busy) 32 | *.pdfsync 33 | 34 | ## Auxiliary and intermediate files from other packages: 35 | 36 | 37 | # algorithms 38 | *.alg 39 | *.loa 40 | 41 | # achemso 42 | acs-*.bib 43 | 44 | # amsthm 45 | *.thm 46 | 47 | # beamer 48 | *.nav 49 | *.snm 50 | *.vrb 51 | 52 | #(e)ledmac/(e)ledpar 53 | *.end 54 | *.[1-9] 55 | *.[1-9][0-9] 56 | *.[1-9][0-9][0-9] 57 | *.[1-9]R 58 | *.[1-9][0-9]R 59 | *.[1-9][0-9][0-9]R 60 | *.eledsec[1-9] 61 | *.eledsec[1-9]R 62 | *.eledsec[1-9][0-9] 63 | *.eledsec[1-9][0-9]R 64 | *.eledsec[1-9][0-9][0-9] 65 | *.eledsec[1-9][0-9][0-9]R 66 | 67 | # glossaries 68 | *.acn 69 | *.acr 70 | *.glg 71 | *.glo 72 | *.gls 73 | 74 | # gnuplottex 75 | *-gnuplottex-* 76 | 77 | # hyperref 78 | *.brf 79 | 80 | # knitr 81 | *-concordance.tex 82 | *.tikz 83 | *-tikzDictionary 84 | 85 | # listings 86 | *.lol 87 | 88 | # makeidx 89 | *.idx 90 | *.ilg 91 | *.ind 92 | *.ist 93 | 94 | # minitoc 95 | *.maf 96 | *.mtc 97 | *.mtc[0-9] 98 | *.mtc[1-9][0-9] 99 | 100 | # minted 101 | _minted* 102 | *.pyg 103 | 104 | # morewrites 105 | *.mw 106 | 107 | # mylatexformat 108 | *.fmt 109 | 110 | # nomencl 111 | *.nlo 112 | 113 | # sagetex 114 | *.sagetex.sage 115 | *.sagetex.py 116 | *.sagetex.scmd 117 | 118 | # sympy 119 | *.sout 120 | *.sympy 121 | sympy-plots-for-*.tex/ 122 | 123 | # TikZ & PGF 124 | *.dpth 125 | *.md5 126 | *.auxlock 127 | 128 | # todonotes 129 | *.tdo 130 | 131 | # xindy 132 | *.xdy 133 | 134 | # WinEdt 135 | *.bak 136 | *.sav 137 | 138 | chapter4/chapter4.synctex.gz 139 | 140 | chapter5/chapter5.synctex_conflict-20150923-103239.gz 141 | 142 | chapter3/chapter3.rel 143 | 144 | chapter5/chapter5.synctex_conflict-20150923-151940.gz 145 | 146 | **/cache/ 147 | chapter7/figure/binaryEntropy-1.pdf 148 | 149 | chapter7/figure/binaryEntropy-1.pdf 150 | 151 | fullscript/figure/binaryEntropy-1.pdf 152 | 153 | .DS_Store 154 | .texpadtmp 155 | -------------------------------------------------------------------------------- /fullscript/BasicProbabilityAndStatistics.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[11pt,leqno,a4paper]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem, tikz, bbm} 4 | \usepackage{nicefrac} 5 | \usetikzlibrary{bayesnet} 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | % for Python code 9 | \usepackage[procnames]{listings} 10 | \definecolor{keywords}{RGB}{255,0,90} 11 | \definecolor{comments}{RGB}{0,0,113} 12 | \definecolor{red}{RGB}{160,0,0} 13 | \definecolor{green}{RGB}{0,150,0} 14 | 15 | \lstset{language=Python, 16 | basicstyle=\tt\small, 17 | keywordstyle=\color{keywords}, 18 | commentstyle=\color{comments}, 19 | stringstyle=\color{red}, 20 | showstringspaces=false, 21 | identifierstyle=\color{green}, 22 | procnamekeys={def,class}} 23 | % 24 | \usepackage{venndiagram} 25 | 26 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 27 | \newmdtheoremenv{Definition}[Theorem]{Definition} 28 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 29 | \newmdtheoremenv{Lemma}[Theorem]{Lemma} 30 | 31 | \newcommand{\supp}{\operatorname{supp}} 32 | \newcommand{\E}{\mathbb{E}} 33 | \newcommand{\var}{\operatorname{var}} 34 | 
\newcommand{\eps}{\varepsilon} 35 | \newcommand{\id}[1]{\mathbbm{1}\left(#1\right)} 36 | 37 | \DeclareSymbolFont{extraup}{U}{zavm}{m}{n} 38 | \DeclareMathSymbol{\varheart}{\mathalpha}{extraup}{86} 39 | \DeclareMathSymbol{\vardiamond}{\mathalpha}{extraup}{87} 40 | 41 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 42 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 43 | 44 | 45 | \author{Philip Schulz \\ Christian Schaffner} 46 | \title{Basic Probability and Statistics} 47 | \date{last modified: \today} 48 | 49 | \begin{document} 50 | 51 | <>= 52 | library(knitr) 53 | @ 54 | 55 | \begin{titlepage} 56 | \maketitle 57 | \end{titlepage} 58 | 59 | \pagenumbering{roman} 60 | \tableofcontents 61 | \graphicspath{{../chapter3/}{../chapter5/}{../chapter6/}} 62 | 63 | % insert preface 64 | \newpage 65 | \input{../contributors/contributors} 66 | \clearpage 67 | \setcounter{page}{1} 68 | \pagenumbering{arabic} 69 | \input{../chapter1/chapter1_forInclude} 70 | \input{../chapter2/chapter2_forInclude} 71 | \input{../chapter3/chapter3_forInclude} 72 | \input{../chapter4/chapter4_forInclude} 73 | \input{../chapter5/chapter5_forInclude} 74 | <>= 75 | @ 76 | <>= 77 | @ 78 | 79 | 80 | \end{document} 81 | -------------------------------------------------------------------------------- /chapter5/makePlots.R: -------------------------------------------------------------------------------- 1 | # Script to generate likelihood plots for chapter 5. 2 | # Plots are based on samples of bit-sequences of length 10 3 | # Author: Philip Schulz 4 | 5 | sequence_length = 10 6 | sparse_samples = 2 7 | dense_samples = 50 8 | theta = 0.7 9 | 10 | dense_data = rbinom(dense_samples, sequence_length, theta) 11 | sparse_data1 = rbinom(sparse_samples, sequence_length, theta) 12 | sparse_data2 = rbinom(sparse_samples, sequence_length, theta) 13 | sparse_data3 = rbinom(sparse_samples, sequence_length, theta) 14 | 15 | dense_likelihood = double() 16 | sparse_likelihood1 = double() 17 | sparse_likelihood2 = double() 18 | sparse_likelihood3 = double() 19 | 20 | params = seq(0,1,0.001) 21 | 22 | for (param in params) { dense_likelihood = c(dense_likelihood, prod(dbinom(dense_data, sequence_length, param))) } 23 | for (param in params) { sparse_likelihood1 = c(sparse_likelihood1, prod(dbinom(sparse_data1, sequence_length, param))) } 24 | for (param in params) { sparse_likelihood2 = c(sparse_likelihood2, prod(dbinom(sparse_data2, sequence_length, param))) } 25 | for (param in params) { sparse_likelihood3= c(sparse_likelihood3, prod(dbinom(sparse_data3, sequence_length, param))) } 26 | 27 | dense_mode = max(dense_likelihood) 28 | sparse_mode1 = max(sparse_likelihood1) 29 | sparse_mode2 = max(sparse_likelihood2) 30 | sparse_mode3 = max(sparse_likelihood3) 31 | highest_mode = max(c(sparse_mode1, sparse_mode2, sparse_mode3)) 32 | 33 | dense_mode_idx = match(dense_mode, dense_likelihood)/length(params) 34 | sparse_mode1_idx = match(sparse_mode1, sparse_likelihood1)/length(params) 35 | sparse_mode2_idx = match(sparse_mode2, sparse_likelihood2)/length(params) 36 | sparse_mode3_idx = match(sparse_mode3, sparse_likelihood3)/length(params) 37 | 38 | png("sparse_likelihood.png", width=8, height=8, units="in", res=300) 39 | plot(params, sparse_likelihood1, xlab=expression(Theta), ylab="Likelihood", type ="l", col="blue", ylim = c(0,highest_mode)) 40 | axis(1,at = seq(0,10,0.1)) 41 | lines(params, sparse_likelihood2, col="green") 42 | lines(params, sparse_likelihood3, col="red") 43 | segments(x0=sparse_mode1_idx, y0=0, 
x1=sparse_mode1_idx, sparse_mode1) 44 | segments(x0=sparse_mode2_idx, y0=0, x1=sparse_mode2_idx, sparse_mode2) 45 | segments(x0=sparse_mode3_idx, y0=0, x1=sparse_mode3_idx, sparse_mode3) 46 | dev.off() 47 | 48 | png("dense_likelihood.png", width=8, height=8, units="in", res=300) 49 | plot(params, dense_likelihood, xlab=expression(Theta), ylab="Likelihood", type ="l", col="red") 50 | axis(1,at = seq(0,10,0.1)) 51 | segments(x0=dense_mode_idx, y0=0, x1=dense_mode_idx, dense_mode) 52 | dev.off() 53 | 54 | -------------------------------------------------------------------------------- /chapter5/BernoulliData.txt: -------------------------------------------------------------------------------- 1 | 86 94 85 81 88 2 | 80 82 84 89 84 3 | 82 81 85 84 80 4 | 87 86 86 88 87 5 | 87 88 80 83 80 6 | 79 83 92 81 88 7 | 86 78 80 82 82 8 | 84 85 87 90 82 9 | 82 75 81 83 86 10 | 86 83 71 85 84 11 | 87 82 79 84 87 12 | 79 83 85 82 87 13 | 82 85 90 85 86 14 | 83 84 87 82 84 15 | 84 83 90 84 84 16 | 85 82 87 75 85 17 | 92 87 83 87 82 18 | 80 86 84 89 88 19 | 90 83 79 84 78 20 | 82 84 81 89 84 21 | 84 86 80 84 82 22 | 87 86 85 81 88 23 | 81 82 85 81 79 24 | 85 83 88 86 90 25 | 81 83 77 77 90 26 | 86 90 87 84 83 27 | 86 79 88 79 86 28 | 88 82 74 83 77 29 | 79 85 84 78 90 30 | 83 85 87 80 78 31 | 87 82 86 81 90 32 | 85 89 84 85 81 33 | 87 85 82 86 87 34 | 79 86 86 79 82 35 | 89 88 82 86 84 36 | 73 83 84 86 82 37 | 83 81 80 81 78 38 | 85 79 86 76 77 39 | 82 83 82 81 88 40 | 83 81 79 84 80 41 | 86 81 84 90 77 42 | 84 87 88 85 81 43 | 86 86 87 80 84 44 | 86 84 90 75 82 45 | 82 83 84 84 88 46 | 80 79 87 82 82 47 | 82 87 80 80 84 48 | 79 82 79 80 87 49 | 83 83 77 86 84 50 | 83 85 83 91 92 51 | 85 87 88 88 88 52 | 87 75 84 79 80 53 | 80 87 86 89 85 54 | 79 84 75 90 87 55 | 86 83 86 86 81 56 | 87 79 88 87 88 57 | 87 84 91 80 81 58 | 85 83 81 84 83 59 | 84 83 81 87 80 60 | 87 86 90 89 84 61 | 86 85 85 83 85 62 | 84 84 91 88 85 63 | 77 73 86 80 83 64 | 80 81 84 83 84 65 | 83 83 90 85 81 66 | 87 83 79 89 81 67 | 84 81 85 85 88 68 | 85 85 82 89 86 69 | 89 85 91 84 81 70 | 88 75 82 82 81 71 | 88 84 83 87 87 72 | 84 85 87 89 88 73 | 89 82 81 79 91 74 | 82 80 86 86 85 75 | 86 80 84 86 79 76 | 87 82 87 84 82 77 | 85 82 82 82 88 78 | 82 86 76 85 90 79 | 85 83 86 89 85 80 | 92 92 89 79 81 81 | 87 89 81 88 83 82 | 88 86 88 86 87 83 | 89 81 84 86 85 84 | 87 88 89 81 83 85 | 83 85 82 83 75 86 | 82 88 76 80 82 87 | 89 86 81 90 86 88 | 88 84 92 84 77 89 | 85 82 89 85 88 90 | 77 87 83 91 86 91 | 83 85 90 94 76 92 | 73 81 82 77 77 93 | 84 90 81 79 85 94 | 90 83 80 85 86 95 | 83 84 85 87 88 96 | 80 80 87 81 82 97 | 87 84 85 86 80 98 | 92 82 77 84 85 99 | 86 83 82 81 84 100 | 87 86 82 84 83 101 | 82 86 82 82 79 102 | 84 86 84 78 85 103 | 88 83 76 83 83 104 | 89 81 84 85 87 105 | 76 89 79 85 77 106 | 79 81 80 87 85 107 | 81 90 85 89 84 108 | 92 78 78 87 84 109 | 85 85 85 77 87 110 | 79 81 84 81 81 111 | 76 83 91 83 86 112 | 81 86 82 86 86 113 | 82 88 80 91 85 114 | 85 78 83 89 83 115 | 85 81 84 86 85 116 | 89 89 86 86 88 117 | 80 85 82 84 73 118 | 87 81 83 86 85 119 | 79 87 80 81 85 120 | 82 88 85 86 81 121 | 81 84 86 84 84 122 | 83 80 83 86 87 123 | 85 88 85 87 85 124 | 88 83 84 78 81 125 | 86 88 79 89 86 126 | 92 84 84 82 83 127 | 82 87 87 86 87 128 | 79 89 82 85 85 129 | 87 86 81 83 83 130 | 88 86 86 80 80 131 | 86 85 79 88 86 132 | 82 89 86 84 85 133 | 83 83 78 83 83 134 | 91 88 87 84 85 135 | 75 82 84 82 85 136 | 85 82 83 84 79 137 | 81 89 84 84 89 138 | 81 84 82 90 89 139 | 80 82 89 85 80 140 | 86 86 90 91 81 141 | 82 79 81 86 88 
142 | 94 80 87 86 85 143 | 82 87 83 81 83 144 | 83 83 77 89 82 145 | 82 82 81 84 91 146 | 75 90 87 79 88 147 | 83 89 82 83 85 148 | 79 86 86 85 89 149 | 88 81 81 82 85 150 | 83 90 81 72 78 151 | 86 84 85 76 86 152 | 89 78 80 82 87 153 | 82 83 84 87 80 154 | 83 82 86 90 87 155 | 83 84 85 80 88 156 | 77 84 84 86 87 157 | 81 89 84 84 80 158 | 80 82 82 83 92 159 | 82 80 84 85 80 160 | 79 78 80 78 86 161 | 87 82 85 85 77 162 | 83 84 88 92 86 163 | 87 83 84 84 83 164 | 84 82 84 88 90 165 | 80 84 76 81 75 166 | 88 87 90 86 89 167 | 82 87 85 85 88 168 | 82 76 86 79 82 169 | 87 89 92 76 78 170 | 85 81 89 84 80 171 | 81 80 85 82 81 172 | 90 89 84 85 78 173 | 84 78 80 85 89 174 | 72 80 84 88 79 175 | 85 84 75 87 79 176 | 82 75 91 81 85 177 | 88 87 83 84 82 178 | 89 84 86 83 81 179 | 87 90 84 86 86 180 | 85 89 82 83 91 181 | 85 81 83 84 80 182 | 86 92 79 84 87 183 | 80 83 83 77 88 184 | 87 83 90 80 85 185 | 82 84 84 77 86 186 | 84 93 86 86 80 187 | 78 86 85 86 81 188 | 82 81 81 84 84 189 | 83 87 81 83 79 190 | 83 83 83 84 84 191 | 76 80 85 83 79 192 | 80 78 82 86 81 193 | 84 78 76 82 81 194 | 82 88 84 81 83 195 | 80 83 81 88 81 196 | 90 77 88 86 82 197 | 86 87 88 84 88 198 | 79 79 84 88 86 199 | 86 92 79 86 82 200 | 81 88 85 78 82 201 | -------------------------------------------------------------------------------- /chapter6/tikzlibrarybayesnet.code.tex: -------------------------------------------------------------------------------- 1 | % tikzlibrary.code.tex 2 | % 3 | % Copyright 2010-2011 by Laura Dietz 4 | % Copyright 2012 by Jaakko Luttinen 5 | % 6 | % This file may be distributed and/or modified 7 | % 8 | % 1. under the LaTeX Project Public License and/or 9 | % 2. under the GNU General Public License. 10 | % 11 | % See the files LICENSE_LPPL and LICENSE_GPL for more details. 12 | 13 | % Load other libraries 14 | \usetikzlibrary{shapes} 15 | \usetikzlibrary{fit} 16 | \usetikzlibrary{chains} 17 | \usetikzlibrary{arrows} 18 | 19 | % Latent node 20 | \tikzstyle{latent} = [circle,fill=white,draw=black,inner sep=1pt, 21 | minimum size=20pt, font=\fontsize{10}{10}\selectfont, node distance=1] 22 | % Observed node 23 | \tikzstyle{obs} = [latent,fill=gray!25] 24 | % Constant node 25 | \tikzstyle{const} = [rectangle, inner sep=0pt, node distance=1] 26 | % Factor node 27 | \tikzstyle{factor} = [rectangle, fill=black,minimum size=5pt, inner 28 | sep=0pt, node distance=0.4] 29 | % Deterministic node 30 | \tikzstyle{det} = [latent, diamond] 31 | 32 | % Plate node 33 | \tikzstyle{plate} = [draw, rectangle, rounded corners, fit=#1] 34 | % Invisible wrapper node 35 | \tikzstyle{wrap} = [inner sep=0pt, fit=#1] 36 | % Gate 37 | \tikzstyle{gate} = [draw, rectangle, dashed, fit=#1] 38 | 39 | % Caption node 40 | \tikzstyle{caption} = [font=\footnotesize, node distance=0] % 41 | \tikzstyle{plate caption} = [caption, node distance=0, inner sep=0pt, 42 | below left=5pt and 0pt of #1.south east] % 43 | \tikzstyle{factor caption} = [caption] % 44 | \tikzstyle{every label} += [caption] % 45 | 46 | \tikzset{>={triangle 45}} 47 | 48 | %\pgfdeclarelayer{b} 49 | %\pgfdeclarelayer{f} 50 | %\pgfsetlayers{b,main,f} 51 | 52 | % \factoredge [options] {inputs} {factors} {outputs} 53 | \newcommand{\factoredge}[4][]{ % 54 | % Connect all nodes #2 to all nodes #4 via all factors #3. 
55 | \foreach \f in {#3} { % 56 | \foreach \x in {#2} { % 57 | \path (\x) edge[-,#1] (\f) ; % 58 | %\draw[-,#1] (\x) edge[-] (\f) ; % 59 | } ; 60 | \foreach \y in {#4} { % 61 | \path (\f) edge[->,#1] (\y) ; % 62 | %\draw[->,#1] (\f) -- (\y) ; % 63 | } ; 64 | } ; 65 | } 66 | 67 | % \edge [options] {inputs} {outputs} 68 | \newcommand{\edge}[3][]{ % 69 | % Connect all nodes #2 to all nodes #3. 70 | \foreach \x in {#2} { % 71 | \foreach \y in {#3} { % 72 | \path (\x) edge [->,#1] (\y) ;% 73 | %\draw[->,#1] (\x) -- (\y) ;% 74 | } ; 75 | } ; 76 | } 77 | 78 | % \factor [options] {name} {caption} {inputs} {outputs} 79 | \newcommand{\factor}[5][]{ % 80 | % Draw the factor node. Use alias to allow empty names. 81 | \node[factor, label={[name=#2-caption]#3}, name=#2, #1, 82 | alias=#2-alias] {} ; % 83 | % Connect all inputs to outputs via this factor 84 | \factoredge {#4} {#2-alias} {#5} ; % 85 | } 86 | 87 | % \plate [options] {name} {fitlist} {caption} 88 | \newcommand{\plate}[4][]{ % 89 | \node[wrap=#3] (#2-wrap) {}; % 90 | \node[plate caption=#2-wrap] (#2-caption) {#4}; % 91 | \node[plate=(#2-wrap)(#2-caption), #1] (#2) {}; % 92 | } 93 | 94 | % \gate [options] {name} {fitlist} {inputs} 95 | \newcommand{\gate}[4][]{ % 96 | \node[gate=#3, name=#2, #1, alias=#2-alias] {}; % 97 | \foreach \x in {#4} { % 98 | \draw [-*,thick] (\x) -- (#2-alias); % 99 | } ;% 100 | } 101 | 102 | % \vgate {name} {fitlist-left} {caption-left} {fitlist-right} 103 | % {caption-right} {inputs} 104 | \newcommand{\vgate}[6]{ % 105 | % Wrap the left and right parts 106 | \node[wrap=#2] (#1-left) {}; % 107 | \node[wrap=#4] (#1-right) {}; % 108 | % Draw the gate 109 | \node[gate=(#1-left)(#1-right)] (#1) {}; % 110 | % Add captions 111 | \node[caption, below left=of #1.north ] (#1-left-caption) 112 | {#3}; % 113 | \node[caption, below right=of #1.north ] (#1-right-caption) 114 | {#5}; % 115 | % Draw middle separation 116 | \draw [-, dashed] (#1.north) -- (#1.south); % 117 | % Draw inputs 118 | \foreach \x in {#6} { % 119 | \draw [-*,thick] (\x) -- (#1); % 120 | } ;% 121 | } 122 | 123 | % \hgate {name} {fitlist-top} {caption-top} {fitlist-bottom} 124 | % {caption-bottom} {inputs} 125 | \newcommand{\hgate}[6]{ % 126 | % Wrap the left and right parts 127 | \node[wrap=#2] (#1-top) {}; % 128 | \node[wrap=#4] (#1-bottom) {}; % 129 | % Draw the gate 130 | \node[gate=(#1-top)(#1-bottom)] (#1) {}; % 131 | % Add captions 132 | \node[caption, above right=of #1.west ] (#1-top-caption) 133 | {#3}; % 134 | \node[caption, below right=of #1.west ] (#1-bottom-caption) 135 | {#5}; % 136 | % Draw middle separation 137 | \draw [-, dashed] (#1.west) -- (#1.east); % 138 | % Draw inputs 139 | \foreach \x in {#6} { % 140 | \draw [-*,thick] (\x) -- (#1); % 141 | } ;% 142 | } 143 | 144 | -------------------------------------------------------------------------------- /fullscript/tikzlibrarybayesnet.code.tex: -------------------------------------------------------------------------------- 1 | % tikzlibrary.code.tex 2 | % 3 | % Copyright 2010-2011 by Laura Dietz 4 | % Copyright 2012 by Jaakko Luttinen 5 | % 6 | % This file may be distributed and/or modified 7 | % 8 | % 1. under the LaTeX Project Public License and/or 9 | % 2. under the GNU General Public License. 10 | % 11 | % See the files LICENSE_LPPL and LICENSE_GPL for more details. 
12 | 13 | % Load other libraries 14 | \usetikzlibrary{shapes} 15 | \usetikzlibrary{fit} 16 | \usetikzlibrary{chains} 17 | \usetikzlibrary{arrows} 18 | 19 | % Latent node 20 | \tikzstyle{latent} = [circle,fill=white,draw=black,inner sep=1pt, 21 | minimum size=20pt, font=\fontsize{10}{10}\selectfont, node distance=1] 22 | % Observed node 23 | \tikzstyle{obs} = [latent,fill=gray!25] 24 | % Constant node 25 | \tikzstyle{const} = [rectangle, inner sep=0pt, node distance=1] 26 | % Factor node 27 | \tikzstyle{factor} = [rectangle, fill=black,minimum size=5pt, inner 28 | sep=0pt, node distance=0.4] 29 | % Deterministic node 30 | \tikzstyle{det} = [latent, diamond] 31 | 32 | % Plate node 33 | \tikzstyle{plate} = [draw, rectangle, rounded corners, fit=#1] 34 | % Invisible wrapper node 35 | \tikzstyle{wrap} = [inner sep=0pt, fit=#1] 36 | % Gate 37 | \tikzstyle{gate} = [draw, rectangle, dashed, fit=#1] 38 | 39 | % Caption node 40 | \tikzstyle{caption} = [font=\footnotesize, node distance=0] % 41 | \tikzstyle{plate caption} = [caption, node distance=0, inner sep=0pt, 42 | below left=5pt and 0pt of #1.south east] % 43 | \tikzstyle{factor caption} = [caption] % 44 | \tikzstyle{every label} += [caption] % 45 | 46 | \tikzset{>={triangle 45}} 47 | 48 | %\pgfdeclarelayer{b} 49 | %\pgfdeclarelayer{f} 50 | %\pgfsetlayers{b,main,f} 51 | 52 | % \factoredge [options] {inputs} {factors} {outputs} 53 | \newcommand{\factoredge}[4][]{ % 54 | % Connect all nodes #2 to all nodes #4 via all factors #3. 55 | \foreach \f in {#3} { % 56 | \foreach \x in {#2} { % 57 | \path (\x) edge[-,#1] (\f) ; % 58 | %\draw[-,#1] (\x) edge[-] (\f) ; % 59 | } ; 60 | \foreach \y in {#4} { % 61 | \path (\f) edge[->,#1] (\y) ; % 62 | %\draw[->,#1] (\f) -- (\y) ; % 63 | } ; 64 | } ; 65 | } 66 | 67 | % \edge [options] {inputs} {outputs} 68 | \newcommand{\edge}[3][]{ % 69 | % Connect all nodes #2 to all nodes #3. 70 | \foreach \x in {#2} { % 71 | \foreach \y in {#3} { % 72 | \path (\x) edge [->,#1] (\y) ;% 73 | %\draw[->,#1] (\x) -- (\y) ;% 74 | } ; 75 | } ; 76 | } 77 | 78 | % \factor [options] {name} {caption} {inputs} {outputs} 79 | \newcommand{\factor}[5][]{ % 80 | % Draw the factor node. Use alias to allow empty names. 
81 | \node[factor, label={[name=#2-caption]#3}, name=#2, #1, 82 | alias=#2-alias] {} ; % 83 | % Connect all inputs to outputs via this factor 84 | \factoredge {#4} {#2-alias} {#5} ; % 85 | } 86 | 87 | % \plate [options] {name} {fitlist} {caption} 88 | \newcommand{\plate}[4][]{ % 89 | \node[wrap=#3] (#2-wrap) {}; % 90 | \node[plate caption=#2-wrap] (#2-caption) {#4}; % 91 | \node[plate=(#2-wrap)(#2-caption), #1] (#2) {}; % 92 | } 93 | 94 | % \gate [options] {name} {fitlist} {inputs} 95 | \newcommand{\gate}[4][]{ % 96 | \node[gate=#3, name=#2, #1, alias=#2-alias] {}; % 97 | \foreach \x in {#4} { % 98 | \draw [-*,thick] (\x) -- (#2-alias); % 99 | } ;% 100 | } 101 | 102 | % \vgate {name} {fitlist-left} {caption-left} {fitlist-right} 103 | % {caption-right} {inputs} 104 | \newcommand{\vgate}[6]{ % 105 | % Wrap the left and right parts 106 | \node[wrap=#2] (#1-left) {}; % 107 | \node[wrap=#4] (#1-right) {}; % 108 | % Draw the gate 109 | \node[gate=(#1-left)(#1-right)] (#1) {}; % 110 | % Add captions 111 | \node[caption, below left=of #1.north ] (#1-left-caption) 112 | {#3}; % 113 | \node[caption, below right=of #1.north ] (#1-right-caption) 114 | {#5}; % 115 | % Draw middle separation 116 | \draw [-, dashed] (#1.north) -- (#1.south); % 117 | % Draw inputs 118 | \foreach \x in {#6} { % 119 | \draw [-*,thick] (\x) -- (#1); % 120 | } ;% 121 | } 122 | 123 | % \hgate {name} {fitlist-top} {caption-top} {fitlist-bottom} 124 | % {caption-bottom} {inputs} 125 | \newcommand{\hgate}[6]{ % 126 | % Wrap the left and right parts 127 | \node[wrap=#2] (#1-top) {}; % 128 | \node[wrap=#4] (#1-bottom) {}; % 129 | % Draw the gate 130 | \node[gate=(#1-top)(#1-bottom)] (#1) {}; % 131 | % Add captions 132 | \node[caption, above right=of #1.west ] (#1-top-caption) 133 | {#3}; % 134 | \node[caption, below right=of #1.west ] (#1-bottom-caption) 135 | {#5}; % 136 | % Draw middle separation 137 | \draw [-, dashed] (#1.west) -- (#1.east); % 138 | % Draw inputs 139 | \foreach \x in {#6} { % 140 | \draw [-*,thick] (\x) -- (#1); % 141 | } ;% 142 | } 143 | 144 | -------------------------------------------------------------------------------- /additionalMaterial/sufficient-statistics.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,10pt,landscape,twocolumn]{scrartcl} 2 | 3 | %% Settings 4 | \newcommand\problemset{4} 5 | \newcommand\deadline{Wednesday September 28th, 22:00h} 6 | \newif\ifcomments 7 | \commentsfalse % hide comments 8 | %\commentstrue % show comments 9 | 10 | % Packages 11 | \usepackage{enumitem} 12 | \usepackage[usenames,dvipsnames]{color} 13 | \usepackage{multicol} 14 | 15 | \usepackage{amsmath,amsthm,amssymb} 16 | \usepackage[empty]{fullpage} 17 | \usepackage{comment} 18 | 19 | % Styling 20 | \usepackage{tgpagella} 21 | \usepackage{AlegreyaSans} 22 | \setkomafont{section}{\Large\textsc} 23 | \RedeclareSectionCommand[afterskip=.3\baselineskip]{subsection} 24 | \setlength{\columnsep}{7em} 25 | \definecolor{gray}{gray}{.4} 26 | \definecolor{RED}{rgb}{.5,0,0} 27 | \renewcommand*{\pagemark}{} 28 | 29 | \usepackage{hyperref} 30 | \DeclareMathOperator{\Cov}{Cov} 31 | \DeclareMathOperator{\Cor}{Cor} 32 | \DeclareMathOperator{\Var}{Var} 33 | 34 | \begin{document} 35 | {\sffamily\flushleft\color{gray} 36 | \textsc{\bfseries basic probability: theory}\\ 37 | Master of Logic, University of Amsterdam, 2016\\ 38 | \textsc{teachers} Christian Schaffner and Philip Schulz 39 | \textsc{ta} Bas Cornelissen% 40 | } 41 | {\sffamily\flushleft\huge\bfseries 42 | Some 
-------------------------------------------------------------------------------- /additionalMaterial/sufficient-statistics.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,10pt,landscape,twocolumn]{scrartcl} 2 | 3 | %% Settings 4 | \newcommand\problemset{4} 5 | \newcommand\deadline{Wednesday September 28th, 22:00h} 6 | \newif\ifcomments 7 | \commentsfalse % hide comments 8 | %\commentstrue % show comments 9 | 10 | % Packages 11 | \usepackage{enumitem} 12 | \usepackage[usenames,dvipsnames]{color} 13 | \usepackage{multicol} 14 | 15 | \usepackage{amsmath,amsthm,amssymb} 16 | \usepackage[empty]{fullpage} 17 | \usepackage{comment} 18 | 19 | % Styling 20 | \usepackage{tgpagella} 21 | \usepackage{AlegreyaSans} 22 | \setkomafont{section}{\Large\textsc} 23 | \RedeclareSectionCommand[afterskip=.3\baselineskip]{subsection} 24 | \setlength{\columnsep}{7em} 25 | \definecolor{gray}{gray}{.4} 26 | \definecolor{RED}{rgb}{.5,0,0} 27 | \renewcommand*{\pagemark}{} 28 | 29 | \usepackage{hyperref} 30 | \DeclareMathOperator{\Cov}{Cov} 31 | \DeclareMathOperator{\Cor}{Cor} 32 | \DeclareMathOperator{\Var}{Var} 33 | 34 | \begin{document} 35 | {\sffamily\flushleft\color{gray} 36 | \textsc{\bfseries basic probability: theory}\\ 37 | Master of Logic, University of Amsterdam, 2016\\ 38 | \textsc{teachers} Christian Schaffner and Philip Schulz\\ 39 | \textsc{ta} Bas Cornelissen% 40 | } 41 | {\sffamily\flushleft\huge\bfseries 42 | Some notes on sufficient statistics 43 | }\\[1em]% 44 | 45 | \noindent 46 | \paragraph{The exercise} In exercise 1 of this week's board questions, you were given a set $x_1^n = (x_1, \dots, x_n)$ of $n$ i.i.d.\ observations that were all geometrically distributed. So they are observations of RVs $X_1, \dots, X_n$ where 47 | \[ 48 | P(X_i = x_i \mid \Theta = \theta) = \text{Geom}(x_i \mid \theta) = (1-\theta)^{x_i} \theta. 49 | \] 50 | We had to show that $t := T(x_1^n) = \sum_{i=1}^n x_i$ is a sufficient statistic. 51 | 52 | What does that even mean? By the Factorization Theorem, it suffices to find two functions $g(\theta, t)$ and $h(x, t)$ such that 53 | \begin{align} 54 | P(X_1^n = x_1^n \mid \Theta = \theta) = g(\theta, t) \cdot h(x_1^n, t). 55 | \end{align} 56 | So what is our joint distribution? For legibility, we'll drop the random variables and write e.g. $P(x_1^n \mid \theta) := P(X_1^n = x_1^n \mid \Theta = \theta )$. By independence this is: 57 | \begin{align} 58 | P(x_1^n \mid \theta) = \prod_{i=1}^n (1-\theta)^{x_i} \theta = (1-\theta)^{\sum_{i=1}^n x_i} \cdot \theta^n. 59 | \end{align} 60 | 61 | \paragraph{The answer} 62 | Now observe that this is simply $(1-\theta)^t \theta^n$, so when we choose $g(\theta, t) := (1-\theta)^t \theta^n$ and $h(x, t) := 1$ we have found a factorization of the joint. By the Factorization Theorem, $t$ is thus a sufficient statistic. 63 | 64 | \paragraph{But why?} 65 | True as that may be, this feels a bit unsatisfactory. After all, the idea was that given the value of the sufficient statistic, it should be possible to write the PMF without using the parameter. The Factorization Theorem tells you \emph{that} this is possible, but it doesn't tell you \emph{how} to do it. 66 | 67 | Or does it? In fact, the proof does. We essentially have to expand the conditional distribution of $x_1^n$ given $t$ and $\theta$: 68 | \begin{align}\label{eq:blabla} 69 | P(x_1^n \mid t, \theta) 70 | &= \frac{p(x_1^n, t \mid \theta)}{p(t \mid \theta)} 71 | = \frac{p(x_1^n \mid \theta)}{p(t\mid \theta)} 72 | = \frac{p(x_1^n \mid \theta)}{\sum_{z_1^n: T(z_1^n) = t} p(z_1^n, t \mid \theta)}. 73 | \end{align} 74 | In the second equality we used the fact that $t$ is a deterministic function of $x_1^n$, so the probability of $x_1^n$ and $t$ is exactly the same as the probability of $x_1^n$. In the third equality we used a little trick, writing a marginal as a marginalized joint. 75 | 76 | Recall that we actually had a factorization of $p(x_1^n \mid \theta)$, which we can now substitute into \eqref{eq:blabla} to get 77 | \begin{align}\label{eq:blabla2} 78 | P(x_1^n \mid t, \theta) 79 | = \frac{g(\theta, t) \cdot h(x_1^n, t)}{\sum_{z_1^n: T(z_1^n) = t} g(\theta, t) \cdot h(z_1^n, t)} 80 | = \frac{h(x_1^n,t)}{\sum_{z_1^n: T(z_1^n) = t} h(z_1^n, t)} 81 | \end{align} 82 | And since we know $h(x_1^n, t) = 1$, we can actually calculate this as 83 | \begin{align}\label{eq:blabla3} 84 | P(x_1^n \mid t, \theta) 85 | = \frac{1}{\sum_{z_1^n: T(z_1^n) = t} 1} 86 | = \frac{1}{|\{z_1^n: T(z_1^n) = t\}|} 87 | \end{align} 88 | --- if you manage to count the set in the denominator, that is. (In our geometric example this is possible: the number of tuples of $n$ non-negative integers summing to $t$ is $\binom{t+n-1}{n-1}$, by a stars-and-bars argument.) 89 | 90 | \paragraph{The lesson} 91 | Taking a step back, consider the conditional probability of $x_1^n$ given $t$, as expressed in the first equality of \eqref{eq:blabla}. That is the thing we want to write without using $\theta$, and we can do so if we somehow manage to cancel out the $\theta$ in the numerator against the $\theta$'s in the denominator. This is precisely what happened in the last step of \eqref{eq:blabla2}.
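To see the cancellation happen numerically, here is a small R sketch (our addition, not part of the original notes; the values $n=3$ and $t=4$ are arbitrary). It enumerates all $z_1^n$ with $T(z_1^n) = t$ and confirms that the conditional distribution is the same uniform one for two different values of $\theta$:
\begin{verbatim}
n <- 3; t <- 4
# all n-tuples of non-negative integers (entries at most t) summing to t
grid <- expand.grid(rep(list(0:t), n))
tuples <- grid[rowSums(grid) == t, ]
joint <- function(z, theta) prod((1 - theta)^z * theta)  # product of Geom pmfs
for (theta in c(0.2, 0.7)) {
  probs <- apply(tuples, 1, joint, theta = theta)
  print(unique(round(probs / sum(probs), 12)))  # one value: the uniform weight
}
1 / choose(t + n - 1, n - 1)  # the stars-and-bars count gives the same value
\end{verbatim}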
When working with actual distributions, however, this might be very difficult. Also, finding the actual distribution \emph{without} the $\theta$ need not be easy: you have to deal with the sum in \eqref{eq:blabla2}. 92 | 93 | What else should now be clear? For example: if we have data $x_1^n$ and $y_1^n$ with the same sufficient statistic $T(x_1^n) = T(y_1^n) = t$, drawn from two distributions, with parameters $\theta$ and $\theta'$, then by \eqref{eq:blabla2} 94 | \[ 95 | P(x_1^n \mid t, \theta) = P(y_1^n \mid t, \theta'). 96 | \] 97 | 98 | We can also say something about the original distributions, not conditioned on $t$. The distributions of $x_1^n$ and $y_1^n$ differ from one another only in the normalizing constant. We can make that more explicit as follows: 99 | \begin{align} 100 | P(x_1^n \mid \theta) 101 | &= P(t \mid \theta) \cdot P(x_1^n \mid t, \theta)\\ 102 | &= P(t \mid \theta) \cdot P(y_1^n \mid t, \theta') \\ 103 | &=\frac{P(t \mid \theta)}{P(t \mid \theta')} \cdot P(t \mid \theta') \cdot P(y_1^n \mid t, \theta')\\ 104 | &= \frac{P(t \mid \theta)}{P(t \mid \theta')} \cdot P(y_1^n \mid \theta') 105 | \end{align} 106 | 107 | 108 | 109 | 110 | \end{document} 111 | -------------------------------------------------------------------------------- /chapter4/chapter4_forInclude.tex: -------------------------------------------------------------------------------- 1 | 2 | \setcounter{chapter}{3} 3 | \chapter{Bayes' rule and its applications} 4 | 5 | \section{The chain rule} 6 | 7 | This chapter is going to focus on how to re-write joint and conditional probabilities. When we turn to statistics later on, it will 8 | turn out that it is often hard to define a joint distribution over many variables. Likewise, it can be hard to calculate 9 | the probability distribution of a RV $ X $ conditioned on a RV $ Y $, but it may be much easier to find the distribution of $ Y $ 10 | conditioned on $ X $. In this chapter we are essentially trying to find simpler expressions for distributions that may be hard to 11 | compute. 12 | 13 | The first general method for simplifying a joint distribution is known as the \textbf{chain rule}. For completeness' sake, we are going to formulate the chain rule first for events and then for random variables. 14 | 15 | \begin{Theorem}{\textbf{(Chain rule)}} \label{thm:chain} 16 | The joint probability of events $ E_{1}, \ldots, E_{n} $ can be factorised as 17 | $$ \mathbb{P}(E_{1}, \ldots, E_{n}) = \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}|E_{1}) \times \ldots \times \mathbb{P}(E_{n}|E_{1}, \ldots, E_{n-1}) $$ 18 | \end{Theorem} 19 | Recall from Definition~\ref{def:jointprob} the notation 20 | $\mathbb{P}(E_1,E_2) = \mathbb{P}(E_1 \cap E_2)$ for denoting the 21 | probability that both events $E_1$ and $E_2$ occur. Also remember that 22 | we use the abbreviation $E_1^n := E_1, \ldots, E_n$; so for the case 23 | of events, we have $\mathbb{P}(E_1^n) = \mathbb{P}(\bigcap_{i=1}^n E_i)$. There are a couple of things to note about the chain rule: First of all, the numbering of the events is arbitrary. That means that it does not matter in which 24 | order we decompose the joint probability. We could just as well start with any $ E_{i} $ for $ 1 \leq i \leq n $. Second, we used the 25 | word \textit{factorise}. This simply means that we decompose any expression (in this case a joint probability) into a product. Products are 26 | nice in that we can arrange them in any order that we like (i.e.\ they commute). Moreover, products make a lot of calculations easier, as we will
Moreover, products make a lot of calculations easier, as we will 27 | see later. 28 | 29 | Let us go ahead and actually prove the chain rule. 30 | \paragraph{Proof of Theorem~\ref{thm:chain}} We are going to do so inductively and choose $ \mathbb{P}(E_{1}, E_{2}) $ as our 31 | base case. Then we simply employ the definition of conditional probability to get 32 | \begin{equation} 33 | \mathbb{P}(E_{1}, E_{2}) = \mathbb{P}(E_{1}) \times \dfrac{\mathbb{P}(E_{1}, E_{2})}{\mathbb{P}(E_{1})} = \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}|E_{1}) 34 | \end{equation} 35 | 36 | Let us assume that the chain rule holds for events $ E_{1}, \ldots, E_{n-1} $. We will abbreviate them as $ E_{1}^{n-1} $. Then we get 37 | \begin{equation} 38 | \mathbb{P}(E_{1}^{n-1}, E_{n}) = \mathbb{P}(E_{1}^{n-1}) \times \dfrac{\mathbb{P}(E_{1}^{n-1}, E_{n})}{\mathbb{P}(E_{1}^{n-1})} 39 | = \mathbb{P}(E_{1}^{n-1}) \times \mathbb{P}(E_{n}|E_{1}^{n-1}) 40 | \end{equation} 41 | 42 | Since $ \mathbb{P}(E_{1}^{n-1}) $ factorises according to the chain 43 | rule by our induction hypothesis, we have completed the proof. 44 | $ \square $\bigskip 45 | 46 | The chain rule can make our lives even simpler if we have independent events. Assume we want to compute the joint probability of 3 events 47 | $ E_{1},E_{2},E_{3} $ and we also know that $ E_{1} \bot E_{2} $. In this case our factorisation becomes \eqref{simpleFactor} where 48 | the first equality follows from the chain rule and the second equality follows from independence between $ E_{1} $ and $ E_{2} $. 49 | \begin{align} \label{simpleFactor} 50 | \mathbb{P}(E_{1}, E_{2}, E_{3}) &= \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}|E_{1}) \times \mathbb{P}(E_{3}|E_{1},E_{2}) \\ 51 | &= \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}) \times \mathbb{P}(E_{3}|E_{1},E_{2}) \nonumber 52 | \end{align} 53 | 54 | We can now state the chain rule for random variables. There are two ways you can go about proving it. Either you 55 | calculate the probability of a specific setting of the variables or you just do the proof based on the distributions of the RVs. 56 | So in the first case you would have to prove that 57 | \begin{align*} 58 | \forall x_1,\ldots,x_n: &P(X_{1} = x_{1}, \ldots, X_{n} = x_{n}) \\ 59 | &= P(X_{1} = x_{1}) \times \ldots \times P(X_{n} = x_{n}|X_{1}=x_{1}, \ldots, X_{n-1} = x_{n-1}) 60 | \end{align*} 61 | whereas in the second case you would simply prove that 62 | \begin{align*} 63 | P_{X_{1}^{n}} = \overset{n}{\underset{i=1}{\sum}}P_{X_{i}|X_{1}^{i-1}} 64 | \end{align*} 65 | 66 | Incidentally, we also introduce a very short notation for the chain rule above. Note that it is not quite correct, since if 67 | $ i = 1 $ we would be conditioning on $ X_{0} $. That is not to bad however, since we can always define ourselves a constant variable $ X_{0} $ that does not affect the distribution. Moreover, this notation is really just meant to be convenient, so you should just accept it as is when you encounter it in papers. 68 | 69 | \begin{Exercise} 70 | Prove the chain rule for random variables. The proof is totally analogous to the one given for events. 71 | \end{Exercise} 72 | 73 | \begin{Exercise} 74 | Let $X_0$ be a constant RV, i.e.\ there exists $c \in \mathbb{R}$ such that $P(X_0 = c)=1$. 75 | Prove that $X_0$ is independent of any set of other random variables $X_1,\ldots,X_n$. 76 | \end{Exercise} 77 | 78 | \section{Bayes' rule} 79 | 80 | In this section we are going to prove \textbf{Bayes' rule}. The rule follows directly from the chain rule. 
78 | \section{Bayes' rule} 79 | 80 | In this section we are going to prove \textbf{Bayes' rule}. The rule follows directly from the chain rule. 81 | The proof is really simple and thus of no great interest in and of itself. The consequences of Bayes' rule 82 | are huge, however. It will basically allow us to invert a conditional probability distribution. You may rightfully 83 | ask: what's the deal? Well, as we said in the beginning, it may be hard to compute a conditional distribution in one 84 | direction but much easier to compute it in the other direction. On top of that, Bayes' rule opens up a whole range of new possibilities. We will discuss those as we proceed in this chapter. 85 | \begin{Theorem}{\textbf{(Bayes' rule)}} 86 | The probability distribution of a random variable $ X $ given a random variable $ Y $ can be computed as 87 | $$ P_{X|Y} = \dfrac{P_{Y|X}P_{X}}{P_{Y}} $$ 88 | \end{Theorem} 89 | 90 | And here comes the proof: 91 | \begin{equation} 92 | P_{X|Y} = \dfrac{P_{XY}}{P_{Y}} = \dfrac{P_{Y|X}P_{X}}{P_{Y}} \, . \qquad \square 93 | \end{equation} 94 | 95 | That was the proof! Considering how simple it was, it will be surprising to see what kind of benefits we can get out 96 | of Bayes' rule. To get us started, let us introduce some terminology. In particular, each of the terms 97 | in Bayes' rule has a specific name. You should really learn these names by heart as they crop up all over the place. 98 | 99 | $$ \mathit{posterior} = \dfrac{\mathit{likelihood} \times \mathit{prior}}{\mathit{marginal~likelihood}} $$ 100 | 101 | The posterior is what we get after we have completed the computation. However, its name is related to the prior. 102 | The prior is just the probability $ P(X=x) $ that we would assign \textit{a priori}. Therefore $ P_{X} $ 103 | is also known as the prior distribution. When we divide the product of likelihood and prior by the 104 | marginal likelihood we get a new distribution 105 | over $ X $ that is conditioned on $ Y $. This is the distribution that we place on $ X $ \textit{a posteriori}, i.e. 106 | after having taken into account information about $ X $ that we may get from knowing the value of $ Y $. The marginal 107 | likelihood of $ Y $ is simply needed to normalize the expression to a probability distribution (i.e. to make sure that 108 | it sums to one). Why is it called marginal likelihood? The reason for this is how you can compute it. Recall that when 109 | we are given a joint distribution $ P_{XY} $, we can obtain the distribution 110 | $ P_{Y} $ by simply marginalizing over $ X $. 111 | \begin{equation} 112 | P(Y=y) = \sum_{x \in \supp(X)}P(X=x, Y=y) 113 | \end{equation} 114 | 115 | In addition to that, the chain rule allows us to factorise the joint probability. Thus we get 116 | \begin{equation} 117 | P(Y=y) = \sum_{x \in \supp(X)} P(Y=y|X=x) \times P(X=x) 118 | \end{equation} 119 | 120 | If you think that this looks an awful lot like the numerator of Bayes' rule then you are exactly on the right track. 121 | Essentially, we are just summing over all possible numerators (with respect to $ X $). Let us make this more 122 | concrete with an example. Assume that we are given two coins. One of them is fair, meaning that it is equally probable 123 | to come up heads or tails. The other coin is biased towards tails and we happen to know that its probability to come up 124 | heads is only $ 0.3 $. Which coin is flipped is captured by a random variable $ X $ that takes on the value 0 if the 125 | fair coin is used and the value 1 if the biased coin is used. We have no idea which coin is going to be tossed, it could 126 | be either one.
Therefore we set our prior to $ P(X=0) = P(X=1) = 0.5 $. 127 | 128 | We flip the chosen coin 10 times and obtain 8 heads. The number of heads obtained during the 10 tosses is going 129 | to be encoded by $ Y $. Since all tosses are independent of each other, $ Y $ will 130 | follow a binomial distribution. For each of the two coins we also know the parameter of the binomial distribution. 131 | For the fair coin it is $ \theta = 0.5 $ and for the biased coin it is $ \theta = 0.3 $. Let us compute each of the 132 | numerators separately. 133 | \begin{align} 134 | P(Y=8|X=0) \times P(X=0) &= \binom{10}{8} 0.5^8 (1-0.5)^2 \times 0.5 = 0.02195 \label{bayes1}\\ 135 | P(Y=8|X=1) \times P(X=1) &= \binom{10}{8} 0.3^8 (1-0.3)^2 \times 0.5 = 0.0007 \label{bayes2} 136 | \end{align} 137 | Remember that $ Y \sim binom(10,\theta) $ and that $ \theta=0.5 $ if $ X=0 $ and $ \theta=0.3 $ if $ X=1 $. 138 | 139 | All that is left to do is to compute the marginal likelihood of $ Y $. Luckily for us, $ X $ only assumes two 140 | values, so we only need to add up \eqref{bayes1} and \eqref{bayes2}. 141 | \begin{align} 142 | P(Y=8) = &P(Y=8|X=0) \times P(X=0) \\ 143 | &+ P(Y=8|X=1) \times P(X=1) = 0.02265 \nonumber 144 | \end{align} 145 | 146 | And finally we can apply Bayes' rule to compute the posterior probabilities of $ X $. 147 | \begin{align} 148 | P(X=0|Y=8) &= \dfrac{P(Y=8|X=0) \times P(X=0)}{P(Y=8)} \\ 149 | &= \dfrac{0.02195}{0.02265} = 0.969 \nonumber \\ 150 | P(X=1|Y=8) &= \dfrac{P(Y=8|X=1) \times P(X=1)}{P(Y=8)} \\ 151 | &= \dfrac{0.0007}{0.02265} = 0.031 \nonumber 152 | \end{align} 153 | 154 | There is a probability of $ 0.969 $ that the fair coin has been tossed when a sequence with eight heads is 155 | generated and only a probability of $ 0.031 $ that the biased coin was tossed. Obviously, the probability of the fair 156 | coin is much higher. But how much higher? We can take the ratio of the two probabilities. This gives us 157 | $ \nicefrac{0.969}{0.031} \approx 31 $. We can conclude that the fair coin is roughly 31 times more likely to have generated the sequence with 158 | 8 heads than the biased coin. But wait a second, can we maybe find this ratio somewhere else? It turns out that 159 | the ratio of the likelihoods is the same! That is $ \nicefrac{0.0439}{0.0014} \approx 31 $. 160 | 161 | We started out by assuming that both coins were equally likely to be used. However, we then observed a sequence of 10 tosses, 8 of 162 | which were heads and that made it roughly 31 times more likely that the fair coin was used. What if the priors had not been equal? 163 | Actually, there is a more general story: While calculating the actual probabilities involves a lot of number crunching, just telling whether or not an observation will make one or the other event more likely is not too hard. [For the rest of this chapter, we assume that we only condition on events with non-zero probabilities such as $P(Y=y)>0$ so that we are never dividing by 0]. 164 | \begin{align*} 165 | \frac{P(X=x_{1}|Y=y)}{P(X=x_{2}|Y=y)} &= \frac{\dfrac{P(Y=y|X=x_{1})P(X=x_{1})}{P(Y=y)}}{\dfrac{P(Y=y|X=x_{2})P(X=x_{2})}{P(Y=y)}} \\[1em] 166 | &= \frac{P(Y=y|X=x_{1})P(X=x_{1})}{P(Y=y|X=x_{2})P(X=x_{2})} 167 | \end{align*} 168 | 169 | From the above equalities, we see that the ratio of the posterior probabilities is determined by the ratio of the likelihood times the 170 | prior. In our coin example, the priors were the same so it was only the likelihood that mattered.
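To reproduce these numbers without the intermediate rounding, here is a short R sketch of the whole computation (our addition):
\begin{verbatim}
theta <- c(0.5, 0.3)                 # fair coin, biased coin
prior <- c(0.5, 0.5)
lik   <- dbinom(8, size = 10, prob = theta)   # about 0.0439 and 0.0014
marginal  <- sum(lik * prior)                 # P(Y = 8), about 0.0227
posterior <- lik * prior / marginal           # Bayes' rule
posterior                            # about 0.97 and 0.03
posterior[1] / posterior[2]          # about 30.4 exactly; the 31 above
                                     # comes from the rounded values
\end{verbatim}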
If the ratio of any of the above 171 | terms is greater than 1, the posterior will change in favour of $ X=x_{1} $. If the ratio is smaller than 1 the posterior changes 172 | in favour of $ X=x_{2} $. If the ratio is exactly 1, the posterior stays unchanged. 173 | 174 | Notice that in general, although our observations may shift the posterior in favour of $ X=x_{2} $, say, this shift does not necessarily imply that 175 | $ P(X=x_{2}|Y=y) $ will be greater than $ P(X=x_{1}|Y=y) $. The condition that $ P(X=x_{2}|Y=y) $ is bigger than $ P(X=x_{1}|Y=y) $ can be rewritten as follows 176 | \begin{align*} 177 | P(X=x_{1}|Y=y) &< P(X=x_{2}|Y=y) &\Leftrightarrow \\ 178 | \dfrac{P(Y=y|X=x_{1})P(X=x_{1})}{P(Y=y)} &< \dfrac{P(Y=y|X=x_{2})P(X=x_{2})}{P(Y=y)} &\Leftrightarrow \\ 179 | P(Y=y|X=x_{1})P(X=x_{1}) &< P(Y=y|X=x_{2})P(X=x_{2}) &\Leftrightarrow \\ 180 | \dfrac{P(Y=y|X=x_{1})}{P(Y=y|X=x_{2})} &< \dfrac{P(X=x_{2})}{P(X=x_{1})} 181 | \end{align*} 182 | 183 | The last line is of particular interest as it elucidates the relationship between the prior and the likelihood. Only if the likelihood 184 | ratio for $ X=x_{1} $ over $ X=x_{2} $ is smaller than the reversed prior ratio will the posterior probability of $ X=x_{2} $ 185 | be greater than that of $ X=x_{1} $. This means that if we have strongly asymmetric priors (like $ P(X=x_{1}) = 0.9 $ 186 | and $ P(X=x_{2}) = 0.1 $), the likelihood needs to discriminate very well between the two cases in order to tip the scale in 187 | favour of $ X=x_{2} $. In that sense the prior and the likelihood can be seen as battling forces whose equilibrium gives us 188 | the posterior. 189 | 190 | But enough theory about Bayes' rule, it is about time you apply it! To that end, we present you an exercise that is, in some variation, 191 | contained in virtually every textbook on probability theory, statistics or machine learning. Have fun with it! 192 | 193 | \begin{Exercise} 194 | A random person walks into the doctor's office to be tested for a particular disease. The disease can be fatal if not treated. However, 195 | successful treatment is possible if the disease is discovered early enough. It is commonly known that the disease occurs in 1 out 196 | of 1000 people of the country's population. The doctor will administer a test that with a probability of 99\% returns a positive results 197 | if the patient does indeed have the disease. At the same time, the test also returns a positive result in 5\% of the cases where the 198 | patient does not have the disease. After the test has been administered to the patient in question, it returns a positive result. 199 | What is the probability that the patient is infected with the disease? 200 | \\ 201 | Proceed as follows: 202 | \begin{enumerate} 203 | \item Write down a guess for what you think the probability might be (do not consider any math at this point). 204 | \item Calculate that probability. 205 | \item Check whether there is a considerable difference between your initial guess and the calculated probability. Go on to examine 206 | how the different factors have influenced the probability of the patient having the disease. 207 | \end{enumerate} 208 | \end{Exercise} 209 | 210 | Let us finish up this section with some more notation. In many applications of Bayes' rule we only want to know which outcome is 211 | the most likely, without worrying too much about the actual probabilities. 
Likewise, there is a range of situations where we 212 | just want to assign a score to outcomes and do not demand this score to be a probability. Throughout this chapter, 213 | we have repeatedly encountered the following phenomenon: In order to rank the values of an RV according to their probabilities, we do not necessarily need to compute the marginal likelihood since it cancels in all these comparisons anyway. Therefore, you will often see authors stating that 214 | \begin{equation} \label{proportionality} 215 | P(X=x|Y=y) \propto P(Y=y|X=x)P(X=x) 216 | \end{equation} 217 | 218 | This equation reads as ``the posterior is proportional to the product of the likelihood and the prior''. In general, if we have two quantities 219 | $ a $ and $ b $, then by $ a \propto b $ we mean that there is some constant 220 | $ C \in \mathbb{R} \setminus \{0\} $ such that $ a = Cb $. Notice 221 | that the probability distribution is a function and hence we require $ C $ to be the same across the domain of that function (that 222 | is $ C $ should be the same for all values of $ X $). 223 | 224 | \begin{Exercise} 225 | What is the value of $ C $ in Equation~\eqref{proportionality}? 226 | \end{Exercise} 227 | 228 | 229 | 230 | \section{Na\"ive Bayes} 231 | In this section, we introduce a rather crude application of Bayes's rule which is surprisingly successful nonetheless. 232 | Assume that instead of one random variable we are observing a sequence of random variables. Thus our problem is the following: 233 | \begin{equation} 234 | P(Y=y|X_{1}^{n}=x_{1}^{n}) \propto P(X_{1}^{n}=x_{1}^{n}|Y=y) \times P(Y=y) 235 | \end{equation} 236 | 237 | By the chain rule we can decompose the right-hand side into 238 | \begin{align} 239 | P(Y=y|X_{1}^{n}=x_{1}^{n}) 240 | \propto &P(X_{1}=x_{1}|Y=y) \times \ldots \nonumber \\ 241 | &\times P(X_{n}=x_{n}|Y=y,X_{1}^{n-1}=x_1^{n-1}) \times P(Y=y) \nonumber 242 | \end{align} 243 | 244 | We are now going to introduce the aforementioned crudeness into the model by assuming that all $ X_1,\ldots,X_n$ are conditionally independent given $ Y $. Notice that 245 | this is just an assumption that we are making without justification. In fact, it is very likely wrong. However, it makes our 246 | lives much easier because we only have to deal with very simple terms of the form $ P(X_{i}=x_{i}|Y=y) $. Because of the 247 | crudeness of our assumptions, this probabilistic model is known as \textbf{na\"ive Bayes} (sometimes also 248 | stupid Bayes). 249 | 250 | \begin{Definition} 251 | A na\"ive Bayes model is a probabilistic model that assumes 252 | $$ P_{Y|X_{1}^{n}} \propto P_Y P_{X_{1}|Y} P_{X_{2}|Y} \cdots P_{X_{n}|Y} $$ 253 | \end{Definition} 254 | Once we know all the component distributions $P_{X_i|Y}$, calculating the result is pretty straightforward. 255 | 256 | In order to illustrate how na\"ive Bayes works we are going to employ one of its showcase applications where it indeed had 257 | a lot of success in real life. The application we are talking about is text classification. The task is the following: you 258 | are given some documents and for each of the documents you have to assign a label signifying its class. What you consider 259 | a class depends on your actual application setting, but usually classes are broad categories, such as legal texts, medical 260 | texts etc. If you manage to succeed at this task, you can accomplish a lot of things automatically that required humans before. 
For example, you could tag online news with their relevant categories and people who are interested in 261 | a particular category will then have an easier time finding the news related to that category. Crucially, since you will 262 | write a computer program that does the classification for you, you will not need to read any of the texts yourself. This automation will obviously allow you to classify huge quantities of text in a very short amount of time. 263 | 264 | \begin{Exercise} 265 | A collection of text (or any other kind of data for that matter) is often called a \textbf{corpus}. Here we are going to 266 | use a toy corpus. The corpus just consists of two sentences and we assume that each sentence constitutes 267 | a document. 268 | The categories that you can label the documents with are 269 | finance (0), medicine (1) or law (2). You can find the corpus and the pmfs of the distributions below. For simplicity, we are not going to distinguish between lower and upper case words (this is actually common practice). For better 270 | readability, we are also using the actual words instead of their numerical encodings as values for the random 271 | variables. Just remember that those words could also be represented as values of real-valued random variables. To shorten notation, we 272 | will use pmfs. If the probability of a word given a category is not specified, take it to be 0. 273 | 274 | 275 | Your task is to classify these two documents correctly using a Na\"ive Bayes Model that conditions each 276 | word's probability on the document class. Please also report the posterior probability for the correct label. (A short R sketch for checking your calculation follows the distributions below.) 277 | \end{Exercise} 278 | 279 | \newpage 280 | \textbf{The corpus:} 281 | \begin{itemize} 282 | \item a fact has been revealed 283 | \item the doctor's judgement has not been reliable 284 | \end{itemize} 285 | 286 | \textbf{The document category pmfs:} 287 | \begin{itemize} 288 | \item $ p(0) = 0.3 $ 289 | \item $ p(1) = 0.2 $ 290 | \item $ p(2) = 0.5 $ 291 | \end{itemize} 292 | 293 | \textbf{The lexical distribution for document category finance (0):} 294 | \begin{align*} 295 | &p(\mathit{a}|0) = 0.19~~p(\mathit{fact}|0)= 0.14~~p(\mathit{has}|0)=0.13~~p(\mathit{been}|0)=0.12 \\ 296 | &p(\mathit{revealed}|0)=0.04~~p(\mathit{the}|0)=0.21~~p(\mathit{doctor's}|0)=0.03 \\ 297 | &p(\mathit{judgement}|0)=0~~p(\mathit{not}|0)=0.11~~p(\mathit{reliable}|0)=0.03 298 | \end{align*} 299 | 300 | \textbf{The lexical distribution for document category medicine (1):} 301 | \begin{align*} 302 | &p(\mathit{a}|1) = 0.02~~p(\mathit{fact}|1)= 0.08~~p(\mathit{has}|1)=0.13~~p(\mathit{been}|1)=0.13 \\ 303 | &p(\mathit{revealed}|1)=0.01~~p(\mathit{the}|1)=0.18~~p(\mathit{doctor's}|1)=0.06 \\ 304 | &p(\mathit{judgement}|1)=0.14~~p(\mathit{not}|1)=0.20~~p(\mathit{reliable}|1)=0.05 305 | \end{align*} 306 | 307 | \textbf{The lexical distribution for document category law (2):} 308 | \begin{align*} 309 | &p(\mathit{a}|2) = 0.18~~p(\mathit{fact}|2)= 0.03~~p(\mathit{has}|2)=0.05~~p(\mathit{been}|2)=0.13 \\ 310 | &p(\mathit{revealed}|2)=0.10~~p(\mathit{the}|2)=0.14~~p(\mathit{doctor's}|2)=0.06 \\ 311 | &p(\mathit{judgement}|2)=0.07~~p(\mathit{not}|2)=0.08~~p(\mathit{reliable}|2)=0.16 312 | \end{align*} 313 |
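Here is the promised R sketch for checking your hand computation (our addition; the lexical pmfs are simply typed in from the tables above):
\begin{verbatim}
priors <- c(0.3, 0.2, 0.5)   # finance (0), medicine (1), law (2)
words  <- c("a", "fact", "has", "been", "revealed",
            "the", "doctor's", "judgement", "not", "reliable")
lex <- rbind(
  c(0.19, 0.14, 0.13, 0.12, 0.04, 0.21, 0.03, 0.00, 0.11, 0.03),  # finance
  c(0.02, 0.08, 0.13, 0.13, 0.01, 0.18, 0.06, 0.14, 0.20, 0.05),  # medicine
  c(0.18, 0.03, 0.05, 0.13, 0.10, 0.14, 0.06, 0.07, 0.08, 0.16))  # law
colnames(lex) <- words
posterior <- function(doc) {
  scores <- priors * apply(lex[, doc, drop = FALSE], 1, prod)
  scores / sum(scores)   # normalise: the marginal likelihood cancels
}
posterior(c("a", "fact", "has", "been", "revealed"))
posterior(c("the", "doctor's", "judgement", "has", "not", "been", "reliable"))
\end{verbatim}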
314 | \section*{Further Reading} 315 | Here, we have only scratched the surface of what Bayes' rule allows us to do. To get a wider outlook on what else is possible, 316 | you can consult \href{http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html}{Kevin Murphy's webpage}. 317 | 318 | %%% Local Variables: 319 | %%% mode: latex 320 | %%% TeX-master: "chapter4" 321 | %%% End: 322 | -------------------------------------------------------------------------------- /chapter7/chapter7_forInclude.Rnw: -------------------------------------------------------------------------------- 1 | \chapter{Basics of Information Theory} 2 | 3 | When we talk about \textit{information}, we often use the term in a qualitative sense. We say things like 4 | \textit{This is valuable information} or 5 | \textit{We have a lack of information}. We can also make statements about some information being more helpful than other information. For a long time, however, 6 | people were unable to quantify information. The person who succeeded in this endeavour was \href{https://en.wikipedia.org/wiki/Claude_Shannon}{Claude E. Shannon}, 7 | who with his famous 1948 article \textit{A Mathematical Theory of Communication} single-handedly created a new discipline: Information Theory! He also revolutionised 8 | digital communication and can be seen as one of the main contributors to our modern communication systems such as the telephone and the internet. 9 | 10 | The beauty of information theory is that it is based on probability theory, and many results from probability theory seamlessly carry over to information theory. 11 | In this chapter, we are going to discuss the bare basics of information theory. These basics are often enough to understand many information-theoretic arguments 12 | that researchers make in fields like computer science, psychology and linguistics. 13 | 14 | \section{Surprisal and Entropy} 15 | Shannon's idea of information is as simple as it is compelling. The amount of \emph{surprisal} of an event $E$ is based on the inverse probability $1/P(E)$. Intuitively, rare events (where $P(E)$ is small) are more surprising than those occurring with high probability (where $P(E)$ is high). If we are observing a realisation of a random variable, this realisation is surprising if it is unlikely to occur according to the distribution of that random variable. However, if the probability for the realisation is very low, then on average it does not occur very often, meaning that if we sample from the RV repeatedly, we are not surprised very often. We are not surprised when the probability mass of the distribution is concentrated on only a small subset of its support. 16 | 17 | On the other hand, we quite often are surprised if we cannot predict what the outcome of our next draw from the RV might be. We are surprised when the distribution over values of the RV is (close to) uniform. Thus, we are going to be most surprised on average if we are observing realisations of a uniformly distributed RV. 18 | 19 | Shannon's idea was that observing RVs that cause a lot of surprises is informative because we cannot predict the outcomes and with each new outcome we have effectively learned something (namely that the $ i^{th} $ outcome took on the value that it did). Observing RVs with very concentrated distributions is not very informative under this conception because by just choosing the most probable outcome we can correctly predict most actually observed outcomes. Obviously, if I manage to predict an outcome beforehand, its occurrence is not teaching me anything. 20 | 21 | The goal of Shannon was to find a function that captures this intuitive idea. He eventually found it and showed that it is the only function to have properties that encompass the intuition.
This function is called the \textbf{entropy} of a RV and it is simply the expected \textbf{surprisal} value, expressed in bits. 22 | 23 | \begin{Definition}[Surprisal] 24 | The surprisal (value) of an outcome $ x \in \supp(X) $ of some RV $ X 25 | $ is defined as $ -\log_{2}(P(X=x)) = \log_2(\frac{1}{P(X=x)})$. 26 | \end{Definition} 27 | 28 | Notice that we are using the logarithm of base 2 here. This is because surprisal and entropy are standardly measured in bits. Intuitively, the surprisal measures how many bits one needs to encode an observed outcome given that one knows the distribution underlying that outcome. Check \href{http://www.umsl.edu/~fraundorfp/egsurpriNOLOGS.html}{this website} to get a feeling for surprisal values measured in bits. 29 | 30 | \begin{Definition}[Entropy] 31 | The entropy $H(P_X)$ of a RV $ X $ with distribution $P_X$ is defined as 32 | $$H(P_X) := \E[-\log_{2}(P(X=x))] = - \!\! \sum_{x \in \supp(X)} P(X=x) \log_2(P(X=x)) \, .$$ 33 | For the ease of notation, we often write $H(X)$ instead of $H(P_X)$. 34 | \end{Definition} 35 | 36 | The notational convenience of writing $H(X)$ instead of $H(P_X)$ can be confusing, because entropy is really assigning a (non-negative) real number to a distribution, i.e.\ $H(X)$ is {\bf not a function} of the random variable $X$ and it is {\bf not a random variable} either! Formally, for any random variable $X$ with distribution $P_X$ over the set $\mathcal{X}=\supp(X)$ (which might be categorical, i.e.\ $X$ could for instance take on values ``blue'', ``red'' and ``green''), we consider the surprisal function (in bits) $f(x) := -\log_2(P(X=x))$ mapping elements $x \in \mathcal{X}$ to real numbers $f(x) \in \mathbb{R}$. In that case, the surprisal $f(X)$ is a random variable over the reals and its expected value is well defined and called entropy $H(X) = H(P_X) := \E_X[f(X)]$. 37 | 38 | As an example, we consider the categorical random variable $X$ with distribution $P(X=\varheart)=P(X=\clubsuit)=1/4, P(X=\spadesuit)=1/2$. In that case, $\supp(X) = \{\varheart, \clubsuit, \spadesuit \}$ and surprisal values in bits are $f(\varheart)=f(\clubsuit)=\log_2(4)=2, f(\spadesuit)=\log_2(2)=1$. The entropy is the expected surprisal value, i.e.\ the individual surprisal values weighted with their corresponding probabilities of occurring: $H(X) = \E_X[f(X)] = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 + \frac{1}{2} \cdot 1 = 3/2$. 39 | 40 | The entropy ``does not care'' about the actual outcomes or labels of a random variable, but only about the distribution! In fact, not even the order of the actual probabilities matters, as we are taking an expected value and the additive terms commute. You can verify that the calculation of $H(X)=3/2$ in the example above applies to all random variables $X$ with distribution $(1/2, 1/4, 1/4)$, no matter what the actual outcomes are. 41 | 42 | \begin{Exercise} 43 | Compute the entropy of $Y \sim Binomial(n=2,p=1/2)$. 44 | \end{Exercise} 45 | 46 | The simplest and simultaneously most important example of entropy is given in Figure~\ref{fig:binaryEntropy} which shows the entropy of the Bernoulli distribution as a function of the parameter $ \theta \in [0,1]$. The entropy function of the Bernoulli is often called the \textbf{binary entropy} $h(\theta) := -\theta \cdot \log_2(\theta) - (1-\theta) \log_2(1-\theta)$. It measures the information of a binary decision, like a coin flip or an answer to a yes/no-question.
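As a small computational aside (a chunk we add here; the helper name \texttt{entropy} is of course arbitrary), computing the entropy of any finite distribution in R is a one-liner:
<<entropySketch, eval=FALSE>>=
# entropy (in bits) of a finite pmf, with the convention 0 * log(0) = 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}
entropy(c(1/2, 1/4, 1/4))  # 1.5, the card-suit example above
entropy(rep(1/8, 8))       # log2(8) = 3 bits for a uniform distribution
@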
47 | The entropy of the Bernoulli attains its maximum of 1 bit when the distribution is uniform, i.e.\ when both choices are equally 48 | probable. The entropy is 0 if and only if the coin is fully biased towards heads or tails. As explained above, the entropy of the distributions $(\theta, 1-\theta)$ and $(1-\theta,\theta)$ is the same and therefore $h(\theta)=h(1-\theta)$ and the graph is symmetric around $1/2$. 49 | 50 | <<binaryEntropy, echo=FALSE, fig.cap="Binary entropy function", fig.pos="t!">>= 51 | x = seq(0,1,.001) 52 | y = -(x*log2(x)+(1-x)*log2(1-x)) 53 | plot(x,y,ylab=expression(h(theta)), xlab=expression(theta),type="l") 54 | @ 55 | 56 | \medskip 57 | From the plot it is also easy to see that entropy is never negative. It holds in general that entropy is non-negative, 58 | because entropy is defined as the expectation of surprisal and surprisal is the negative logarithm of probabilities. 59 | Because $ \log(x) \leq 0 $ for $ x \in (0,1] $, it is clear that $ -\log(x) \geq 0 $ for $ x $ in the same 60 | interval. Notice that from here on we drop the subscript and by convention let $ \log = \log_{2} $. 61 | 62 | A standard interpretation of the entropy is that it quantifies uncertainty. As we have pointed out before, a uniform distribution means that you are most uncertain and indeed the uniform distribution maximizes the entropy. However, the more choices you have to pick from uniformly, the more uncertain you are going to be. The entropy function also captures this intuition. Notice that if a discrete distribution is uniform, all probabilities are $ \frac{1}{|\supp(X)|} $. Clearly, as we increase $ |\supp(X)| $, we decrease the probabilities. By decreasing the probabilities, we increase their negative logarithms, and hence their average surprisal. Let us make this intuition more formal. 63 | 64 | \begin{Theorem} 65 | A discrete RV $ X $ with uniform distribution and support of size $ n $ has entropy 66 | $ H(X) = \log(n) $. 67 | \end{Theorem} 68 | 69 | \paragraph{Proof:} 70 | \begin{align} 71 | H(X) &= \underset{x \in \supp(X)}{\sum}-\log(P(X=x))P(X=x) \\ 72 | &= \underset{x \in \supp(X)}{\sum} -\log(\frac{1}{|\supp(X)|})P(X=x) \\ 73 | &= \underset{x \in \supp(X)}{\sum}\log(n)P(X=x) = \log(n) \, . 74 | \hspace{1cm} \square 75 | \end{align} 76 | 77 | \begin{Exercise} 78 | You are trying to learn chess and you start by studying where chess grandmasters move their king when it 79 | is positioned in one of the middle fields of the board. The king can move to any of the 8 adjoining fields. Since 80 | you do not know a thing about chess yet, you assume that each move is equally probable. In this situation, 81 | what is the entropy of moving the king? 82 | \end{Exercise} 83 | 84 | One of the first important results in information theory is \href{https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem}{Shannon's source-coding theorem} which states that the entropy $H(X)$ of a random variable $X$ measures how many bits one will need on average to encode an outcome that is generated by the distribution $ P_{X} $. 85 | This result applies to the real-world problem of data compression. Assume that $N$ data points are drawn iid from the distribution $P_X$. In that case, the source-coding theorem tells us that on average, we will need $N \cdot H(X)$ bits to store the (optimally compressed) data. For example, let $P_X$ be the $Bernoulli(\theta)$ distribution over bits. In the case $\theta=1/2$, we have $N$ perfectly random bits which cannot be compressed, and hence we need $N \cdot H(X) = N \cdot h(\theta) = N \cdot h(1/2) = N$ bits of storage. For the general case $\theta \neq 1/2$ when the individual bits are biased, the graph of the binary entropy $h(\theta)$ in Figure~\ref{fig:binaryEntropy} tells us exactly what the compression ratio will be. We will not cover the proof of the source-coding theorem here, but refer to the literature instead. 86 | 87 |
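As a quick numerical illustration (a sketch we add, with an arbitrary choice of $N$):
<<compressionSketch, eval=FALSE>>=
h <- function(t) -t * log2(t) - (1 - t) * log2(1 - t)  # binary entropy
N <- 1e6                 # one million iid bits
N * h(0.5)               # fair bits: 1,000,000 bits, no compression possible
N * h(0.1)               # biased bits: about 469,000 bits, a ratio of ~0.47
@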
88 | \section{Conditional Entropy} 89 | At the outset of this chapter we promised you that you could easily transfer results from probability 90 | theory to information theory. We will not be able to show any kind of linearity for entropy because it contains 91 | log-terms and the logarithm is not linear. We can however find alternative expressions for joint entropy (where 92 | the joint entropy is simply the entropy of a joint RV). Before we do so, let us also define the notion of 93 | conditional entropy. We have seen in Section~\ref{sec:jointconditionaldistributions} that $P_{X|Y=y}$ is a valid probability distribution for any $y \in \supp(Y)$ such that $P(Y=y)>0$. Hence, we can also define its conditional entropy. 94 | 95 | \begin{Definition}[Conditional Entropy] 96 | For two jointly distributed RVs $ X,Y $ and $y \in \supp(Y)$ such that $P(Y=y)>0$, the conditional entropy of $ X $ given that $ Y=y $ is defined as 97 | \begin{align*} 98 | H(X | Y=y) &:= \E_X[-\log_{2}(P(X=x | Y=y))] \\ 99 | &= - \!\! \sum_{x \in \supp(X)} P(X=x | Y=y) \log_2(P(X=x | Y=y))\, . 100 | \end{align*} 101 | The conditional entropy of $X$ given $Y$ is defined as 102 | $$ H(X | Y) := \E_Y[ H(X | Y=y) ] = \sum_{y \in \supp(Y)} P(Y=y) H(X | Y=y) \, .$$ 103 | \end{Definition} 104 | 105 | Intuitively, $H(X | Y)$ is the (average) uncertainty of $X$ after learning $Y$. Moreover, learning $Y$ (and in fact any information) cannot increase your uncertainty about $X$. Formally, one can prove the following 106 | \begin{Lemma}[see e.g.\ Proposition~4 of \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/2016/notes/CramerFehr.pdf}{this script}] \label{lemma:noincrease} 107 | For any two random variables $X,Y$ with joint distribution $P_{XY}$, it holds that $H(X | Y) \leq H(X)$. 108 | \end{Lemma} 109 | Note, however, that this non-increase of uncertainty only holds on average, as illustrated by the following example: 110 | 111 | \paragraph{Example} 112 | Consider the binary random variables $X$ and $Y$, with joint distribution 113 | \begin{align*} 114 | &P(X=0,Y=0) = \frac{1}{2}, \quad P(X=0,Y=1) = \frac{1}{4}\\ 115 | &P(X=1,Y=0) = 0, \quad P(X=1,Y=1) = \frac{1}{4}. 116 | \end{align*} 117 | By marginalization, we find that $P(X=0) = \frac{3}{4}$ and $P(X=1) = \frac{1}{4}$, while $P(Y=0) = P(Y=1) = \frac{1}{2}$. This allows us to make the following computations: 118 | \begin{align*} 119 | H(X,Y) &= \frac{1}{2}\log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2}\\ 120 | H(X) &= h\left(\frac{1}{4}\right) = h\left(\frac{3}{4}\right) \approx 0.81\\ 121 | H(Y) &= h\left(\frac{1}{2}\right) = 1\\ 122 | H(X|Y) &= P(Y=0) \cdot H(X | Y=0) + P(Y=1) \cdot H(X | Y=1)\\ 123 | &= \frac{1}{2} \cdot 0 + \frac12 \cdot 1 = \frac12 \\ 124 | H(Y|X) &= P(X=0) \cdot H(Y | X=0) + P(X=1) \cdot H(Y | X=1)\\ 125 | &= \frac{3}{4} \cdot h\left(\frac{1}{3} \right) + \frac{1}{4} \cdot 0 \approx 0.69 126 | \end{align*} 127 | % We also could have computed $H(X|Y)$ and $H(Y|X)$ directly through the definition of conditional entropy. 128 | Note that for this specific distribution, learning the outcome $Y=1$ increases the uncertainty about $X$, $H(X|Y=1) > H(X)$, but on average, we always have $H(X|Y) \leq H(X)$.
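If you want to verify these numbers mechanically, here is a small R sketch (our addition):
<<condEntropySketch, eval=FALSE>>=
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
pxy <- matrix(c(1/2, 1/4,
                0,   1/4), nrow = 2, byrow = TRUE)  # rows: X = 0,1; cols: Y = 0,1
entropy(as.vector(pxy))  # H(X,Y) = 1.5
entropy(rowSums(pxy))    # H(X), about 0.81
entropy(colSums(pxy))    # H(Y) = 1
py <- colSums(pxy)
sum(py * apply(pxy, 2, function(col) entropy(col / sum(col))))  # H(X|Y) = 0.5
@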
It is important to remember that Lemma~\ref{lemma:noincrease} only holds on average, not for specific values of $Y$. Note also that in this example, $H(X|Y) \neq H(Y|X)$. 129 | 130 | It is not a coincidence that the joint entropy $H(X,Y)$ in the example above is equal to $H(X|Y)+H(Y)$ and $H(Y|X)+H(X)$. One can prove this chain rule in general: 131 | 132 | \begin{align*} 133 | H(X,Y) &= \underset{\substack{x \in \supp(X)\\y \in \supp(Y)}}{\sum} -\log(P(X=x,Y=y)) \times P(X=x, Y=y) \\ 134 | \begin{split} 135 | &= \underset{\substack{x \in \supp(X)\\ y \in \supp(Y)}}{\sum} -\log(P(X=x \mid Y=y)) \times P(X=x,Y=y) \\ 136 | &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times \sum_{x \in \supp(X)} P(X=x,Y=y) 137 | \end{split} \\ 138 | \begin{split} 139 | &=\sum_{y \in \supp(Y)} P(Y=y) \times \sum_{x \in \supp(X)} -\log(P(X=x \mid Y=y)) \times P(X=x \mid Y=y) \\ &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times P(Y=y) 140 | \end{split} \\ 141 | &= H(X | Y) + H(Y) \; . 142 | \end{align*} 143 | 144 | \begin{Exercise} 145 | Prove that $ H(X,Y | Z) = H(X | Z) + H(Y | Z) $ if $ X \bot Y \mid Z $. 146 | \end{Exercise} 147 | As a corollary, we get that $H(X,Y)=H(X)+H(Y)$ for independent random variables $X$ and $Y$. More generally, the entropy of $n$ independent random variables is $H(X_1^n) = \sum_{i=1}^n H(X_i)$. 148 | 149 | 150 | \section{An Information-Theoretic View on EM} 151 | Now that we have seen some information-theoretic concepts, you may be happy to hear that there is an information-theoretic interpretation 152 | of EM. This interpretation helps us to get a better intuition for the algorithm. To formulate that interpretation we need 153 | one more concept, however. 154 | 155 | \begin{Definition}[Relative Entropy] 156 | The relative entropy of RVs \\ $ X,Y $ with distributions $P_X, P_Y$ and $\supp(X) \subseteq \supp(Y) $ is defined as 157 | $$ D(P_X||P_Y) := \sum_{x \in \supp(X)} P(X=x) \log \frac{P(X=x)}{P(Y=x)} \ . $$ 158 | If $ P(Y=x) = 0 $ for any $ x \in \supp(X) $ we define $ D(P_X||P_Y) = \infty $. As with entropy, we often abbreviate $D(P_X||P_Y)$ with $D(X||Y)$. 159 | \end{Definition} 160 | 161 | The relative entropy is commonly known as \textbf{Kullback-Leibler (KL)} divergence. It compares the distribution of $ X $ to that of $ Y $. Intuitively, 162 | it gives a measure of how ``far away'' $ P_{X} $ is from $ P_{Y} $. To 163 | understand ``far away'', recall that entropy is a measure of 164 | uncertainty. 165 | % The 166 | % relative entropy measure the uncertainty that you have about $ P_{X} $ if you know $ P_{Y} $\chris{hard to see why at this point}. 167 | This uncertainty is low if both distributions place most 168 | of their mass on the same outcomes. Since $ \log(1) = 0 $ the relative entropy is 0 if $ P_{X} = P_{Y} $. 169 | 170 | It is worthwhile to point out the difference between relative and conditional entropy. Conditional entropy is the average entropy of $ X $ given that you 171 | know what value $ Y $ takes on. In the case of relative entropy you do not know the value of $ Y $, only its distribution. 172 | 173 | \begin{Exercise} 174 | Show that $ D(X,Y||Y) = -H(X | Y) $. Furthermore show that $ D(X,Y||Y) = -H(X) $ if $ X\bot Y $. 175 | \end{Exercise} 176 | 177 |
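Here is a small R sketch of the definition (our addition; the two distributions are arbitrary examples):
<<klSketch, eval=FALSE>>=
kl <- function(px, py) {
  if (any(py == 0 & px > 0)) return(Inf)  # the convention from the definition
  s <- px > 0
  sum(px[s] * log2(px[s] / py[s]))
}
kl(c(1/2, 1/4, 1/4), c(1/3, 1/3, 1/3))  # positive: the distributions differ
kl(c(1/2, 1/4, 1/4), c(1/2, 1/4, 1/4))  # zero, since P_X = P_Y
@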
178 | Let us start by remembering why we need EM. We have a model that defines a joint distribution 179 | over observed ($ x $) and latent data ($ y $). Such a model generally looks as follows: 180 | \begin{equation} 181 | P(X=x, Y=y \mid \Theta = \theta) = P(X=x \mid Y=y, \Theta=\theta) P(Y=y \mid \Theta = \theta) 182 | \end{equation} 183 | where we have chosen a factorization that provides a separate term for a distribution over only the 184 | latent data. 185 | 186 | Recall that the goal of the EM algorithm is to iteratively increase the likelihood through consecutive 187 | updates of parameter estimates. These updates are achieved through maximum-likelihood estimation based 188 | on expected sufficient statistics. We are now going to show a) that EM computes a lower bound on the 189 | marginal log-likelihood of the data in each iteration and b) that this lower bound becomes tight when the 190 | expected sufficient statistics are taken with respect to the model posterior. The latter implies that 191 | EM performs the optimal update in each iteration. 192 | 193 | Let us start by expanding the data log-likelihood and then lower-bounding it. 194 | \begin{align} 195 | &\log(P(X=x \mid \Theta=\theta)) = \log(\sum_y P(X=x, Y=y \mid \Theta = \theta)) \\ 196 | &= \log\left(\sum_{y} Q(Y=y \mid \Phi=\phi)\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\ 197 | &\geq \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) 198 | \label{eq:ELBO1} 199 | \end{align} 200 | Here, we have used \href{https://en.wikipedia.org/wiki/Jensen\%27s_inequality}{Jensen's Inequality} to 201 | derive the lower bound. Observe that the log is indeed a concave function. 202 | 203 | We have also introduced 204 | an auxiliary distribution $ Q $ over the latent variables with parameters $ \phi $. 205 | For reasons that we will explain shortly, 206 | this distribution is often called the \textbf{variational distribution} and its parameters the 207 | \textbf{variational parameters}. The letter $ Q $ is slightly non-standard to denote distributions but 208 | we are following conventions from the field of \textbf{variational inference} here. 209 | 210 | In the next step, we factorise the model distribution in order to recover a KL divergence term between 211 | the variational distribution and the model posterior over latent variables. 212 | \begin{align} 213 | &\sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\ 214 | &= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)P(X=x \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\ 215 | &= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) + \log(P(X=x \mid \Theta=\theta)) \\ 216 | &= -D(Q||P) + \log(P(X=x \mid \Theta=\theta)) \label{eq:ELBO2} 217 | \end{align} 218 | Equation~\eqref{eq:ELBO2} gives us two insights. First, it quantifies the gap between the lower bound 219 | and the actual data likelihood. This gap is equal to the KL divergence between the variational distribution 220 | and the model posterior over latent variables. Second, since the KL divergence is non-negative and zero exactly when the two distributions coincide, the bound only becomes 221 | tight when $ P=Q $. But this is exactly what is happening in the E-step! The E-step sets $ P=Q $ and 222 | then computes expectations under that distribution (see Equation~\eqref{eq:ELBO1}). Thus, the E-step increases 223 | the lower bound on the marginal log-likelihood.
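Before turning to the M-step, here is a tiny numerical sketch of the bound (our addition; the joint over three latent values at one fixed observation $x$ is made up):
<<elboSketch, eval=FALSE>>=
pxy  <- c(0.3, 0.1, 0.2)   # p(x, y | theta) for y = 1, 2, 3 at a fixed x
px   <- sum(pxy)           # marginal likelihood p(x | theta) = 0.6
post <- pxy / px           # posterior p(y | x, theta)
elbo <- function(q) sum(q * log(pxy / q))  # the lower bound from above
log(px)                    # the true log-likelihood
elbo(post)                 # equal: the bound is tight when Q is the posterior
elbo(c(1/3, 1/3, 1/3))     # strictly smaller; the gap is the KL divergence
@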
224 | 225 | Looking back at Equation~\eqref{eq:ELBO1}, we also see that the M-step increases the lower bound because 226 | it maximises $ \E\left[\log(P(X=x, Y=y\mid \Theta = \theta))\right] $. We conclude that both steps 227 | are increasing the lower bound on the log-likelihood. It follows that EM increases the data likelihood 228 | in every iteration (or leaves it unchanged at worst). 229 | 230 | We will finish with a quick aside on variational inference. EM is a special case of variational inference. 231 | Variational inference is any inference procedure which uses an auxiliary distribution $ Q $ to compute 232 | a lower bound on the likelihood. In the general setting, the auxiliary distribution can be different from the 233 | model posterior. This means that the bound need not be tight. However, in models in which the exact posterior 234 | is hard (read: impossible) to compute, using a non-tight lower bound instead can be incredibly useful! 235 | 236 | The reason this inference procedure is called \textit{variational} is that it is based on the 237 | \href{https://en.wikipedia.org/wiki/Calculus_of_variations}{calculus of variations}. This works mostly 238 | like normal calculus except that standard operations like differentiation are done with respect to functions 239 | instead of variables. 240 | 241 | %Naively, we could take the expectation with respect to any distribution 242 | %over latent values. Obviously, we would like to find the best one, i.e. the one that is closest to the 243 | %actual posterior. We can formalize this by introducing an auxiliary distribution\footnote{We follow 244 | %standard notation here by denoting the auxiliary distribution $ Q $ instead of $ P $. Also, the 245 | %parameter variable is chosen so as to distinguish it from the parameter variable of our model.} 246 | %$ Q(z\mid\Phi=\phi) $ under 247 | %which we compute the expected sufficient statistics. We want to find the auxiliary distribution that 248 | %is closest to actual posterior $ P_{Z\midX=x,\Theta=\theta} $. We measure closeness in an information-theoretic 249 | %sense using KL-divergence. Formally, our goal is to find 250 | %\begin{equation} 251 | %Q^{*}_{Z\mid\Phi=\phi} = \underset{Q_{Z\mid\Phi=\phi}}{\mbox{arg min}}~D\left( Q_{Z\mid\Phi=\phi} || P_{Z \mid X=x,\Theta=\theta} \right) \ . 252 | %\end{equation} 253 | 254 | 255 | 256 | \section*{Further Material} 257 | 258 | At the ILLC, there is a whole course about information theory, \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/}{currently taught by Christian Schaffner}. David MacKay also offers \href{http://www.inference.phy.cam.ac.uk/itprnn/book.pdf}{a free book on the subject}. Finally, 259 | Coursera also offers \href{https://www.coursera.org/course/informationtheory}{an online course on information theory}. 260 | 261 | The information-theoretic formulation of EM was pioneered in this \href{http://www.cs.toronto.edu/~fritz/absps/emk.pdf}{paper}. A very recent and intelligible 262 | \href{https://arxiv.org/abs/1601.00670}{tutorial on variational inference} can be found on the arXiv.
263 | 264 | \end{document} 265 | 266 | %%% Local Variables: 267 | %%% mode: latex 268 | %%% TeX-master: "chapter7" 269 | %%% End: 270 | -------------------------------------------------------------------------------- /chapter7/chapter7_forInclude.tex: -------------------------------------------------------------------------------- 1 | \chapter{Basics of Information Theory} 2 | 3 | When we talk about \textit{information}, we often use the term in a qualitative sense. We say things like 4 | \textit{This is valuable information} or 5 | \textit{We have a lack of information}. We can also make statements about some information being more helpful than other information. For a long time, however, 6 | people were unable to quantify information. The person who succeeded in this endeavour was \href{https://en.wikipedia.org/wiki/Claude_Shannon}{Claude E. Shannon}, 7 | who with his famous 1948 article \textit{A Mathematical Theory of Communication} single-handedly created a new discipline: Information Theory! He also revolutionised 8 | digital communication and can be seen as one of the main contributors to our modern communication systems such as the telephone and the internet. 9 | 10 | The beauty of information theory is that it is based on probability theory, and many results from probability theory seamlessly carry over to information theory. 11 | In this chapter, we are going to discuss the bare basics of information theory. These basics are often enough to understand many information-theoretic arguments 12 | that researchers make in fields like computer science, psychology and linguistics. 13 | 14 | \section{Surprisal and Entropy} 15 | Shannon's idea of information is as simple as it is compelling. The amount of \emph{surprisal} of an event $E$ is based on the inverse probability $1/P(E)$. Intuitively, rare events (where $P(E)$ is small) are more surprising than those occurring with high probability (where $P(E)$ is high). If we are observing a realisation of a random variable, this realisation is surprising if it is unlikely to occur according to the distribution of that random variable. However, if the probability for the realisation is very low, then on average it does not occur very often, meaning that if we sample from the RV repeatedly, we are not surprised very often. We are not surprised when the probability mass of the distribution is concentrated on only a small subset of its support. 16 | 17 | On the other hand, we quite often are surprised if we cannot predict what the outcome of our next draw from the RV might be. We are surprised when the distribution over values of the RV is (close to) uniform. Thus, we are going to be most surprised on average if we are observing realisations of a uniformly distributed RV. 18 | 19 | Shannon's idea was that observing RVs that cause a lot of surprises is informative because we cannot predict the outcomes and with each new outcome we have effectively learned something (namely that the $ i^{th} $ outcome took on the value that it did). Observing RVs with very concentrated distributions is not very informative under this conception because by just choosing the most probable outcome we can correctly predict most actually observed outcomes. Obviously, if I manage to predict an outcome beforehand, its occurrence is not teaching me anything. 20 | 21 | The goal of Shannon was to find a function that captures this intuitive idea. He eventually found it and showed that it is the only function to have properties that encompass the intuition.
This function is called the \textbf{entropy} of a RV and it is simply the expected \textbf{surprisal} value, expressed in bits. 22 | 23 | \begin{Definition}[Surprisal] 24 | The surprisal (value) of an outcome $ x \in \supp(X) $ of some RV $ X 25 | $ is defined as $ -\log_{2}(P(X=x)) = \log_2(\frac{1}{P(X=x)})$. 26 | \end{Definition} 27 | 28 | Notice that we are using the logarithm of base 2 here. This is because surprisal and entropy are standardly measured in bits. Intuitively, the surprisal measures how many bits one needs to encode an observed outcome given that one knows the distribution underlying that outcome. Check \href{http://www.umsl.edu/~fraundorfp/egsurpriNOLOGS.html}{this website} to get a feeling for surprisal values measured in bits. 29 | 30 | \begin{Definition}[Entropy] 31 | The entropy $H(P_X)$ of a RV $ X $ with distribution $P_X$ is defined as 32 | $$H(P_X) := \E[-\log_{2}(P(X=x))] = - \!\! \sum_{x \in \supp(X)} P(X=x) \log_2(P(X=x)) \, .$$ 33 | For the ease of notation, we often write $H(X)$ instead of $H(P_X)$. 34 | \end{Definition} 35 | 36 | The notational convenience of writing $H(X)$ instead of $H(P_X)$ can be confusing, because entropy is really assigning a (non-negative) real number to a distribution, i.e.\ $H(X)$ is {\bf not a function} of the random variable $X$ and it is {\bf not a random variable} either! Formally, for any random variable $X$ with distribution $P_X$ over the set $\mathcal{X}=\supp(X)$ (which might be categorical, i.e.\ $X$ could for instance take on values ``blue'', ``red'' and ``green''), we consider the surprisal function (in bits) $f(x) := -\log_2(P(X=x))$ mapping elements $x \in \mathcal{X}$ to real numbers $f(x) \in \mathbb{R}$. In that case, the surprisal $f(X)$ is a random variable over the reals and its expected value is well defined and called entropy $H(X) = H(P_X) := \E_X[f(X)]$. 37 | 38 | As an example, we consider the categorical random variable $X$ with distribution $P(X=\varheart)=P(X=\clubsuit)=1/4, P(X=\spadesuit)=1/2$. In that case, $\supp(X) = \{\varheart, \clubsuit, \spadesuit \}$ and surprisal values in bits are $f(\varheart)=f(\clubsuit)=\log_2(4)=2, f(\spadesuit)=\log_2(2)=1$. The entropy is the expected surprisal value, i.e.\ the individual surprisal values weighted with their corresponding probabilities of occurring: $H(X) = \E_X[f(X)] = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 + \frac{1}{2} \cdot 1 = 3/2$. 39 | 40 | The entropy ``does not care'' about the actual outcomes or labels of a random variable, but only about the distribution! In fact, not even the order of the actual probabilities matters, as we are taking an expected value and the additive terms commute. You can verify that the calculation of $H(X)=3/2$ in the example above applies to all random variables $X$ with distribution $(1/2, 1/4, 1/4)$, no matter what the actual outcomes are. 41 | 42 | \begin{Exercise} 43 | Compute the entropy of $Y \sim Binomial(n=2,p=1/2)$. 44 | \end{Exercise} 45 | 46 | The simplest and simultaneously most important example of entropy is given in Figure~\ref{fig:binaryEntropy} which shows the entropy of the Bernoulli distribution as a function of the parameter $ \theta \in [0,1]$. The entropy function of the Bernoulli is often called the \textbf{binary entropy} $h(\theta) := -\theta \cdot \log_2(\theta) - (1-\theta) \log_2(1-\theta)$. It measures the information of a binary decision, like a coin flip or an answer to a yes/no-question.
The entropy of the Bernoulli attains its maximum of 1 bit when the distribution is uniform, i.e.\ when both choices are equally
probable. The entropy is 0 if and only if the coin is fully biased towards heads or tails. As explained above, the entropy of the distributions $(\theta, 1-\theta)$ and $(1-\theta,\theta)$ is the same; therefore $h(\theta)=h(1-\theta)$ and the graph is symmetric around $1/2$.

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{figure}[t!]

{\centering \includegraphics[width=\maxwidth]{figure/binaryEntropy-1}

}

\caption[Binary entropy function]{Binary entropy function}\label{fig:binaryEntropy}
\end{figure}


\end{knitrout}

\medskip
From the plot it is also easy to see that entropy is never negative. It holds in general that entropy is non-negative,
because entropy is defined as the expectation of surprisal and surprisal is the negative logarithm of probabilities.
Because $ \log(x) \leq 0 $ for $ x \in (0,1] $, it is clear that $ -\log(x) \geq 0 $ for $ x $ in the same
interval. Notice that from here on we drop the subscript and by convention let $ \log = \log_{2} $.

A standard interpretation of the entropy is that it quantifies uncertainty. As we have pointed out before, a uniform distribution means that you are most uncertain, and indeed the uniform distribution maximizes the entropy. Moreover, the more choices you have to pick from uniformly, the more uncertain you are going to be. The entropy function also captures this intuition. Notice that if a discrete distribution is uniform, all probabilities are $ \frac{1}{|\supp(X)|} $. Clearly, as we increase $ |\supp(X)| $, we decrease the probabilities. By decreasing the probabilities, we increase their negative logarithms, and hence the average surprisal. Let us make this intuition more formal.

\begin{Theorem}
A discrete RV $ X $ with uniform distribution and support of size $ n $ has entropy
$ H(X) = \log(n) $.
\end{Theorem}

\paragraph{Proof:}
\begin{align}
H(X) &= \underset{x \in \supp(X)}{\sum}-\log(P(X=x))P(X=x) \\
&= \underset{x \in \supp(X)}{\sum} -\log(\frac{1}{|\supp(X)|})P(X=x) \\
&= \underset{x \in \supp(X)}{\sum}\log(n)P(X=x) = \log(n) \, .
\hspace{1cm} \square
\end{align}

\begin{Exercise}
You are trying to learn chess and you start by studying where chess grandmasters move their king when it
is positioned in one of the middle fields of the board. The king can move to any of the adjoining 8 fields. Since
you do not know a thing about chess yet, you assume that each move is equally probable. In this situation,
what is the entropy of moving the king?
\end{Exercise}

One of the first important results in information theory is \href{https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem}{Shannon's source-coding theorem}, which states that the entropy $H(X)$ of a random variable $X$ measures how many bits one will need on average to encode an outcome that is generated by the distribution $ P_{X} $.
This result applies to the real-world problem of data compression. Assume that $N$ data points are drawn iid from the distribution $P_X$. In that case, the source-coding theorem tells us that on average, we will need $N \cdot H(X)$ bits to store the (optimally compressed) data. For example, let $P_X$ be the $Bernoulli(\theta)$ distribution over bits.
In the case $\theta=1/2$, we have $N$ perfectly random bits which cannot be compressed, and hence we need $N \cdot H(X) = N \cdot h(\theta) = N \cdot h(1/2) = N$ bits of storage. For the general case $\theta \neq 1/2$ when the individual bits are biased, the graph of the binary entropy $h(\theta)$ in Figure~\ref{fig:binaryEntropy} tells us exactly what the compression ratio will be. We will not cover the proof of the source-coding theorem here, but refer to the literature instead.


\section{Conditional Entropy}
At the outset of this chapter we promised you that you could easily transfer results from probability
theory to information theory. We will not be able to show any kind of linearity for entropy because it contains
log-terms and the logarithm is not linear. We can, however, find alternative expressions for joint entropy (where
the joint entropy is simply the entropy of a joint RV). Before we do so, let us also define the notion of
conditional entropy. We have seen in Section~\ref{sec:jointconditionaldistributions} that $P_{X|Y=y}$ is a valid probability distribution for any $y \in \supp(Y)$ such that $P(Y=y)>0$. Hence, we can also define its conditional entropy.

\begin{Definition}[Conditional Entropy]
For two jointly distributed RVs $ X,Y $ and $y \in \supp(Y)$ such that $P(Y=y)>0$, the conditional entropy of $ X $ given that $ Y=y $ is defined as
\begin{align*}
H(X | Y=y) &:= \E_X[-\log_{2}(P(X=x | Y=y))] \\
&= - \!\! \sum_{x \in \supp(X)} P(X=x | Y=y) \log_2(P(X=x | Y=y))\, .
\end{align*}
The conditional entropy of $X$ given $Y$ is defined as
$$ H(X | Y) := \E_Y[ H(X | Y=y) ] = \sum_{y \in \supp(Y)} P(Y=y) H(X | Y=y) \, .$$
\end{Definition}

Intuitively, $H(X | Y)$ is the (average) uncertainty of $X$ after learning $Y$. Learning $Y$ (and in fact any information) cannot increase your uncertainty about $X$. Formally, one can prove the following
\begin{Lemma}[see e.g.\ Proposition~4 of \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/2016/notes/CramerFehr.pdf}{this script}] \label{lemma:noincrease}
For any two random variables $X,Y$ with joint distribution $P_{XY}$, it holds that $H(X | Y) \leq H(X)$.
\end{Lemma}
Note, however, that this non-increase of uncertainty only holds on average, as illustrated by the following example:

\paragraph{Example}
Consider the binary random variables $X$ and $Y$, with joint distribution
\begin{align*}
&P(X=0,Y=0) = \frac{1}{2}, \quad P(X=0,Y=1) = \frac{1}{4}\\
&P(X=1,Y=0) = 0, \quad P(X=1,Y=1) = \frac{1}{4}.
\end{align*}
By marginalization, we find that $P(X=0) = \frac{3}{4}$ and $P(X=1) = \frac{1}{4}$, while $P(Y=0) = P(Y=1) = \frac{1}{2}$. This allows us to make the following computations:
\begin{align*}
H(X,Y) &= \frac{1}{2}\log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2}\\
H(X) &= h\left(\frac{1}{4}\right) = h\left(\frac{3}{4}\right) \approx 0.81\\
H(Y) &= h\left(\frac{1}{2}\right) = 1\\
H(X|Y) &= P(Y=0) \cdot H(X | Y=0) + P(Y=1) \cdot H(X | Y=1)\\
&= \frac{1}{2} \cdot 0 + \frac12 \cdot 1 = \frac12 \\
H(Y|X) &= P(X=0) \cdot H(Y | X=0) + P(X=1) \cdot H(Y | X=1)\\
&= \frac{3}{4} \cdot h\left(\frac{1}{3} \right) + \frac{1}{4} \cdot 0 \approx 0.69
\end{align*}
% We also could have computed $H(X|Y)$ and $H(Y|X)$ directly through the definition of conditional entropy.
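All of these values are easy to check numerically. The following small R sketch (our own helper function, not part of the chapter's code chunks) recomputes them from the joint distribution:
\begin{verbatim}
# entropy (in bits) of a discrete distribution given as a vector of probabilities
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

joint <- matrix(c(1/2, 1/4,
                  0,   1/4), nrow = 2, byrow = TRUE)  # rows: X = 0,1; columns: Y = 0,1

H(joint)                                         # H(X,Y) = 1.5
pX <- rowSums(joint); H(pX)                      # H(X)   ~ 0.81
pY <- colSums(joint); H(pY)                      # H(Y)   = 1
sum(pY * apply(sweep(joint, 2, pY, "/"), 2, H))  # H(X|Y) = 0.5
sum(pX * apply(joint / pX, 1, H))                # H(Y|X) ~ 0.69
\end{verbatim}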
Note that for this specific distribution, learning the outcome $Y=1$ increases the uncertainty about $X$: $H(X|Y=1) > H(X)$. On average, however, we always have $H(X|Y) \leq H(X)$. It is important to remember that Lemma~\ref{lemma:noincrease} only holds on average, not for specific values of $Y$. Note also that in this example, $H(X|Y) \neq H(Y|X)$.

It is not a coincidence that the joint entropy $H(X,Y)$ in the example above is equal to $H(X|Y)+H(Y)$ and $H(Y|X)+H(X)$. One can prove this chain rule in general:

\begin{align*}
H(X,Y) &= \underset{\substack{x \in \supp(X)\\y \in \supp(Y)}}{\sum} -\log(P(X=x,Y=y)) \times P(X=x, Y=y) \\
\begin{split}
&= \underset{\substack{x \in \supp(X)\\ y \in \supp(Y)}}{\sum} -\log(P(X=x \mid Y=y)) \times P(X=x,Y=y) \\
&\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times \sum_{x \in \supp(X)} P(X=x,Y=y)
\end{split} \\
\begin{split}
&=\sum_{y \in \supp(Y)} P(Y=y) \times \sum_{x \in \supp(X)} -\log(P(X=x \mid Y=y)) \times P(X=x \mid Y=y) \\ &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times P(Y=y)
\end{split} \\
&= H(X | Y) + H(Y) \; .
\end{align*}

\begin{Exercise}
Prove that $ H(X,Y | Z) = H(X | Z) + H(Y | Z) $ if $ X \bot Y \mid Z $.
\end{Exercise}
As a corollary, we get that $H(X,Y)=H(X)+H(Y)$ for independent random variables $X$ and $Y$. More generally, the entropy of $n$ independent random variables is $H(X_1^n) = \sum_{i=1}^n H(X_i)$.


\section{An Information-Theoretic View on EM}
Now that we have seen some information-theoretic concepts, you may be happy to hear that there is an information-theoretic interpretation
of EM. This interpretation helps us to get a better intuition for the algorithm. To formulate that interpretation we need
one more concept, however.

\begin{Definition}[Relative Entropy]
The relative entropy of RVs \\ $ X,Y $ with distributions $P_X, P_Y$ and $\supp(X) \subseteq \supp(Y) $ is defined as
$$ D(P_X||P_Y) := \sum_{x \in \supp(X)} P(X=x) \log \frac{P(X=x)}{P(Y=x)} \ . $$
If $ P(Y=x) = 0 $ for any $ x \in \supp(X) $ we define $ D(P_X||P_Y) = \infty $. As with entropy, we often abbreviate $D(P_X||P_Y)$ with $D(X||Y)$.
\end{Definition}

The relative entropy is commonly known as \textbf{Kullback-Leibler (KL)} divergence. It compares the two distributions outcome by outcome, weighting each log-ratio of probabilities by $ P_{X} $. Intuitively,
it gives a measure of how ``far away'' $ P_{X} $ is from $ P_{Y} $. To
understand ``far away'', recall that entropy is a measure of
uncertainty.
% The
% relative entropy measure the uncertainty that you have about $ P_{X} $ if you know $ P_{Y} $\chris{hard to see why at this point}.
The divergence is small if both distributions place most
of their mass on the same outcomes. Since $ \log(1) = 0 $, the relative entropy is 0 if $ P_{X} = P_{Y} $.

It is worthwhile to point out the difference between relative and conditional entropy. Conditional entropy is the average entropy of $ X $ given that you
know what value $ Y $ takes on. In the case of relative entropy you do not know the value of $ Y $, only its distribution.

\begin{Exercise}
Show that $ D(X,Y||Y) = H(X | Y) $. Furthermore show that $ D(X,Y||Y) = H(X) $ if $ X\bot Y $.
\end{Exercise}


Let us start by remembering why we need EM.
We have a model that defines a joint distribution
over observed ($ x $) and latent data ($ y $). Such a model generally looks as follows:
\begin{equation}
P(X=x, Y=y \mid \Theta = \theta) = P(X=x \mid Y=y, \Theta=\theta) P(Y=y \mid \Theta = \theta)
\end{equation}
where we have chosen a factorization that provides a separate term for a distribution over only the
latent data.

Recall that the goal of the EM algorithm is to iteratively increase the likelihood through consecutive
updates of parameter estimates. These updates are achieved through maximum-likelihood estimation based
on expected sufficient statistics. We are now going to show a) that EM computes a lower bound on the
marginal log-likelihood of the data in each iteration and b) that this lower bound becomes tight when the
expected sufficient statistics are taken with respect to the model posterior. The latter implies that
EM performs the optimal update in each iteration.

Let us start by expanding the data log-likelihood and then lower-bounding it.
\begin{align}
&\log(P(X=x \mid \Theta=\theta)) = \log(\sum_y P(X=x, Y=y \mid \Theta = \theta)) \\
&= \log\left(\sum_{y} Q(Y=y \mid \Phi=\phi)\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\
&\geq \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right)
\label{eq:ELBO1}
\end{align}
Here, we have used \href{https://en.wikipedia.org/wiki/Jensen\%27s_inequality}{Jensen's Inequality} to
derive the lower bound. Observe that the log is indeed a concave function.

We have also introduced
an auxiliary distribution $ Q $ over the latent variables with parameters $ \phi $.
For reasons that we will explain shortly,
this distribution is often called the \textbf{variational distribution} and its parameters the
\textbf{variational parameters}. The letter $ Q $ is slightly non-standard for denoting distributions, but
we are following conventions from the field of \textbf{variational inference} here.

In the next step, we factorise the model distribution in order to recover a KL divergence term between
the variational distribution and the model posterior over latent variables.
\begin{align}
&\sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\
&= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)P(X=x \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\
&= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) + \log(P(X=x \mid \Theta=\theta)) \\
&= -D(Q||P) + \log(P(X=x \mid \Theta=\theta)) \label{eq:ELBO2}
\end{align}
Equation~\eqref{eq:ELBO2} gives us two insights. First, it quantifies the gap between the lower bound
and the actual data log-likelihood. This gap is equal to the KL divergence between the variational distribution
and the model posterior over latent variables. Second, since the KL divergence is non-negative and equals 0 exactly when the two distributions coincide, the bound only becomes
tight when $ Q=P $. But this is exactly what happens in the E-step! The E-step sets $ Q=P $ and
then computes expectations under that distribution (see Equation~\eqref{eq:ELBO1}). Thus, the E-step tightens
the lower bound on the marginal log-likelihood.
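Before we turn to the M-step, it is instructive to watch this guarantee at work. Below is a minimal R sketch of a full EM loop (both steps) for a toy two-component Gaussian mixture with unit variances; the model and all numbers are our own illustration, not part of the chapter's derivation. The printed log-likelihood never decreases:
\begin{verbatim}
set.seed(42)
x <- c(rnorm(200, mean = -2), rnorm(200, mean = 3))   # synthetic data

w <- 0.5; m1 <- -1; m2 <- 1                           # initial parameter guesses
loglik <- function() sum(log(w * dnorm(x, m1) + (1 - w) * dnorm(x, m2)))

for (it in 1:15) {
  # E-step: set Q to the model posterior over the latent component indicator
  r <- w * dnorm(x, m1) / (w * dnorm(x, m1) + (1 - w) * dnorm(x, m2))
  # M-step: maximise the expected complete-data log-likelihood under Q
  w  <- mean(r)
  m1 <- sum(r * x) / sum(r)
  m2 <- sum((1 - r) * x) / sum(1 - r)
  cat(sprintf("iteration %2d: log-likelihood %.4f\n", it, loglik()))
}
\end{verbatim}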
Looking back at Equation~\eqref{eq:ELBO1}, we also see that the M-step increases the lower bound because
it maximises the expected complete-data log-likelihood $ \E\left[\log(P(X=x, Y=y\mid \Theta = \theta))\right] $ with respect to $ \theta $. We conclude that both steps
increase the lower bound on the log-likelihood, and therefore that EM increases the data likelihood
in every iteration (or leaves it unchanged at worst).

We will finish with a brief aside on variational inference. EM is a special case of variational inference.
Variational inference is any inference procedure which uses an auxiliary distribution $ Q $ to compute
a lower bound on the likelihood. In the general setting, the auxiliary distribution can be different from the
model posterior. This means that the bound never gets tight. However, in models in which the exact posterior
is hard (read: impossible) to compute, using a non-tight lower bound instead can be incredibly useful!

The reason this inference procedure is called \textit{variational} is that it is based on the
\href{https://en.wikipedia.org/wiki/Calculus_of_variations}{calculus of variations}. This works mostly
like normal calculus, except that standard operations like differentiation are done with respect to functions
instead of variables.

%Naively, we could take the expectation with respect to any distribution
%over latent values. Obviously, we would like to find the best one, i.e. the one that is closest to the
%actual posterior. We can formalize this by introducing an auxiliary distribution\footnote{We follow
%standard notation here by denoting the auxiliary distribution $ Q $ instead of $ P $. Also, the
%parameter variable is chosen so as to distinguish it from the parameter variable of our model.}
%$ Q(z\mid\Phi=\phi) $ under
%which we compute the expected sufficient statistics. We want to find the auxiliary distribution that
%is closest to actual posterior $ P_{Z\midX=x,\Theta=\theta} $. We measure closeness in an information-theoretic
%sense using KL-divergence. Formally, our goal is to find
%\begin{equation}
%Q^{*}_{Z\mid\Phi=\phi} = \underset{Q_{Z\mid\Phi=\phi}}{\mbox{arg min}}~D\left( Q_{Z\mid\Phi=\phi} || P_{Z \mid X=x,\Theta=\theta} \right) \ .
%\end{equation}



\section*{Further Material}

At the ILLC, there is a whole course about information theory, \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/}{currently taught by Christian Schaffner}. David MacKay also offers \href{http://www.inference.phy.cam.ac.uk/itprnn/book.pdf}{a free book on the subject}. Finally,
Coursera also offers \href{https://www.coursera.org/course/informationtheory}{an online course on information theory}.

The information-theoretic formulation of EM was pioneered in this \href{http://www.cs.toronto.edu/~fritz/absps/emk.pdf}{paper}. A very recent and intelligible
\href{https://arxiv.org/abs/1601.00670}{tutorial on variational inference} can be found on the arXiv.
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter7"
%%% End:
-------------------------------------------------------------------------------- /chapter1/chapter1_forInclude.tex: --------------------------------------------------------------------------------
\chapter{Basic Probability And Combinatorics}

\section*{Notational conventions}
In this script we make use of certain notational conventions. We \textbf{bold-face} newly introduced
technical terms on first mention. Those are the terms whose definitions you are expected to know by heart
in this and following courses. \textit{Italics} serve the purpose of highlighting passages in the
script but also to discriminate linguistic examples from the rest of the text. Occasionally, we will
point to online references outside of this script. The corresponding links are coloured in
\href{http://en.wikibooks.org/wiki/LaTeX/Hyperlinks}{blue} and you are encouraged to click them.

We denote sets with uppercase letters and overload notation by using $ |\cdot| $ as both a function
that yields the cardinality of a set and the length of a sequence. Besides using standard notation
for set union and intersection, we denote the complement of a set $ S $ with respect to another set $ X $ by
$ X\backslash S $.

\section{Introduction}
\subsection{Why study probability theory?}
The fact that you have picked up this script and started reading it demonstrates that you already have
some interest in learning about probability theory. This probably means that you also have some conception
of what probability theory is and what to do with it. Nevertheless, we will take the opportunity to
quickly give you some additional motivation for studying probability theory.

This script is all about formalizing the notion of probability. In particular, we are interested in
giving a formal interpretation to statements like ``A is more probable than B''. Let us take a simple
example to demonstrate why this is useful: Suppose it is Monday and you have a date scheduled for
Friday. Obviously you want to impress your date. Unluckily, however, you have a tendency to be broke
come the weekend. The decision you have to make now is whether to take your date to a fancy restaurant
(the impressive but expensive option) or to just go for drinks (the cheaper option). On what basis can
you make this decision? Well, you can ask yourself whether it is more likely that you are broke on
Friday night or not. If you think that you being broke is the more probable outcome, you go for drinks; otherwise
you opt for the fancy restaurant.

The above is an example where we have used the intuitive notion of probability to assist us in decision
making. The first part, the computation of the probabilities of events (e.g.\ you being broke or not),
is something that we are going to develop in some detail in this script. The second part, the development
of a so-called \textit{decision rule} (e.g.\ to plan for the circumstances that are most probable to
occur in the future), is something that will be covered in later courses.

Here is a second example of what one can do with probability theory. Assume you want to invest in the
stock market. You will be putting in some money now and then you want to cash in on your gains (or losses)
in ten years' time, say.
Notice that this time around simply asking whether it is more probable that your
stock has risen or fallen in price is not enough. Even if your stock is worth more in ten years than it
was when you bought it, the absolute increase may be so minuscule that you could have found much better
investment options that would have yielded more gains. Even worse, if your gain is a smaller percentage
of your original capital than the overall inflation that occurred during the ten years of your investment,
you will actually have incurred a loss in terms of pure market power! So instead of asking whether
or not your stock will be worth more than what it was when you first bought it, you should rather
ask how much of an absolute gain you can expect from your investment. This second application of probability
theory, the computation of expectations over real values, is something we are going to cover in this
script as well.

Alright, we hope that this has gotten you excited for the rest of the script. Let's get going!

\section{Sample spaces and events}
The whole of probability theory is based on assigning probability values to elements of a
\textbf{sample space}. The members of the sample space are referred to as \textbf{outcomes} or \textbf{samples}.

\begin{Definition}[Sample Space] A sample space is any \href{http://en.wikipedia.org/wiki/Borel_set}{Borel set}
$ \Omega $. We denote the members of a sample space by $ \omega \in \Omega $.
\end{Definition}

Standard examples of sample spaces are the flipping of a coin and the rolling of a die. Formally,
the sample space of a die roll is $ \Omega = \{1,2,3,4,5,6\} $. The sample space of a coin toss
would consist of heads and tails. However, it is often more convenient to represent outcomes numerically.
In the context of this course, we will achieve this by imposing any total order on the sample space and then identifying the outcomes with the positions they occupy in the corresponding ordered list. In this spirit we let
the sample space of a coin toss be $ \Omega = \{1,2\} $ where $ 1 $ represents heads and $ 2 $ represents
tails, say (the other way around would be just as fine).

More generally, we denote a sample space with $ n $ members as $ \Omega = \{1,\ldots,n\} $. A useful
metaphor that we will often use is to think of generating an outcome from a sample space as a blind draw from an urn with $ n $ balls
that are numbered and possibly coloured but otherwise indistinguishable. The rolling of a die, for example,
corresponds to drawing a ball from an urn with balls numbered $ 1 $ to $ 6 $. A somewhat more involved
example is that of writing an English sentence of six words, for example the sentence:
\textit{To be or not to be}. The process of writing this sentence can be conceptualized as drawing
six balls from an urn that contains balls corresponding to words
in the English language\footnote{This is obviously a very unrealistic conception of how English
sentences are written as it totally ignores the fact that the words in a sentence are dependent on each
other and have to be placed in a particular order.}. Note that this will be a rather large urn as
\href{http://www.languagemonitor.com/number-of-words/number-of-words-in-the-english-language-1008879}
{the vocabulary of the English language has already exceeded 1 million words}.
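If you like to experiment, the urn metaphor maps directly onto R's built-in sample function; the following one-liners (our own illustration) perform the draws just described:
\begin{verbatim}
sample(1:6, size = 1)                   # one die roll: a blind draw from an urn with balls 1..6
sample(1:2, size = 1)                   # a coin toss, with 1 for heads and 2 for tails
sample(1:10, size = 6, replace = TRUE)  # six draws with replacement from a (tiny) vocabulary urn
\end{verbatim}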
In our sample spaces as defined above, it is easy to distinguish individual outcomes. However, oftentimes
we do not care about the outcomes themselves but about properties that some of them share. In the
die example we might only be interested in whether the outcome is even or odd. Transferring this scenario to the urn metaphor, we would colour the balls with odd numbers green and the balls
with even numbers red. Again, any other colours are just as fine. All that matters is that
we can discriminate a member of $ E = \{2,4,6\} $ from a member of $ O = \{1,3,5\} $. We do \textit{not}
need to discriminate between the outcomes that are members of the same set! In this particular setting
$ E $ and $ O $ are the \textbf{events} that we are interested in.

\begin{Definition}[Event]
An event $ A $ is any subset $ A \subseteq \Omega $.
\end{Definition}

Events are what usually interest us in probability theory. Just as with outcomes, we can
also define the notion of an event space.

\newpage
\begin{Definition}[Event space]
An event space associated with a sample space $ \Omega $ is a set $ \mathcal{A} $ such that
\begin{enumerate}
\item $ \mathcal{A} $ is non-empty
\item If $ A \in \mathcal{A} $ then $ A \subseteq \Omega $
\item If $ A \in \mathcal{A} $ then $ \Omega \setminus A \in \mathcal{A} $
\item If $ A,B \in \mathcal{A} $ then $ A \cup B \in \mathcal{A} $
\end{enumerate}
\end{Definition}

Notice that since $ \emptyset \subseteq S $ for any set $ S $ we always have $ \Omega \in \mathcal{A} $
by item 3.

\begin{Exercise}
You can also arrive at the conclusion that $ \Omega \in \mathcal{A} $ always holds in a
different (and arguably more cumbersome) way. How so?
%Solution: By item 1, $ \mathcal{A} $ is non-empty. Thus we can assume $ A \in \mathcal{A} $. But then also
%$ A^{C_{\Omega}} \in \mathcal{A} $ by item 3. Item 4 then implies that $ \Omega \in \mathcal{A} $.
\end{Exercise}

The fact that event spaces are closed under the set complement operation is very convenient. Say I
organized a dinner party and invited $ 10 $ people. The day after, you ask me if more than $ 8 $ people
actually showed up. I just answer that I was very disappointed that my friends Mary and Paul did
not come. Although I did not directly address your question, you know that the answer is negative. After
all, I informed you that the complement event of the event you asked about had occurred.

\begin{Exercise}
In the above party example, what is the sample space? What is the smallest possible event space that is necessary to
model the situation just described?
% Solution: $ \Omega = \{x_{1} \ldots x_{10} | x_{i} \in \{0,1\}\} $
% $ \mathcal{A} = \{\Omega, \emptyset, \{\omega \in \Omega | \sum x_{i} > 8\},
% \{\omega \in \Omega | \sum x_{i} \leq 8\}\} $
\end{Exercise}

In general, we will not worry too much about constructing an event space every time we encounter a new
problem. The \textbf{power set} of the sample space conveniently happens to fulfil all the requirements
we have for event spaces, so we will just always use it. Thus, all we will ever need to worry about
is the construction of sample spaces since we now know how to construct event spaces from them in a
simple manner.
In case you are a bit rusty, here is a reminder of what a power
set is.

\begin{Definition}[Power Set]
The power set $ \mathcal{P}(S) $ of any set $ S $ is defined as $ \mathcal{P}(S) := \{ s \mid s \subseteq S \} $, i.e.\ the set of all subsets of $ S $.
\end{Definition}

In general, this leaves us with the pair $ (\Omega, \mathcal{P}(\Omega)) $. For outcomes in a sample space,
let us stress again an important difference, namely that $ \omega \in \Omega $ but
$ \{\omega\} \in \mathcal{A}$.

\section{Some basic combinatorics}
Combinatorics is the mathematics of counting. Counting is of course a very basic problem that may
be solved by just looking at each element of a set. However, this na\"ive procedure is often
unreasonably time consuming. Moreover, it does not allow us to make general statements about sets of any
size, i.e.\ sets of size $ n $.

In order to assess the size of our sample spaces, we would like to make such general statements. The reason
is that when we are dealing with probability we often start from \textbf{uniform probabilities}
on the sample space, where by uniform probability we simply mean the value $ \frac{1}{|\Omega|} $. This
is the probability we will assign to each and every $ \omega \in \Omega $. We now say that all the
elements in our sample space are equally probable.
Note that at this point we are using probabilities solely for the purpose of motivating combinatorics, which
is kind of a hack because we haven't even told you yet what a probability is. However, we hope that you
find the idea of uniform probabilities somewhat intuitive.

Let us start from scratch: What is the cardinality (size) of the sample space of a die roll? It
is $ 6 $ because $ |\{1,2,3,4,5,6\}| = 6 $. Now what if we roll two dice? The sample space for each
individual die is already known. Let us call it $ \Omega_{1} $. The sample space for the rolling of two dice
is then just the Cartesian product of two such sample spaces, i.e.\
$ \Omega_{2} = \Omega_{1} \times \Omega_{1} = \{(x,y)|x \in \Omega_{1}, y \in \Omega_{1}\} $. Since
the cardinality of the Cartesian product of two sets $ S $ and $ S' $ is $ |S| \times |S'| $, we conclude
that $ |\Omega_{2}| = |\Omega_{1} \times \Omega_{1}| = |\Omega_{1}| \times |\Omega_{1}|
= |\Omega_{1}|^{2} = 36 $.

Unsurprisingly, this method of performing a draw from the same sample space (urn) multiple times generalizes to any number of
times $ n > 2 $. Nicely enough, it also generalizes to sets of different sizes (again by the Cartesian product
argument from above). However, we have to impose one important restriction on the use of this technique: it
may only be applied when the sample spaces are independent, i.e.\ when the outcome of one space does
not affect the outcome of the other. Oftentimes, we will simply assume that this is the case, though.

The technique of inferring the size of a complex sample space from the sizes of the sample spaces
it is constructed from is known as the \textbf{basic principle of counting}.

\begin{Definition}[Basic principle of counting]
The basic principle of counting states that if two draws from sample spaces of size
$ M $ and $ N $ respectively are performed independently of each other then the sample space
composed from them has size $ M \times N $.
\end{Definition}

\begin{Exercise}
Let us assume that a football game is played for strictly 90 minutes. Both teams start with 11 players.
A red card to a player results
in that player being sent off the pitch. According to the rules of football, the game is stopped prematurely when either
team has only 6 or fewer players remaining on the pitch. We are now interested in how many possible
situations (we assume that situations occur in one-minute intervals) there are in which the game still progresses,
one or more red cards have been issued and exactly four goals have been scored. Give the corresponding sample space and its size.
%Solution: We define three sample spaces: $ \Omega_{M} = {1 \cdots 89} $ for minutes played,
%$ \Omega_{R} = \{(x_{1},x_{2})|x_{1},x_{2} \in \{0,1,2,3,4\}, x_{1} + x_{2} > 0\} $ for red cards shown and
%$ \Omega_{G} = \{(x_{1},x_{2})|x_{1},x_{2} \in \{0,1,2,3,4\}, x_{1} + x_{2} = 4\} $ for total goals
%scored. Clearly, $ |\Omega_{M}| = 89 $, $ |\Omega_{R}| = 20 $ and $ |\Omega_{G}| = 5 $. Our total
%sample space is the Cartesian product of those three and its size is $ 89\times 20 \times 5 = 8900 $.
\end{Exercise}

Note that up to now we have implicitly assumed that we would put every drawn ball back into the urn. This
is also referred to as \textbf{sampling with replacement}. Let us now look at problems for \textbf{sampling
without replacement}, i.e.\ problems where we are shrinking our sample space at each draw. One class of such
problems is known as \textbf{permutation} problems.

\begin{Definition}[Permutation]
A permutation on a set $ S $ is a bijection $ \sigma : S \rightarrow S : s \mapsto \sigma(s) $.
\end{Definition}

Oftentimes people also use the word permutation to refer to the image of a set under a permutation. What we
need permutations for in practice is the reordering of ordered sets (which we will call lists). For example
the permutations of the list $ L = (1,2,3) $ are:
\begin{itemize}
\item $ \sigma_{1} = \{1 \mapsto 1, 2 \mapsto 2, 3 \mapsto 3 \} \hfill \sigma_{1}(L) = (1,2,3) $
\item $ \sigma_{2} = \{1 \mapsto 1, 2 \mapsto 3, 3 \mapsto 2 \} \hfill \sigma_{2}(L) = (1,3,2) $
\item $ \sigma_{3} = \{1 \mapsto 2, 2 \mapsto 1, 3 \mapsto 3 \} \hfill \sigma_{3}(L) = (2,1,3) $
\item $ \sigma_{4} = \{1 \mapsto 2, 2 \mapsto 3, 3 \mapsto 1 \} \hfill \sigma_{4}(L) = (2,3,1) $
\item $ \sigma_{5} = \{1 \mapsto 3, 2 \mapsto 1, 3 \mapsto 2 \} \hfill \sigma_{5}(L) = (3,1,2) $
\item $ \sigma_{6} = \{1 \mapsto 3, 2 \mapsto 2, 3 \mapsto 1 \} \hfill \sigma_{6}(L) = (3,2,1) $
\end{itemize}

The way to think about a permutation as a draw from an urn is to look at each of the positions in the list in
turn and insert an element from $ S $. Since a permutation is a bijection, we can only use each
$ s \in S $ exactly once. This is precisely what it means to sample without replacement. Once a ball
is drawn, it is removed from the urn. Let us make this effect concrete in the above example. For position one
we have three elements to choose from. Hence we are dealing with a sample space of size $ 3 $. Position two
still leaves us $ 2 $ choices, giving us a sample space of size $ 2 $. Finally, the element in the last position
is totally determined as we are dealing with a sample space of size $ 1 $.
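This draw-by-draw argument is easy to verify by brute force. The short R sketch below (a recursive helper of our own, not from the script) enumerates all orderings of a small list and confirms the count we are about to derive:
\begin{verbatim}
perms <- function(v) {                 # all orderings of the vector v
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in perms(v[-i]))            # fix v[i] in front, permute the rest
      out <- c(out, list(c(v[i], p)))
  out
}
length(perms(1:3))                     # 6 orderings, i.e. 3 * 2 * 1
factorial(3)                           # the same number via R's factorial
\end{verbatim}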
% The danger here is that we might be giving the impression that you can sample from a sample space $\Omega$ without replacement which does not make much sense in the probability world.

Applying the basic principle of counting we now know that there are $ 3 \times 2 \times 1 $ permutations of the list
$ (1,2,3) $. Incidentally, this proves our above example to be correct. More generally, if we have to reorder
a list with $ n $ distinct elements (or draw without replacement from an urn with $ n $ numbered balls), there
are $ n \times (n-1) \times \ldots \times 2 \times 1 $ permutations. Since this is pretty painful to write down,
we introduce a more succinct notation, provided by the \textbf{factorial} function.

\begin{Definition}[Factorial]
The factorial $ n! $ of a non-negative natural number $ n \in \mathbb{N} $ is defined recursively as
\begin{itemize}
\item $ 0! = 1 $
\item $ k! = k\times (k-1)! $ for $ 0 < k \leq n $
\end{itemize}
\end{Definition}

From the above discussion we can now conclude that the number of permutations on a set or list of size $ n $
is $ n! $.

We can also define the notion of a k-permutation on a set $ S $ of size $ n $ such that $ k < n $.
This means we are still drawing without replacement but we do not fully empty the urn. The reasoning for how
many of those k-permutations there are remains exactly the same. There are $ n \times (n-1) \times \ldots \times (n-k+2)
\times (n-k+1) $ such permutations (make sure you understand why!). In order to ease notation we can again
sneak in the factorial through multiplying this number with $ 1 $ in disguise. Concretely, we write
\begin{align*}
&n \times (n-1) \times \ldots \times (n-k+2) \times (n-k+1) \times 1 \\
=& n \times (n-1) \times \ldots \times (n-k+2) \times (n-k+1) \times \frac{(n-k)!}{(n-k)!} \\
=& \frac{n!}{(n-k)!}
\end{align*}
for the number of k-permutations on a set of size $ n $.

We will not see k-permutations all that often in this script but they constitute a helpful stepping stone to another
concept that will be of crucial importance. Let us draw $ k $ balls from an urn with $ n $ balls where $ k \leq n $ and disregard
the order in which we draw them. A classical example of such a setting would be the lottery, where you are only interested in the
balls drawn but not in the order in which they were drawn. We already know that for a set of $ k $ balls there are $ \frac{n!}{(n-k)!} $
orders in which we can draw them, as this is a $ k $-permutation on our urn. Now, though, we need to get rid of the different
orderings. This is to say that we want to count each set of $ k $ balls that we can draw only once and not once per permutation of it.
Luckily, we know how many permutations of a set of size $ k $ there are, namely $ k! $. Thus we divide out this number of permutations,
yielding $ \frac{n!}{(n-k)!\times k!} $ as the number of possible ways to draw $ k $ \textit{different} balls from an urn with $ n $
balls.
At this point we should take a break and pat our own backs. After all, we have just derived one of the most important combinatorial
formulas, which is known as the \textbf{binomial coefficient}.
\begin{Definition}[Binomial coefficient]
The binomial coefficient $ \binom{n}{k} $ is defined as
$$ \binom{n}{k} := \dfrac{n!}{(n-k)!\times k!} $$
for $ 0 < n, 0 \leq k \leq n $. It counts the number of ways
to sample $ k $ distinct elements from a set with a total of $ n $ elements without regard to the order in which they are drawn.
For this reason, it is pronounced ``n choose k''.
\end{Definition}

\begin{Exercise}
In the German lottery you have to bet on a set of $ 6 $ numbered balls to be drawn out of a total of $ 49 $ balls. Assuming that
each ball is equally likely to be drawn, what is the chance of an individual bet to win the jackpot? The Dutch lottery is
slightly more involved. They also draw an additional coloured ball from $ 6 $ coloured balls. In order to win the jackpot you need to have
the number-colour combination right. What is your chance here?
%Solution: There are $ \binom{49}{6} = 13{,}983{,}816 $ ways of betting on $ 6 $ balls. Thus the win probability is $ \frac{1}{13{,}983{,}816} $.
%For the Dutch lottery it is even $ \frac{1}{13{,}983{,}816 \times 6} = \frac{1}{83{,}902{,}896} $.
\end{Exercise}

The binomial coefficient will become crucially important later on. A common application that you will see in this and other courses
is counting the number of bit strings with certain properties. A bit is a variable that can take on values in $ \{0,1\} $. By the
basic principle of counting there are $ 2^{n} $ bit strings of length $ n $. How many bit strings of length $ 5 $ are there that contain
exactly $ 3 $ ones? Well, there are $ 2^{5} = 32 $ bit strings of that length in total and $ \binom{5}{3} = 10 $ of them contain exactly
three ones. Unsurprisingly, this is the same as the number of 5-bit strings with exactly $ 2 $ zeros.
The moral lesson here is that $ \binom{n}{k} = \binom{n}{n-k} $, as can easily be seen from the definition. Some other trivia about the
binomial coefficient are that $ \binom{n}{0} = \binom{n}{n} = 1 $. Again, this follows directly from the definition. Somewhat trickier
is the fact that $ \binom{n}{1} = \binom{n}{n-1} = n $. Can you derive this?

We can straightforwardly generalize the idea of the binomial coefficient to choosing more than just one set of objects. This means that instead of just
looking at red versus non-red balls, say, we now distinguish between all the colours in our urn. For our strings this means that we move away from
bit strings to strings with larger alphabets, e.g.\ strings written in the English alphabet (which has 26 letters). Let's say we have $ r $ red,
$ b $ blue, $ g $ green and $ y $ yellow balls in our urn such that $ n = r+b+g+y $ is the total number of balls in the urn. How many different
colour sequences can we draw? Well, we first arrange the $ r $ red balls in $ r $ out of $ n $ positions. This can be done in
$ \binom{n}{r} $ ways. We then place the $ b $ blue balls in $ \binom{n-r}{b} $ ways. Next, we place the $ g $ green balls in $ \binom{n-r-b}{g} $
ways. Finally, we place the remaining yellow balls deterministically in the remaining positions since $ \binom{n-r-b-g}{y} = \binom{y}{y} = 1 $.
We compute the total number of arrangements as

\begin{gather}
\binom{n}{r} \binom{n-r}{b} \binom{n-r-b}{g} \binom{n-r-b-g}{y} = \\
\dfrac{n!}{r!\times (n-r)!} \times \dfrac{(n-r)!}{b!\times (n-r-b)!} \times \dfrac{(n-r-b)!}{g! \times (n-r-b-g)!} \times 1 = \\
\dfrac{n!}{r!b!g!y!}
\end{gather}

Observe that the last equality follows because many of the factorials cancel and because we know that $ n-r-b-g = y $. We have now worked with
only four colours, but the general case follows directly by induction on the number of colours (with the binomial coefficient as base case).
Thus, we can define the \textbf{multinomial coefficient}.

\begin{Definition}[Multinomial coefficient]
The multinomial coefficient for choosing $ k $ sets of objects of sizes $ m_{1}, \ldots, m_{k} $ from a total of $ n = \sum_{i=1}^{k} m_{i} > 0 $
objects is $$ \dfrac{n!}{\prod_{i=1}^{k} m_{i}!} $$
\end{Definition}

\section*{Further material}
For a slow and thorough introduction to combinatorics, see \href{http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111840436X.html}{Faticoni (2013):
Combinatorics}. At the ILLC, there is \href{http://homepages.cwi.nl/~rdewolf/combinatorics14.html}{a biannual course on combinatorics},
taught by Ronald de Wolf. Online, Princeton also offers \href{https://www.coursera.org/course/ac}{a course on combinatorics}.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter1"
%%% End:
-------------------------------------------------------------------------------- /multivariateGaussian/multivariateGaussian_forInclude.tex: --------------------------------------------------------------------------------
\chapter{The Gaussian Distribution}

If there is any one distribution that has traversed mathematics and found a home in cultural memory, it surely is the \textbf{Gaussian} or \textbf{normal distribution}
(both names are common and we will use them interchangeably here).
Not only is it super-useful in many data modelling applications, it also has a host of convenient mathematical properties, some of which we are going
to explore in this chapter.

Before going into any detail, let us first motivate this distribution. What we want is a distribution on a real vector space ($ \mathbb{R}^{n} $). We will
start out with the simplest case and first look at the Gaussian distribution on the real line. Our desiderata for the Gaussian\footnote{Notice that Gauss' original
motivation was different from ours. While we are giving a largely geometric account of the normal distribution, Gauss was concerned with finding a distribution
on $ n $ independent points whose maximum likelihood estimate (see Section~\ref{eq:parameterEstimation}) is their arithmetic mean.} are as follows:
\begin{itemize}
\item The distribution should be centred around one specific point which we will call the mean
\item The more distant a point is from the mean, the less probable it should be
\item The distance metric should be adjustable so as to assign distant points more or less probability as needed
\item Equally distant points should have the same probability, independent of their direction
\end{itemize}

\section{The Univariate Gaussian}

The Gaussian distribution is one of the most important and most widely used distributions in all of statistics.
The reason is that many natural observations
tend to be normally distributed. Many other distributions are also based on it or can be approximated by a Gaussian. Finally, there are several mathematical
properties of the Gaussian that make calculating with it rather easy. In this section we will look at the \textbf{univariate} Gaussian distribution, that is,
the Gaussian distribution in one dimension. In Section~\ref{sec:mvGauss} we will also see how to model highly structured data in $ \mathbb{R}^{n} $
with \textbf{multivariate} Gaussian distributions.

\begin{figure}
\center
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}
\includegraphics[width=\maxwidth]{figures/uniGauss-1}

\end{knitrout}
\caption{Standard normal density (left) and with variance $ \sigma^{2} = 2 $ (right).}
\label{fig:uniGauss}
\end{figure}

\subsection{Deriving the Density}

What we want is a distribution that models spatial data, i.e.\ data that lives in some vector space. There should be a centre of mass around which the data concentrates,
and deviation from that centre of mass should be ``penalized'', meaning that the further away from the centre a data point is, the less probable it should be.
Since we are interested in modelling spatial data in real vector spaces, we will choose the \href{https://en.wikipedia.org/wiki/Euclidean_distance}{Euclidean
distance} as a distance measure. In the case of one dimension, the Euclidean distance is simply the absolute difference, and we will work with its square: for $ x,y \in \mathbb{R} $, the squared Euclidean
distance is $ (y - x)^{2} $. Notice that this is symmetric, as any good distance metric should be.

As it looks right now, all deviations are going to be penalized to the same extent: the penalty depends only on the difference between the two points, on one fixed scale.
What if we want to be a bit stricter and penalize points that are far away from the centre even more, or conversely, if we wanted to be lenient and diminish the penalty
for deviation from the centre? In such a case we would have to scale the Euclidean distance. In fact, there is a generalization of the Euclidean distance that
allows for scaling. It is called the \href{https://en.wikipedia.org/wiki/Mahalanobis_distance}{Mahalanobis distance}. In the one-dimensional case, it introduces
a scale factor by which the difference between two points is scaled. The (squared) Mahalanobis distance between $ x,y \in \mathbb{R} $ is
\begin{equation*}
\left(\frac{x - y}{\sigma}\right)^{2}
\end{equation*}
where $ \sigma > 0 $ is an adjustable scale factor. If $ \sigma < 1 $ it will exaggerate the difference between $ x $ and $ y $ and hence lead to a greater
penalty for distant points. Conversely, if $ \sigma > 1 $ it will lessen the difference between $ x $ and $ y $ and therefore lead to a smaller penalty for
distant points. The square of $ \sigma $ is called the \textbf{variance} and is used to parametrize the Gaussian distribution, while $ \sigma $ itself is
known as the \textbf{standard deviation}.

Now that we have found an appropriate (and adjustable!) distance metric, we have to turn it into a probability density. The standard way of turning any quantity
into a probability density is by simply exponentiating it. This way, it is guaranteed to be positive.
In the present case, we want the probability
to decrease as the distance between the two points increases. Thus we are actually going to exponentiate the negative of the Mahalanobis distance. Finally, we
might want to differentiate that distance at some point. Whenever we do so, we are going to have to deal with the squaring function. In order to make our lives
easier when differentiating, we also prefix the Mahalanobis distance with $ \nicefrac{1}{2} $ before exponentiating it. The result is
\begin{equation}
\exp\left(-\frac{1}{2} \left(\frac{x - y}{\sigma}\right)^{2} \right) \ .
\end{equation}

Notice that so far we have one adjustable parameter in that expression, namely the scale factor $ \sigma $ from the Mahalanobis distance. Initially we said
that points which follow a Gaussian distribution should be arranged around a centre, which is more commonly known as the \textbf{mean}.
Let us call this mean $ \mu $. In order to vary the location of the centre,
we turn $ \mu $ into a parameter (we simply replace $ y $ with $ \mu $). To recap, $ \mu $ determines the location
of the centre of the Gaussian density and $ \sigma $ scales it. The parameters are therefore called \textbf{location parameter} and \textbf{scale parameter}, respectively.
We now have an expression that is proportional to the Gaussian density. Whenever a RV $ X $ is distributed according to a normal distribution with location parameter
$ \mu $ and scale parameter $ \sigma $, we write $ X \sim \N{\mu}{\sigma^{2}} $. The corresponding density is
\begin{equation}
p(x) \propto \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2} \right) \ .
\end{equation}
In order to get a proper density, we still need to normalize. This requires a non-trivial integration that falls outside the scope of this subsection\footnote{If
you are interested in seeing several different proofs, check \href{https://en.wikipedia.org/wiki/Gaussian_integral}{here}. Laplace's proof is probably the easiest to follow.}. We
will just state the normalizer here. The full univariate normal density with parameters $ \mu $ (mean) and $ \sigma^{2} $ (variance) is
\begin{equation}
p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2} \right) \ .
\end{equation}

Notice that we actually never need this general density. Why? We can transform any Gaussian distribution into a \textbf{standard normal distribution}. This is
the normal distribution with 0 mean and unit variance, $ \N{0}{1} $. It is so important that its density even has its own notation:
\begin{equation}
\varphi(x) = p(x) \mbox{ where } X \sim \N{0}{1} \ .
\end{equation}
(The corresponding cumulative distribution function is commonly denoted $ \Phi $.) Any Gaussian variable can be normalized to a standard normal variable. This is often done in many applications.

\begin{Exercise}
Show that for $ X \sim \N{\mu}{\sigma^{2}} $ we have $ \frac{X - \mu}{\sigma} \sim \N{0}{1} $. The processes of subtracting the mean and dividing
by the standard deviation are called centering and normalization, respectively.
\end{Exercise}

\section{The Multivariate Gaussian$ ^{*} $}\label{sec:mvGauss}

Our goal in this section is to define a Gaussian distribution on $ \mathbb{R}^{n} $. This will require quite a bit of linear algebra. Readers who have not taken
a linear algebra course are advised to skip this section.
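As a preview of where this section is heading, the following R sketch (with made-up numbers; the mvtnorm package used elsewhere in this chapter provides rmvnorm for the same purpose) generates correlated Gaussian vectors by pushing independent standard normal coordinates through an affine transformation:
\begin{verbatim}
n <- 5000
mu <- c(1, -1)                            # desired mean vector
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2)    # desired (symmetric, PSD) covariance matrix

L <- t(chol(Sigma))                       # any L with L %*% t(L) == Sigma will do
Z <- matrix(rnorm(2 * n), nrow = 2)       # columns are i.i.d. standard normal vectors
X <- mu + L %*% Z                         # affine transformation

rowMeans(X)                               # close to mu
cov(t(X))                                 # close to Sigma
\end{verbatim}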
Let us start out by considering a random vector whose $ n $ dimensions are independent. That means that the probability of the vector can be factorised:
\begin{equation}
p(\vec{x}) = \prod_{i=1}^{n} p(x_{i})
\end{equation}
If each of the dimensions is distributed according to the same Gaussian $ \mathcal{N}(\mu, \sigma^{2}) $, we can easily generate random vectors of this form
by making $ n $ independent draws from that Gaussian.

Unfortunately, this severely limits our ability to model data. Not only can we never model correlations between dimensions, we also require that all dimensions
have the same variance. The data that we can model needs to be extremely homogeneous.

We could lessen this problem by drawing each random dimension from a different Gaussian. This way, we would be able to assign different means and variances to different
dimensions. However, we could still not capture covariances. What we need is a single Gaussian over $ \mathbb{R}^{n} $. This will allow us to model (potentially) dependent
dimensions. Having a mean vector with different mean values per dimension is trivial. In fact, we will further assume that the means of the dimensions are independent of
each other. Thus our mean vector will simply be
$ \vec{\mu} = \begin{bmatrix}
\mu_{1} & \ldots & \mu_{n}
\end{bmatrix} $. We do, however, allow the dimensions to be correlated. To express such correlations we need to compactly store the variances and
covariances of the dimensions. To do this, we introduce \textbf{covariance matrices}.

\subsection{Covariance Matrices}

\begin{Definition}[Covariance matrix]
An $ n \times n $ matrix $ \Sigma $ is called a covariance matrix of an $ n $-dimensional RV $ X $ if for $ 0 < i,j \leq n $
$$ \Sigma_{j,i} = cov(X_{j}, X_{i}) \ . $$
\end{Definition}

The covariance matrix has a couple of important properties which we will use when computing with it.
\begin{enumerate}
\item \textbf{Symmetry:} follows from the definition and the symmetry of the covariance.
\item \textbf{Positive semi-definiteness:} see below.
\end{enumerate}
Notice that some authors will actually define covariance matrices to be symmetric, positive semi-definite matrices. This is fine in so far as any matrix with
these properties is a valid covariance matrix. When we construct models of data, we may actually simply stipulate the (co)variances and thus build a covariance matrix.

\textbf{Proof of positive semi-definiteness} Recall that an $ n \times n $ matrix $ M $ is positive semi-definite (PSD)
if for all $ z \in \mathbb{R}^{n}\setminus \{0\} $ it holds
that $ z^{\top}Mz \geq 0 $. Observe that we can write a covariance matrix $ \Sigma $ as the expectation of an outer product.
\begin{equation}
\Sigma = \E\left[(\vec{X} - \E[\vec{X}])(\vec{X} - \E[\vec{X}])^{\top}\right]
\end{equation}
For all $ z \in
\mathbb{R}^{n}\setminus \{0\} $ we have
\begin{align}
z^{\top}\Sigma z &= z^{\top} \E\left[(\vec{X} - \E[\vec{X}])(\vec{X} - \E[\vec{X}])^{\top}\right] z \\
&= \E\left[ z^{\top} (\vec{X} - \E[\vec{X}])(\vec{X} - \E[\vec{X}])^{\top} z \right] \\
&= \E\left[ \left((\vec{X} - \E[\vec{X}])^{\top} z\right)^{\top} \left((\vec{X} - \E[\vec{X}])^{\top} z\right) \right] \\
&= \E\left[ \left((\vec{X} - \E[\vec{X}])^{\top} z\right)^{2} \right] \geq 0
\end{align}
where the last line holds because $ (\vec{X} - \E[\vec{X}])^{\top} z $ is a scalar. The result essentially follows from the linearity of expectation.

The importance of being positive semi-definite may not be immediately apparent. It lies in the fact that many results are easily proven for positive semi-definite
matrices. Any result that holds for positive semi-definite matrices also holds for covariance matrices. We will occasionally use this property in our proofs below.

Another important result is based solely on the symmetry of the matrix. By the spectral theorem we know that any symmetric matrix $ M $ can be factorized as
\begin{equation}\label{eq:eigenvalueDecomp}
M = U \Lambda U^{-1}
\end{equation}
where $ \Lambda $ is a diagonal matrix and $ U $ is orthonormal. Let us try to interpret this decomposition. The orthonormal matrix $ U^{-1} $ is a linear map from
$ \mathbb{R}^{n} $ to $ \mathbb{R}^{n} $. It effectively rotates the input. The matrix $ \Lambda $ then scales each coordinate of the input and finally the matrix
$ U $ rotates the scaled input back. From the spectral theorem we know that the entries of $ \Lambda $ are the eigenvalues of $ M $. Therefore, the columns of $ U $
are the corresponding eigenvectors normalized to unit length. The decomposition thus gives us an efficient way of finding the eigenvalues of $ M $. We are now going
to show that these eigenvalues are always non-negative for PSD matrices.

\begin{Lemma}[Eigenvalues of PSD matrices are non-negative]
Assume this was not the case. Let $ z $ be an eigenvector of a positive
semi-definite matrix $ A $ with negative eigenvalue $ \lambda $. Then we get $ z^{\top}Az = z^{\top}\lambda z = \lambda z^{\top} z < 0 $, which contradicts the
premise that $ A $ is positive semi-definite. $ \square $
\end{Lemma}

We conclude that positive semi-definite matrices (and thus covariance matrices) only have non-negative eigenvalues.
This in turn implies that PSD matrices always have square roots. These roots can easily be derived as
\begin{equation}\label{eq:PSDRoots}
M = U \Lambda^{\nicefrac{1}{2}}\Lambda^{\nicefrac{1}{2}} U^{-1} = \left(U \Lambda^{\nicefrac{1}{2}}U^{-1}\right) \left(U \Lambda^{\nicefrac{1}{2}} U^{-1}\right), \quad \mbox{hence} \quad M^{\nicefrac{1}{2}} = U \Lambda^{\nicefrac{1}{2}} U^{-1} \ .
\end{equation}

Covariance matrices are not always used in practice. It is sometimes more convenient to use their inverse instead. That inverse, $ \Sigma^{-1} $, is called a precision
matrix. The names are telling: the entries in the covariance matrix measure to what extent two dimensions grow or shrink in relation to each other. The higher
that value, the more deviation from the mean we will observe.
The entries in the precision matrix tell us how precise (i.e.
Covariance matrices are not always used in practice. It is sometimes more convenient to use their inverse instead. That inverse, $ \Sigma^{-1} $, is called a precision
matrix. The names are telling: the entries in the covariance matrix measure to what extent two dimensions grow or shrink in relation to each other. The higher
that value, the more deviation from the mean we will observe.
The entries in the precision matrix tell us how precise (i.e.\ how close to the mean) the distribution is. Higher
precision means that we are going to observe less deviation from the mean vector.

\subsection{Deriving the Density}

Now that we have learned about the covariance matrix, we are all set to define the multivariate Gaussian. Let us take a step back and remind ourselves of how easy
it was to generate random vectors with independent means and variances. For the multivariate Gaussian, we will have to replace the mean with a mean vector (whose
components are again independent\footnote{Notice that our presentation is taking place in a frequentist setting. In Bayesian probability theory, the claim that
the dimensions of the mean vector are independent may very well be false.}) and a covariance matrix, changing the notation from $ \N{\mu}{\sigma^{2}} $
to $ \N{\vec{\mu}}{\Sigma} $. As before, the parameter values are exactly equal to the mean and (co)variance of the distribution.

While we have not yet properly defined the multivariate Gaussian, we can already explore some of its properties. By simple linearity of expectation, we have
for any vector $ \vec{y} \in \mathbb{R}^{n} $ and any random vector $ X \sim \N{\vec{\mu}}{\Sigma} $ that $ \E[X + \vec{y}] = \E[X] + \vec{y} $ and therefore that
$ X + \vec{y} \sim \N{\vec{\mu} + \vec{y}}{\Sigma} $. Similarly, by properties of the (co)variance and the expectation,
we know that for any matrix $ A \in \mathbb{R}^{n \times n} $ it holds that
$ var(AX) = A \, var(X) A^{\top} = A\Sigma A^{\top} $ and therefore that $ AX \sim \N{A\vec{\mu}}{A\Sigma A^{\top}} $.

Taken together, the fact that $ AX + \vec{y} \sim \N{A\vec{\mu} + \vec{y}}{A\Sigma A^{\top}} $ is called the \textbf{affine property} of the Gaussian distribution.
Any affine transformation of a Gaussian RV will again yield a Gaussian RV. We will exploit this fact to define the multivariate Gaussian distribution. Recall
how easy our lives would be if all dimensions of the multivariate Gaussian were independent; easier still if their variances were also identical.
Let us start from this scenario with unit variance. The \textbf{standard multivariate Gaussian} is then simply $ \N{0}{I} $ where $ I $ is the identity matrix. Only
the diagonal of this matrix is populated and all diagonal values are the same, meaning
that the dimensions are uncorrelated (and, for a Gaussian, therefore independent) and have identical unit variance. We can now construct infinitely many other multivariate Gaussians with the same covariance
properties by shifting the mean. Given a RV $ X \sim \N{0}{I} $ we achieve this by defining $ Y = X + \vec{\mu} \sim \N{\vec{\mu}}{I} $ (this follows from the affine
property), where $ \vec{\mu} $ is our desired mean.

Now that we can derive multivariate Gaussians with any mean we like, let us turn to the covariance matrix. We can change the identical variance by simply multiplying
a standard normal RV with a scalar of our choice. Formally, if $ X \sim \N{0}{I} $ and $ \sigma \in \mathbb{R} $ then
$ Y = \sigma X \sim \N{0} {\sigma I \sigma} = \N{0} {\sigma^{2}I} $. This shouldn't come as too much of a surprise since this is how we would adjust
the variance of a univariate Gaussian. However, we still cannot model covariance at this point.

Instead of multiplying $ X \sim \N{0}{I} $ with a scalar, let us use a matrix instead.
In the interest of relieving all suspense, let us call this matrix $ \Sigma^{\nicefrac{1}{2}} $
(you see where we are going with this, don't you?). By the affine property, we have
$ Y = \Sigma^{\nicefrac{1}{2}} X \sim \N{0}{\Sigma^{\nicefrac{1}{2}} I \left(\Sigma^{\nicefrac{1}{2}}\right)^{\top}} = \N{0}{\Sigma} $.
%Notice that although we have called the matrix that we multiply $ X $ with $ \Sigma^{\nicefrac{1}{2}} $, we do usually not need to compute this square root explicitly. Any
%square matrix will do the job.

What we can conclude from the above is that we can derive any multivariate Gaussian distribution from the standard multivariate normal simply by applying an
appropriate affine transformation. Thus, all we need to do is to derive the density for the standard multivariate Gaussian. This is super-simple! The mean is $ 0 $
in all dimensions and the variances are identically and independently 1. For a random vector $ \vec{X} \sim \N{0}{I} $ and any $ \vec{x} \in \mathbb{R}^{n} $ this means
\begin{align}\label{eq:mvstandardNormal}
p(\vec{x}) &= \prod_{i=1}^{n} p(x_{i}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \times 1} \exp \left(-\frac{1}{2}\left( \frac{x_{i} - 0}{1} \right)^{2} \right) \\
&= \frac{1}{\sqrt{(2\pi)^{n}}} \exp \left(-\frac{1}{2}\left( \sum_{i=1}^{n}x_{i}^{2} \right) \right) \ . \nonumber
\end{align}

We know the density of the standard multivariate normal distribution and we know how to derive any other multivariate Gaussian from that distribution. Before
we derive the general density for multivariate Gaussians, let us finally define multivariate Gaussian RVs.
\begin{Definition}[Multivariate Normal Distribution]
An $ n $-dimensional random vector $ \vec{X} \in \mathbb{R}^{n} $ has a multivariate normal distribution with an $ n $-dimensional mean parameter
$ \vec{\mu} $ and an $ n \times n $ covariance matrix $ \Sigma $ if it has the same distribution as $ \vec{\mu} + LZ $ where $ LL^{\top} = \Sigma $ and
the dimensions of $ Z $ are i.i.d. according to a univariate standard normal distribution, i.e. $ Z_{i} \sim \N{0}{1} $ for $ 0 < i \leq n $.
\end{Definition}
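This definition translates directly into a sampling recipe. Here is a sketch of ours (it reuses \texttt{Sigma} and \texttt{Sigma.half} from the square-root sketch above; the mean vector is an arbitrary choice, and any $ L $ with $ LL^{\top} = \Sigma $ would do):

\begin{verbatim}
# Sketch: sampling from N(mu, Sigma) as Y = mu + L Z per the definition.
mu <- c(1, -2)
L  <- Sigma.half                          # one valid choice of L
Z  <- matrix(rnorm(2 * 10000), nrow = 2)  # columns are iid N(0, I) vectors
Y  <- mu + L %*% Z                        # 10000 samples, one per column
rowMeans(Y)                               # close to mu
cov(t(Y))                                 # close to Sigma
\end{verbatim}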
With this definition at hand, let us derive the general multivariate density. The problem is that under a general covariance matrix the dimensions are not independent anymore.
Thus, we cannot readily apply the factorization from Equation~\eqref{eq:mvstandardNormal}. The question now is whether we can substitute the covariance matrix with
another matrix under which the dimensions are indeed independent. The spectral theorem answers this question positively. Recall that all symmetric matrices can be decomposed
according to Equation~\eqref{eq:eigenvalueDecomp}. The matrix $ \Lambda $ has the eigenvalues on its diagonal. Since it is similar to the original matrix,
they both have the same eigenvalues. It is clear from Equation~\eqref{eq:PSDRoots} that we can use $ \Sigma^{\nicefrac{1}{2}} = U \Lambda^{\nicefrac{1}{2}} U^{-1} $ when applying the affine transformation of the standard normal distribution.
For $ X \sim \N{0}{I}, \vec{\mu} \in \mathbb{R}^{n}, A = U\Lambda^{\nicefrac{1}{2}}U^{-1} \in \mathbb{R}^{n\times n} $ and $ Y = AX + \vec{\mu} $ we exploit
the fact that $ Y \sim \N{\vec{\mu}}{AIA^{\top}} = \N{\vec{\mu}}{\Sigma} $. Since the map $ \vec{x} \mapsto A\vec{x} + \vec{\mu} $ is invertible, we can obtain the
density of $ Y $ from the density of $ X $ through the change-of-variables formula for densities,
$ p_{Y}(\vec{y}) = p_{X}\left(A^{-1}(\vec{y} - \vec{\mu})\right) \left|\det A^{-1}\right| $.
\begin{align}
p(\vec{y})
&= p_{X}\left(A^{-1}(\vec{y} - \vec{\mu})\right) \left|\det A^{-1}\right| \\
&= \frac{1}{\sqrt{\left( 2\pi \right)^{n}} \, \left|\det A\right|}
\exp \left(-\frac{1}{2} \left( \vec{y} - \vec{\mu}\right)^{\top} A^{-1} A^{-1} \left(\vec{y} - \vec{\mu} \right) \right)
\label{eq:quadraticForm} \\
&= \frac{1}{\sqrt{\left( 2\pi \right)^{n}} \, \left|\det A\right|}
\exp \left(-\frac{1}{2} \left( \vec{y} - \vec{\mu}\right)^{\top} \Sigma^{-1} \left(\vec{y} - \vec{\mu} \right) \right) \\
&= \frac{1}{\sqrt{\left( 2\pi \right)^{n} |\Sigma|}}
\exp \left(-\frac{1}{2} \left( \vec{y} - \vec{\mu}\right)^{\top} \Sigma^{-1} \left(\vec{y} - \vec{\mu} \right) \right) \label{eq:mvGDensityDet}
\end{align}
Before we interpret this density, whose standard form is given in \eqref{eq:mvGDensityDet}, let us clarify the derivation. In
Equation~\eqref{eq:quadraticForm} we have plugged in the standard normal density from Equation~\eqref{eq:mvstandardNormal} and used
the fact that $ A $ is symmetric, so that $ \left(A^{-1}\right)^{\top} = A^{-1} $ and $ \left|\det A^{-1}\right| = 1/\left|\det A\right| $.
We then replaced $ A^{-1}A^{-1} = (AA)^{-1} $ with $ \Sigma^{-1} $. In the final line we have explicitly calculated the determinant in the normalizer.
\begin{align}
\det A &= \det\left( U \Lambda^{\nicefrac{1}{2}} U^{-1} \right) = \det(U) \det\left(\Lambda^{\nicefrac{1}{2}}\right) \det\left(U^{-1}\right) \label{eq:diagonality} \\
&= \det\left(\Lambda^{\nicefrac{1}{2}}\right) = \prod_{k=1}^{n} \Lambda_{kk}^{\nicefrac{1}{2}} = \sqrt{|\Lambda|} \label{eq:orthogonality}
\end{align}
In the above, line \eqref{eq:diagonality} follows from the multiplicativity of the determinant and line \eqref{eq:orthogonality} follows from the fact that
$ \det(U)\det(U^{-1}) = 1 $ and that $ \Lambda $ is diagonal. The normalizer thus contains a product of (square roots of) eigenvalues of a diagonal matrix, which is equal to the (square root of the)
determinant of that matrix. Since similar matrices have the same eigenvalues, this is the same as the determinant of $ \Sigma $. This completes our derivation of the multivariate Gaussian density.
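To make the formula concrete, here is a sketch that evaluates \eqref{eq:mvGDensityDet} directly and compares it against \texttt{dmvnorm} from the \texttt{mvtnorm} package; the point \texttt{y} is arbitrary, and \texttt{mu} and \texttt{Sigma} are the toy values from the sketches above.

\begin{verbatim}
# Sketch: evaluating the multivariate Gaussian density by hand.
library(mvtnorm)
y <- c(0.3, -0.5)
dens <- exp(-0.5 * t(y - mu) %*% solve(Sigma) %*% (y - mu)) /
  sqrt((2 * pi)^2 * det(Sigma))
all.equal(as.numeric(dens), dmvnorm(y, mean = mu, sigma = Sigma))  # TRUE
\end{verbatim}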
From the derivation, you probably already have some intuition of what's going on. Let us make this intuition more precise. A normal distribution with zero mean
and covariance matrix $ \sigma I $ for $ \sigma \in \mathbb{R} $ defines a ball in which most of the probability mass lies. Being a ball, this structure is perfectly
round, showing the same amount of spread in all directions \textbf{FIGURE A}. Since the mean is zero, the ball is centred at the origin. If we change the mean, we are
shifting the ball's centre away from the origin \textbf{FIGURE B}. We can also stretch or squash the ball along the coordinate axes by letting the elements on the diagonal of the
covariance matrix vary independently \textbf{FIGURE C}. Things become really interesting, though, when we use a full covariance matrix. Then we can define an
ellipsoid which contains most of the mass \textbf{FIGURE D}.

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}
\includegraphics[width=\maxwidth]{figure/multiGauss-1}

\end{knitrout}

How is all of this accomplished? By decomposing the covariance matrix, we have already seen that the covariance mostly depends on the eigenvalues of the covariance matrix.
In fact, since scaling is done by $ U\Lambda^{\nicefrac{1}{2}} U^{-1} $, it is the square roots of the eigenvalues that define the spread. They are the dimension-wise standard deviations.
The matrix $ \Lambda $ performs the same mapping as $ A $, only in eigenspace. As we have seen, this mapping is much simpler in eigenspace because $ \Lambda $ is
diagonal. The process of computing the multivariate Gaussian density can thus be broken down into three steps: map into eigenspace, apply the transformation given by
$ \Lambda $ and map back into the original space.
--------------------------------------------------------------------------------
/chapter7/chapter7.tex:
--------------------------------------------------------------------------------
\documentclass[a4paper,11pt,leqno]{report}\usepackage[]{graphicx}\usepackage[]{color}
%% maxwidth is the original width if it is less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
  \ifdim\Gin@nat@width>\linewidth
    \linewidth
  \else
    \Gin@nat@width
  \fi
}
\makeatother

\definecolor{fgcolor}{rgb}{0.345, 0.345, 0.345}
\newcommand{\hlnum}[1]{\textcolor[rgb]{0.686,0.059,0.569}{#1}}%
\newcommand{\hlstr}[1]{\textcolor[rgb]{0.192,0.494,0.8}{#1}}%
\newcommand{\hlcom}[1]{\textcolor[rgb]{0.678,0.584,0.686}{\textit{#1}}}%
\newcommand{\hlopt}[1]{\textcolor[rgb]{0,0,0}{#1}}%
\newcommand{\hlstd}[1]{\textcolor[rgb]{0.345,0.345,0.345}{#1}}%
\newcommand{\hlkwa}[1]{\textcolor[rgb]{0.161,0.373,0.58}{\textbf{#1}}}%
\newcommand{\hlkwb}[1]{\textcolor[rgb]{0.69,0.353,0.396}{#1}}%
\newcommand{\hlkwc}[1]{\textcolor[rgb]{0.333,0.667,0.333}{#1}}%
\newcommand{\hlkwd}[1]{\textcolor[rgb]{0.737,0.353,0.396}{\textbf{#1}}}%

\usepackage{framed}
\makeatletter
\newenvironment{kframe}{%
 \def\at@end@of@kframe{}%
 \ifinner\ifhmode%
  \def\at@end@of@kframe{\end{minipage}}%
  \begin{minipage}{\columnwidth}%
 \fi\fi%
 \def\FrameCommand##1{\hskip\@totalleftmargin \hskip-\fboxsep
 \colorbox{shadecolor}{##1}\hskip-\fboxsep
     % There is no \\@totalrightmargin, so:
     \hskip-\linewidth \hskip-\@totalleftmargin \hskip\columnwidth}%
 \MakeFramed {\advance\hsize-\width
   \@totalleftmargin\z@ \linewidth\hsize
   \@setminipage}}%
 {\par\unskip\endMakeFramed%
 \at@end@of@kframe}
\makeatother

\definecolor{shadecolor}{rgb}{.97, .97, .97}
\definecolor{messagecolor}{rgb}{0, 0, 0}
\definecolor{warningcolor}{rgb}{1, 0, 1}
\definecolor{errorcolor}{rgb}{1, 0, 0}
\newenvironment{knitrout}{}{} % an empty environment to be redefined in TeX

\usepackage{alltt}

\usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem}
\usepackage{nicefrac}

\usepackage{hyperref}
\hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true}

\newmdtheoremenv{Definition}{Definition}[chapter]
\newmdtheoremenv{Exercise}[Definition]{Exercise}
\newmdtheoremenv{Theorem}[Definition]{Theorem}
\newmdtheoremenv{Lemma}[Definition]{Lemma}

\newcommand{\supp}{\operatorname{supp}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\eps}{\varepsilon}

\DeclareSymbolFont{extraup}{U}{zavm}{m}{n}
\DeclareMathSymbol{\varheart}{\mathalpha}{extraup}{86}
\DeclareMathSymbol{\vardiamond}{\mathalpha}{extraup}{87}


\newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}}
\newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}}

\title{Basic Probability}
\date{}
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\begin{document}

\setcounter{chapter}{6}

\chapter{Basics of Information Theory}

When we talk about \textit{information}, we often use the term in a qualitative sense. We say things like
\textit{This is valuable information} or
\textit{We have a lack of information}. We can also make statements about some information being more helpful than other information. For a long time, however,
people were unable to quantify information. The person who succeeded in this endeavour was \href{https://en.wikipedia.org/wiki/Claude_Shannon}{Claude E. Shannon}
who with his famous 1948 article \textit{A Mathematical Theory of Communication} single-handedly created a new discipline: Information Theory! He also revolutionised
digital communication and can be seen as one of the main contributors to our modern communication systems like the telephone, the internet etc.

The beauty of information theory is that it is based on probability theory, and many results from probability theory seamlessly carry over to information theory.
In this chapter, we are going to discuss the bare basics of information theory. These basics are often enough to understand many information-theoretic arguments
that researchers make in fields like computer science, psychology and linguistics.

\section{Surprisal and Entropy}
Shannon's idea of information is as simple as it is compelling. The amount of \emph{surprisal} of an event $E$ grows with the inverse probability $1/P(E)$. Intuitively, rare events (where $P(E)$ is small) are more surprising than those occurring with high probability (where $P(E)$ is high). If we are observing a realisation of a random variable, this realisation is surprising if it is unlikely to occur according to the distribution of that random variable. However, if the probability for the realisation is very low, then on average it does not occur very often, meaning that if we sample from the RV repeatedly, we are not surprised very often. We are not surprised when the probability mass of the distribution is concentrated on only a small subset of its support.

On the other hand, we are surprised quite often if we cannot predict what the outcome of our next draw from the RV might be. We are surprised when the distribution over values of the RV is (close to) uniform. Thus, we are going to be most surprised on average if we are observing realisations of a uniformly distributed RV.
Shannon's idea was that observing RVs that cause a lot of surprises is informative because we cannot predict the outcomes and with each new outcome we have effectively learned something (namely that the $ i^{th} $ outcome took on the value that it did). Observing RVs with very concentrated distributions is not very informative under this conception because by just choosing the most probable outcome we can correctly predict most actually observed outcomes. Obviously, if I manage to predict an outcome beforehand, its occurrence is not teaching me anything.

The goal of Shannon was to find a function that captures this intuitive idea. He eventually found it and showed that it is (up to the choice of units) the only function with properties that capture this intuition. This function is called the \textbf{entropy} of a RV and it is simply the expected \textbf{surprisal} value, expressed in bits.

\begin{Definition}[Surprisal]
The surprisal (value) of an outcome $ x \in \supp(X) $ of some RV $ X
$ is defined as $ -\log_{2}(P(X=x)) = \log_2(\frac{1}{P(X=x)})$.
\end{Definition}

Notice that we are using the logarithm of base 2 here. This is because surprisal and entropy are standardly measured in bits. Intuitively, the surprisal measures how many bits one needs to encode an observed outcome given that one knows the distribution underlying that outcome. Check \href{http://www.umsl.edu/~fraundorfp/egsurpriNOLOGS.html}{this website} to get a feeling for surprisal values measured in bits.

\begin{Definition}[Entropy]
The entropy $H(P_X)$ of a RV $ X $ with distribution $P_X$ is defined as
$$H(P_X) := \E[-\log_{2}(P(X=x))] = - \!\! \sum_{x \in \supp(X)} P(X=x) \log_2(P(X=x)) \, .$$
For the ease of notation, we often write $H(X)$ instead of $H(P_X)$.
\end{Definition}

The notational convenience of writing $H(X)$ instead of $H(P_X)$ can be confusing, because entropy is really assigning a (non-negative) real number to a distribution, i.e.\ $H(X)$ is {\bf not a function} of the random variable $X$ and it is {\bf not a random variable} either! Formally, for any random variable $X$ with distribution $P_X$ over the set $\mathcal{X}=\supp(X)$ (which might be categorical, i.e.\ $X$ could for instance take on values ``blue'', ``red'' and ``green''), we consider the surprisal function (in bits) $f(x) := -\log_2(P(X=x))$ mapping elements $x \in \mathcal{X}$ to real numbers $f(x) \in \mathbb{R}$. In that case, the surprisal $f(X)$ is a random variable over the reals and its expected value is well defined and called entropy $H(X) = H(P_X) := \E_X[f(X)]$.

As an example, we consider the categorical random variable $X$ with distribution $P(X=\varheart)=P(X=\clubsuit)=1/4, P(X=\spadesuit)=1/2$. In that case, $\supp(X) = \{\varheart, \clubsuit, \spadesuit \}$ and surprisal values in bits are $f(\varheart)=f(\clubsuit)=\log_2(4)=2, f(\spadesuit)=\log_2(2)=1$. The entropy is the expected surprisal value, i.e.\ the individual surprisal values weighted with their corresponding probabilities of occurring: $H(X) = \E_X[f(X)] = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 + \frac{1}{2} \cdot 1 = 3/2$.

The entropy ``does not care'' about the actual outcomes or labels of a random variable, but only about the distribution! In fact, not even the order of the actual probabilities matters, as we are taking an expected value and the additive terms commute. You can verify that the calculation of $H(X)=3/2$ in the example above applies to all random variables $X$ with distribution $(1/2, 1/4, 1/4)$, no matter what the actual outcomes are.
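One way to do that verification is with a one-line R function (a sketch of ours, not one of the chapter's generated code chunks; the helper name \texttt{H} is our choice):

\begin{verbatim}
# Sketch: entropy in bits of a discrete distribution given as a vector.
H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))  # treats 0 log 0 as 0
H(c(1/4, 1/4, 1/2))   # 1.5, the card-suit example above
H(c(1/2, 1/4, 1/4))   # same value: order and labels do not matter
\end{verbatim}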
\begin{Exercise}
Compute the entropy of $Y \sim Binomial(n=2,p=1/2)$.
\end{Exercise}

The simplest and simultaneously most important example of entropy is given in Figure~\ref{fig:binaryEntropy} which shows the entropy of the Bernoulli distribution as a function of the parameter $ \theta \in [0,1]$. The entropy function of the Bernoulli is often called the \textbf{binary entropy} $h(\theta) := -\theta \cdot \log_2(\theta) - (1-\theta) \log_2(1-\theta)$. It measures the information of a binary decision, like a coin flip or an answer to a yes/no-question.
The entropy of the Bernoulli attains its maximum of 1 bit when the distribution is uniform, i.e.\ when both choices are equally
probable. The entropy is 0 if and only if the coin is fully biased towards heads or tails. As explained above, the entropy of the distributions $(\theta, 1-\theta)$ and $(1-\theta,\theta)$ is the same and therefore $h(\theta)=h(1-\theta)$ and the graph is symmetric around $1/2$.

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{figure}[t!]

{\centering \includegraphics[width=\maxwidth]{figure/binaryEntropy-1}

}

\caption[Binary entropy function]{Binary entropy function}\label{fig:binaryEntropy}
\end{figure}


\end{knitrout}

\medskip
From the plot it is also easy to see that entropy is never negative. It holds in general that entropy is non-negative,
because entropy is defined as the expectation of surprisal and surprisal is the negative logarithm of probabilities.
Because $ \log(x) \leq 0 $ for $ x \in (0,1] $, it is clear that $ -\log(x) \geq 0 $ for $ x $ in the same
interval. Notice that from here on we drop the subscript and by convention let $ \log = \log_{2} $.

A standard interpretation of the entropy is that it quantifies uncertainty. As we have pointed out before, a uniform distribution means that you are most uncertain and indeed the uniform distribution maximizes the entropy. However, the more choices you have to pick from uniformly, the more uncertain you are going to be. The entropy function also captures this intuition. Notice that if a discrete distribution is uniform, all probabilities are $ \frac{1}{|\supp(X)|} $. Clearly, as we increase $ |\supp(X)| $, we decrease the probabilities. By decreasing the probabilities, we increase their negative logarithms, and hence the average surprisal. Let us make this intuition more formal.

\begin{Theorem}
A discrete RV $ X $ with uniform distribution and support of size $ n $ has entropy
$ H(X) = \log(n) $.
\end{Theorem}

\paragraph{Proof:}
\begin{align}
H(X) &= \underset{x \in \supp(X)}{\sum}-\log(P(X=x))P(X=x) \\
&= \underset{x \in \supp(X)}{\sum} -\log\left(\frac{1}{|\supp(X)|}\right)P(X=x) \\
&= \underset{x \in \supp(X)}{\sum}\log(n)P(X=x) = \log(n) \, .
\hspace{1cm} \square
\end{align}
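Reusing the \texttt{H} sketch from above, this is easy to check numerically for, say, a fair die:

\begin{verbatim}
# Sketch: a uniform distribution over n outcomes has entropy log2(n).
n <- 6
all.equal(H(rep(1/n, n)), log2(n))   # TRUE: about 2.58 bits for a fair die
\end{verbatim}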
\begin{Exercise}
You are trying to learn chess and you start by studying where chess grandmasters move their king when it
is positioned on one of the central squares of the board. The king can move to any of the 8 adjoining squares. Since
you do not know a thing about chess yet, you assume that each move is equally probable. In this situation,
what is the entropy of moving the king?
\end{Exercise}

One of the first important results in information theory is \href{https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem}{Shannon's source-coding theorem} which states that the entropy $H(X)$ of a random variable $X$ measures how many bits one will need on average to encode an outcome that is generated by the distribution $ P_{X} $.
This result applies to the real-world problem of data compression. Assume that $N$ data points are drawn iid from the distribution $P_X$. In that case, the source-coding theorem tells us that on average, we will need $N \cdot H(X)$ bits to store the (optimally compressed) data. For example, let $P_X$ be the $Bernoulli(\theta)$ distribution over bits. In the case $\theta=1/2$, we have $N$ perfectly random bits which cannot be compressed, and hence we need $N \cdot H(X) = N \cdot h(\theta) = N \cdot h(1/2) = N$ bits of storage. For the general case $\theta \neq 1/2$ when the individual bits are biased, the graph of the binary entropy $h(\theta)$ in Figure~\ref{fig:binaryEntropy} tells us exactly what the compression ratio will be. We will not cover the proof of the source-coding theorem here, but refer to the literature instead.
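As a quick numerical illustration (our sketch, building on \texttt{H} from above; the bias and the number of bits are arbitrary choices):

\begin{verbatim}
# Sketch: expected storage for N biased bits per the source-coding theorem.
h <- function(theta) H(c(theta, 1 - theta))   # binary entropy
N <- 1e6
N * h(0.5)   # 1e6: fair bits are incompressible
N * h(0.1)   # ~469000: biased bits compress to less than half
\end{verbatim}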
\section{Conditional Entropy}
At the outset of this chapter we promised you that you could easily transfer results from probability
theory to information theory. We will not be able to show any kind of linearity for entropy because it contains
log-terms and the logarithm is not linear. We can, however, find alternative expressions for joint entropy (where
the joint entropy is simply the entropy of a joint RV). Before we do so, let us also define the notion of
conditional entropy. We have seen in Section~\ref{sec:jointconditionaldistributions} that $P_{X|Y=y}$ is a valid probability distribution for any $y \in \supp(Y)$ such that $P(Y=y)>0$. Hence, we can also define its entropy, the conditional entropy.

\begin{Definition}[Conditional Entropy]
For two jointly distributed RVs $ X,Y $ and $y \in \supp(Y)$ such that $P(Y=y)>0$, the conditional entropy of $ X $ given that $ Y=y $ is defined as
\begin{align*}
H(X | Y=y) &:= \E_{X|Y=y}[-\log_{2}(P(X=x | Y=y))] \\
&= - \!\! \sum_{x \in \supp(X)} P(X=x | Y=y) \log_2(P(X=x | Y=y))\, .
\end{align*}
The conditional entropy of $X$ given $Y$ is defined as
$$ H(X | Y) := \E_Y[ H(X | Y=y) ] = \sum_{y \in \supp(Y)} P(Y=y) H(X | Y=y) \, .$$
\end{Definition}

Intuitively, $H(X | Y)$ is the (average) uncertainty of $X$ after learning $Y$. Indeed, learning $Y$ (and in fact any information) cannot increase your uncertainty about $X$. Formally, one can prove the following
\begin{Lemma}[see e.g.\ Proposition~4 of \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/2016/notes/CramerFehr.pdf}{this script}] \label{lemma:noincrease}
For any two random variables $X,Y$ with joint distribution $P_{XY}$, it holds that $H(X | Y) \leq H(X)$.
\end{Lemma}
Note, however, that this non-increase of uncertainty only holds on average, as illustrated by the following example:

\paragraph{Example}
Consider the binary random variables $X$ and $Y$, with joint distribution
\begin{align*}
&P(X=0,Y=0) = \frac{1}{2}, \quad P(X=0,Y=1) = \frac{1}{4}\\
&P(X=1,Y=0) = 0, \quad P(X=1,Y=1) = \frac{1}{4}.
\end{align*}
By marginalization, we find that $P(X=0) = \frac{3}{4}$ and $P(X=1) = \frac{1}{4}$, while $P(Y=0) = P(Y=1) = \frac{1}{2}$. This allows us to make the following computations:
\begin{align*}
H(X,Y) &= \frac{1}{2}\log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2}\\
H(X) &= h\left(\frac{1}{4}\right) = h\left(\frac{3}{4}\right) \approx 0.81\\
H(Y) &= h\left(\frac{1}{2}\right) = 1\\
H(X|Y) &= P(Y=0) \cdot H(X | Y=0) + P(Y=1) \cdot H(X | Y=1)\\
&= \frac{1}{2} \cdot 0 + \frac12 \cdot 1 = \frac12 \\
H(Y|X) &= P(X=0) \cdot H(Y | X=0) + P(X=1) \cdot H(Y | X=1)\\
&= \frac{3}{4} \cdot h\left(\frac{1}{3} \right) + \frac{1}{4} \cdot 0 \approx 0.69
\end{align*}
% We also could have computed $H(X|Y)$ and $H(Y|X)$ directly through the definition of conditional entropy.
Note that for this specific distribution, learning the outcome $Y=1$ increases the uncertainty about $X$, $H(X|Y=1) > H(X)$, but on average, we always have $H(X|Y) \leq H(X)$. It is important to remember that Lemma~\ref{lemma:noincrease} only holds on average, not for specific values of $Y$. Note also that in this example, $H(X|Y) \neq H(Y|X)$.

It is not a coincidence that the joint entropy $H(X,Y)$ in the example above is equal to $H(X|Y)+H(Y)$ and $H(Y|X)+H(X)$. One can prove this chain rule in general:

\begin{align*}
H(X,Y) &= \underset{\substack{x \in \supp(X)\\y \in \supp(Y)}}{\sum} -\log(P(X=x,Y=y)) \times P(X=x, Y=y) \\
\begin{split}
&= \underset{\substack{x \in \supp(X)\\ y \in \supp(Y)}}{\sum} -\log(P(X=x \mid Y=y)) \times P(X=x,Y=y) \\
&\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times \sum_{x \in \supp(X)} P(X=x,Y=y)
\end{split} \\
\begin{split}
&=\sum_{y \in \supp(Y)} P(Y=y) \times \sum_{x \in \supp(X)} -\log(P(X=x \mid Y=y)) \times P(X=x \mid Y=y) \\ &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times P(Y=y)
\end{split} \\
&= H(X | Y) + H(Y) \; .
\end{align*}

\begin{Exercise}
Prove that $ H(X,Y | Z) = H(X | Z) + H(Y | Z) $ if $ X \bot Y \mid Z $.
\end{Exercise}
As a corollary, we get that $H(X,Y)=H(X)+H(Y)$ for independent random variables $X$ and $Y$. More generally, the entropy of $n$ independent random variables is $H(X_1^n) = \sum_{i=1}^n H(X_i)$.
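The example and the chain rule are also easy to check numerically. Here is a sketch (ours, reusing \texttt{H} from above) that works directly on the joint probability table:

\begin{verbatim}
# Sketch: joint and conditional entropy from a joint table.
# Rows index x in {0,1}, columns index y in {0,1}.
P <- matrix(c(1/2, 0,      # column y = 0
              1/4, 1/4),   # column y = 1
            nrow = 2)
Hxy  <- H(as.vector(P))    # joint entropy H(X,Y) = 1.5
Py   <- colSums(P)         # marginal distribution of Y
HxGy <- sum(Py * apply(P, 2, function(col) H(col / sum(col))))  # H(X|Y) = 0.5
all.equal(Hxy, HxGy + H(Py))   # chain rule: H(X,Y) = H(X|Y) + H(Y)
\end{verbatim}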
\section{An Information-Theoretic View on EM}
Now that we have seen some information-theoretic concepts, you may be happy to hear that there is an information-theoretic interpretation
of EM. This interpretation helps us to get a better intuition for the algorithm. To formulate that interpretation we need
one more concept, however.

\begin{Definition}[Relative Entropy]
The relative entropy of RVs \\ $ X,Y $ with distributions $P_X, P_Y$ and $\supp(X) \subseteq \supp(Y) $ is defined as
$$ D(P_X||P_Y) := \sum_{x \in \supp(X)} P(X=x) \log \frac{P(X=x)}{P(Y=x)} \ . $$
If $ P(Y=x) = 0 $ for any $ x \in \supp(X) $ we define $ D(P_X||P_Y) = \infty $. As with entropy, we often abbreviate $D(P_X||P_Y)$ with $D(X||Y)$.
\end{Definition}

The relative entropy is commonly known as \textbf{Kullback-Leibler (KL)} divergence. Intuitively,
it gives a measure of how ``far away'' $ P_{X} $ is from $ P_{Y} $: it is the average number of extra bits we need when encoding outcomes of $ X $
with a code that is optimal for $ P_{Y} $ instead of one that is optimal for $ P_{X} $.
This quantity is small if both distributions place most
of their mass on the same outcomes. Since $ \log(1) = 0 $ the relative entropy is 0 if $ P_{X} = P_{Y} $.

It is worthwhile to point out the difference between relative and conditional entropy. Conditional entropy is the average entropy of $ X $ given that you
know what value $ Y $ takes on. In the case of relative entropy you do not know the value of $ Y $, only its distribution.

\begin{Exercise}
Show that $ D(X,Y||Y) = H(X | Y) $. Furthermore show that $ D(X,Y||Y) = H(X) $ if $ X\bot Y $.
\end{Exercise}


Let us start by remembering why we need EM. We have a model that defines a joint distribution
over observed ($ x $) and latent data ($ z $). Such a model generally looks as follows:
\begin{equation}
P(X=x, Z=z \mid \Theta = \theta) = P(X=x \mid Z=z, \Theta=\theta) P(Z=z \mid \Theta = \theta)
\end{equation}
where we have chosen a factorization that provides a separate term for a distribution over only the
latent data.

Recall that the goal of the EM algorithm is to iteratively increase the likelihood through consecutive
updates of parameter estimates. These updates are achieved through maximum-likelihood estimation based
on expected sufficient statistics. We are now going to show a) that EM computes a lower bound on the
marginal log-likelihood of the data in each iteration and b) that this lower bound becomes tight when the
expected sufficient statistics are taken with respect to the model posterior. The latter implies that
EM performs the optimal update in each iteration.

Let us start by expanding the data log-likelihood and then lower-bounding it.
\begin{align}
&\log(P(X=x \mid \Theta=\theta)) = \log\left(\sum_z P(X=x, Z=z \mid \Theta = \theta)\right) \\
&= \log\left(\sum_{z} Q(Z=z \mid \Phi=\phi)\frac{P(X=x, Z=z \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) \\
&\geq \sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(X=x, Z=z \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right)
\label{eq:ELBO1}
\end{align}
Here, we have used \href{https://en.wikipedia.org/wiki/Jensen\%27s_inequality}{Jensen's Inequality} to
derive the lower bound. Observe that the log is indeed a concave function.

We have also introduced
an auxiliary distribution $ Q $ over the latent variables with parameters $ \phi $.
For reasons that we will explain shortly,
this distribution is often called the \textbf{variational distribution} and its parameters the
\textbf{variational parameters}. The letter $ Q $ is slightly non-standard to denote distributions but
we are following conventions from the field of \textbf{variational inference} here.
In the next step, we factorise the model distribution in order to recover a KL divergence term between
the variational distribution and the model posterior over latent variables.
\begin{align}
&\sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(X=x, Z=z \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) \\
&= \sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(Z=z \mid X=x, \Theta = \theta)P(X=x \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) \\
&= \sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(Z=z \mid X=x, \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) + \log(P(X=x \mid \Theta=\theta)) \\
&= -D(Q||P) + \log(P(X=x \mid \Theta=\theta)) \label{eq:ELBO2}
\end{align}
Equation~\eqref{eq:ELBO2} gives us two insights. First, it quantifies the gap between the lower bound
and the actual log-likelihood of the data. This gap is equal to the KL divergence between the variational distribution
and the model posterior over latent variables. Second, since the KL divergence is never negative, the bound only becomes
tight when $ Q=P $. But this is exactly what is happening in the E-step! The E-step sets $ Q=P $ and
then computes expectations under that distribution (see Equation~\eqref{eq:ELBO1}). Thus, the E-step increases
the lower bound on the marginal log-likelihood.

Looking back at Equation~\eqref{eq:ELBO1}, we also see that the M-step increases the lower bound because
it maximises $ \E_{Q}\left[\log P(X=x, Z=z\mid \Theta = \theta)\right] $. Both steps therefore
increase the lower bound on the log-likelihood, and we conclude that EM increases the data likelihood
in every iteration (or leaves it unchanged at worst).

We will finish with a quick remark on variational inference. EM is a special case of variational inference.
Variational inference is any inference procedure which uses an auxiliary distribution $ Q $ to compute
a lower bound on the likelihood. In the general setting, the auxiliary distribution can be different from the
model posterior. This means that the bound may never become tight. However, in models in which the exact posterior
is hard (read: impossible) to compute, using a non-tight lower bound instead can be incredibly useful!

The reason this inference procedure is called \textit{variational} is because it is based on the
\href{https://en.wikipedia.org/wiki/Calculus_of_variations}{calculus of variations}. This works mostly
like normal calculus except that standard operations like differentiation are done with respect to functions
instead of variables.
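To see the bound and its gap in action, here is a toy sketch of ours for a single observation under a made-up two-component mixture. The gap equals $ D(Q||P) $ and vanishes exactly when $ Q $ is the model posterior; we use natural logarithms here, which does not affect where the bound is tight.

\begin{verbatim}
# Sketch: the lower bound and its gap for one observation x, latent z in {1,2}.
prior <- c(0.3, 0.7)          # P(Z = z | theta), made-up numbers
lik   <- c(0.05, 0.20)        # P(X = x | Z = z, theta) for the observed x
joint <- prior * lik          # P(X = x, Z = z | theta)
loglik <- log(sum(joint))     # marginal log-likelihood
post   <- joint / sum(joint)  # model posterior P(Z = z | X = x, theta)
elbo <- function(Q) sum(Q * log(joint / Q))
loglik - elbo(c(0.5, 0.5))    # positive gap: D(Q || posterior)
loglik - elbo(post)           # zero: the bound is tight at Q = posterior
\end{verbatim}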
%Naively, we could take the expectation with respect to any distribution
%over latent values. Obviously, we would like to find the best one, i.e. the one that is closest to the
%actual posterior. We can formalize this by introducing an auxiliary distribution\footnote{We follow
%standard notation here by denoting the auxiliary distribution $ Q $ instead of $ P $. Also, the
%parameter variable is chosen so as to distinguish it from the parameter variable of our model.}
%$ Q(z\mid\Phi=\phi) $ under
%which we compute the expected sufficient statistics. We want to find the auxiliary distribution that
%is closest to actual posterior $ P_{Z\midX=x,\Theta=\theta} $. We measure closeness in an information-theoretic
%sense using KL-divergence. Formally, our goal is to find
%\begin{equation}
%Q^{*}_{Z\mid\Phi=\phi} = \underset{Q_{Z\mid\Phi=\phi}}{\mbox{arg min}}~D\left( Q_{Z\mid\Phi=\phi} || P_{Z \mid X=x,\Theta=\theta} \right) \ .
%\end{equation}



\section*{Further Material}

At the ILLC, there is a whole course about information theory, \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/}{currently taught by Christian Schaffner}. David MacKay also offers \href{http://www.inference.phy.cam.ac.uk/itprnn/book.pdf}{a free book on the subject}. Finally,
Coursera also offers \href{https://www.coursera.org/course/informationtheory}{an online course on information theory}.

The information-theoretic formulation of EM was pioneered in this \href{http://www.cs.toronto.edu/~fritz/absps/emk.pdf}{paper}. A very recent and intelligible
\href{https://arxiv.org/abs/1601.00670}{tutorial on variational inference} can be found on the arXiv.

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter7"
%%% End:
--------------------------------------------------------------------------------
/chapter2/chapter2_forInclude.tex:
--------------------------------------------------------------------------------
\chapter{Axiomatic Probability Theory}

\section{Axioms of Probability}
In the previous chapter, we have introduced sample spaces and event spaces. We would like to be able
to express that certain events are more (or less) likely than others.
Therefore, we are going to measure the probability of events in a mathematically precise sense.

\begin{Definition}[Finite Measure]\label{axioms}
A \emph{finite measure} is a function $ \mu: \mathcal{S} \rightarrow \mathbb{R} : S \mapsto \mu (S) $
that maps elements
from a countable set of sets $ \mathcal{S} $ (formally a \href{http://en.wikipedia.org/wiki/Sigma-algebra}
{$ \sigma $-algebra}) to real numbers. Such a measure has the following properties:
\begin{enumerate}
\item $ \mu(S) \in \mathbb{R} $ for $ S \in \mathcal{S} \, ,$
\item $ \mu\left( \underset{i = 1}{\overset{\infty}{\bigcup}} S_{i} \right)
= \underset{i = 1}{\overset{\infty}{\sum}} \mu \left( S_{i} \right) $ for disjoint sets $S_1, S_2, \ldots$ \, . \label{countableAdditivty}
\end{enumerate}
\end{Definition}

Notice that we are restricting ourselves to finite measures here, i.e. the value of the measure can never
be infinite. This restriction makes sense as probabilities are finite as well. Property \ref{countableAdditivty} is
known as \emph{countable additivity}.

Let
$ S = \underset{i=1}{\overset{n}{\bigcup}} S_{i} $ for some positive natural number $ n $ and disjoint
sets $ S_{i} $, and set $ S_{j} =
\emptyset $ for $ j > n $. By
countable additivity, we then get
\begin{equation}
\mu(S) = \mu(\underset{i=1}{\overset{\infty}{\bigcup}} S_{i}) = \mu \left( \underset{i=1}{\overset{n}{\bigcup}} S_{i} \cup
\underset{j=n+1}{\overset{\infty}{\bigcup}} \emptyset \right)
= \underset{i=1}{\overset{n}{\sum}} \mu ( S_{i} )
+ \underset{j=n+1}{\overset{\infty}{\sum}} \mu ({\emptyset})
\end{equation}

Since $ \mu(S) $ is finite, the infinite sum $ \underset{j=n+1}{\overset{\infty}{\sum}} \mu(\emptyset) $ must be finite as well. This is only possible if
$ \mu(\emptyset) = 0 $, and consequently $ \mu(S) = \underset{i=1}{\overset{n}{\sum}} \mu (S_{i}) $. We conclude that the empty set has measure $ 0 $ for all measures.
Furthermore, we also see from the above
derivation that countable additivity implies finite additivity, i.e.
$ \mu(S) = \underset{i=1}{\overset{n}{\sum}} \mu(S_{i}) $ for finite positive $ n $ (again, this only
holds if the $ S_{i} $ are disjoint).

Examples of measures are not hard to find. In fact, we have already seen a measure,
namely the function $ |\cdot| $ that counts the elements of a set (check yourself that it really is a
measure). Another measure is the Dirac-measure that is related to the characteristic
function of a set. While the characteristic function tells you whether any object belongs to a given set,
the Dirac-measure tells you whether any set contains a given object. Let us call the object in question
$ a $. Then its Dirac measure is defined by $ \delta_{a}(S) = 1 $ iff $ a \in S $ and $ \delta_{a}(S) = 0 $ otherwise (check yourself that the Dirac-measure indeed is a measure).

Apart from these examples, there is one measure, however, that is going to be the star of the rest of this
script, namely the \textbf{probability measure}.

\begin{Definition}[Probability measure]\label{def:probmeasure}
A probability measure \\ $ \mathbb{P}: \mathcal{A} \rightarrow \mathbb{R}, A \mapsto \mathbb{P}(A) $
on an event space $ \mathcal{A} $ associated with a sample space $ \Omega $ has the
following properties:
\begin{enumerate}
\item $ \mathbb{P}(A) \geq 0 $ for all $ A \in \mathcal{A} \,$,
\item $ \mathbb{P}\left( \underset{i = 1}{\overset{\infty}{\bigcup}} A_{i} \right)
= \underset{i = 1}{\overset{\infty}{\sum}} \mathbb{P} \left( A_{i} \right) \,$ for disjoint events $A_1,A_2,\ldots$ \, , \label{union}
\item $ \mathbb{P}(\Omega) = 1 \,$. \label{unity}
\end{enumerate}
\end{Definition}

Notice that we only added Property~\ref{unity} to the general definition of a measure. Hence, a
\textbf{probability} (the value that the probability measure assigns to an event) will always lie in the real interval
$[0,1]$. The above three axioms for a probability measure are often referred to as \emph{axioms of probability}
or \emph{Kolmogorov axioms} after their inventor \href{https://en.wikipedia.org/wiki/Andrey_Kolmogorov}{Andrey
Kolmogorov}.

We have already discussed uniform probabilities in the previous chapter. We can now formally explain
what we meant by that. The uniform probability measure $ \mathbb{P} $ has the property that
$ \mathbb{P}(\{\omega\}) = \frac{1}{|\Omega|} $ for all $ \omega \in \Omega $. At this point, the
distinction between sample and event spaces becomes important. We cannot measure the elements of a
sample space, only the elements of an event space! Recall our convention that we will always assume
that $ \mathcal{A} = \mathcal{P}(\Omega) $ which obviously contains a singleton for each element in
$ \Omega $. Using this assumption, the uniform probability measure is indeed well-defined. Whenever we talk about
\textit{uniform probability}, we either mean the uniform probability measure or, more often, the real
value $ \frac{1}{|\Omega|} $ that this measure assigns to every singleton event.

In order to create a tight relationship between a sample space, an event space and a probability measure,
we introduce the concept of a \textbf{probability space}. Probability spaces are also known as
\textbf{(probabilistic) experiments}.
\begin{Definition}[Probability space] \label{def:ProbabilitySpace}
A probability space is a triple $ (\Omega, \mathcal{A}, \mathbb{P}) $, consisting of a sample space $ \Omega $,
an event space $ \mathcal{A} $ and a probability measure $ \mathbb{P} $.
\end{Definition}

If we roll a die, for example, we have the sample space $ \Omega = \{1,2,3,4,5,6\} $ and, by
convention, the event space $ \mathcal{A} = \mathcal{P}(\Omega) $. If we add the uniform probability measure,
we have constructed a \emph{probabilistic experiment}. We can use it to answer a couple of questions. For example, we
might wonder about the probability of obtaining an even number. By Property~\ref{union} of our definition, this
probability is given by
\begin{align}
\mathbb{P}(\{2,4,6\}) &= \mathbb{P}(\{2\} \cup \{4\} \cup \{6\}) \\
&= \mathbb{P}(\{2\}) + \mathbb{P}(\{4\})
+ \mathbb{P}(\{6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}
\end{align}

Notice that this calculation is rather cumbersome. After all, we might just have evaluated
$ \mathbb{P}(\{2,4,6\}) $ directly. This is possible because by convention we have $ \mathcal{A} = \mathcal{P}(\Omega) $ which certainly contains $ \{2,4,6\} $.
Since the probability measure is defined on $ \mathcal{A} $, it must map $ \{2,4,6\} $ to some real number.
However, the above calculation points to an interesting fact. In order
to fully specify a probability measure, it suffices to specify the measure on the singleton sets of the
event space. By countable additivity, this assignment already specifies the measure on the entire event space, as we can
construct any event as a countable union of singletons.

It is important to point out that we just chose the uniform probability measure as the one that seems ``natural'' for
a die roll. However, nobody is forcing us to do so. In fact, Definition~\ref{def:ProbabilitySpace} allows us to impose arbitrary probability measures.

\begin{Exercise}
Let us consider a rigged die. Take $ (\Omega, \mathcal{A}, \mathbb{P}) $ with $ \Omega $ and $ \mathcal{A} = \mathcal{P}(\Omega) $
as in the uniform die-roll example before, but use the
probability measure specified by \\ $ \mathbb{P} = \{(\{1\},0), (\{2\}, \frac{1}{12}), (\{3\}, \frac{1}{6}), (\{4\}, \frac{1}{6}), (\{5\}, \frac{1}{3}),
(\{6\},\frac{1}{4}) \} $.
\begin{enumerate}
\item Verify that $ \mathbb{P} $ is indeed a probability measure.
\item Compute the probability of obtaining a number strictly smaller than $ 5 $ in this experiment.
\end{enumerate}
\end{Exercise}
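Since specifying a measure on singletons determines it everywhere, a finite probability space is easy to play with on a computer. Here is a small R sketch of ours mirroring the fair-die example above (the function name \texttt{P} is our choice):

\begin{verbatim}
# Sketch: the fair-die probability space; the measure is specified on
# singletons and extended to arbitrary events by additivity.
Omega <- 1:6
p.singleton <- rep(1/6, 6)                    # uniform measure on outcomes
P <- function(A) sum(p.singleton[Omega %in% A])
P(c(2, 4, 6))                                 # probability of an even roll: 0.5
\end{verbatim}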
\begin{figure}
\center
\begin{subfigure}{.4\textwidth}
\begin{venndiagram2sets}[labelA=$ E_{1} $, labelB= $ E_{2} $, labelAB= $ E_{3} $, shade=red!40]
\fillACapB
\end{venndiagram2sets}
\caption{}
\label{Venn2}
\end{subfigure}
~
\begin{subfigure}{.4\textwidth}
\begin{venndiagram3sets}[labelA=$ E_{1} $, labelB=$ E_{2} $, labelC=$ E_{3} $, labelOnlyAB=$ - $,
labelOnlyBC=$ - $, labelOnlyAC=$ - $, labelABC=$ + $, shade=red!40]
\fillACapB
\fillACapC
\fillBCapC
\end{venndiagram3sets}
\caption{}
\label{Venn3}
\end{subfigure}
\caption{\ref{Venn2}: Two overlapping events $ E_{1} $ and $ E_{2} $. Their intersection
(the coloured region) gets counted twice if we add up their probabilities. \\
\ref{Venn3}: Venn diagram with 3 events. First we deduct
$ E_{1} \cap E_{2}, E_{1} \cap E_{3}, E_{2} \cap E_{3} $ in order to prevent double counting and then
we add in $ E_{1} \cap E_{2} \cap E_{3} $. Deductions and additions are indicated by minus and plus signs.}
\end{figure}

We have seen how to compute probabilities of events if they can be formed as unions of \textit{disjoint}
events. The natural question to ask is what to do if we want to compute the probability of the \emph{union
of non-disjoint events}. In order to reason about this problem, we first take a step back and think about the
outcomes of our probability space. We know that each event with non-zero probability contains at least one outcome (since
$ \mathbb{P}(\emptyset) = 0 $, we can safely ignore the empty event). Let us assume that we take the union of events
$ E_{1} $ and $ E_{2} $ with $ E_{1} \cap E_{2} = E_{3} \not = \emptyset $. This means that the outcomes
in $ E_{3} $ are contained in both $ E_{1} $ and $ E_{2} $. This situation is illustrated in Figure~\ref{Venn2}. If we were to simply add up the probabilities of $ E_{1} $ and $ E_{2} $, we
would effectively count the contribution of the outcomes in $ E_{3} $ twice. We would hence
get an overestimate of the actual value of $ \mathbb{P}(E_{1} \cup E_{2}) $.
In order to avoid this we will need to subtract the probability of $ E_{3} $ one time. This leads us to the following formulation:
\begin{equation}
\mathbb{P}(E_{1} \cup E_{2}) = \mathbb{P}(E_{1}) + \mathbb{P}(E_{2}) - \mathbb{P}(E_{1} \cap E_{2})
\end{equation}

Notice that this is fully general in that it is true even if $ E_{1} $ and $ E_{2} $ are disjoint. In that
case, their intersection would be empty. We can generalize this principle to the (countable) union of
an arbitrary number of events. This will give us a principled way of calculating the probability of any
union of events. This calculation technique is known as the \textbf{Inclusion-Exclusion principle}.

\newpage
\begin{Theorem}[Inclusion-Exclusion principle]
The probability of any union of finitely many events $ E_{1}, \ldots, E_{n} $ can be computed as
\begin{equation}\label{eq:incexc}
\mathbb{P} \left( \underset{i=1}{\overset{n}{\bigcup}} E_{i} \right)
= \underset{i=1}{\overset{n}{\sum}} (-1)^{i+1} \left( \underset{j_{1} < \ldots < j_{i}}{\sum} \mathbb{P}\left( E_{j_{1}} \cap \ldots \cap E_{j_{i}} \right) \right)
\end{equation}
\end{Theorem}
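A quick sanity check of the two-event case on the fair-die space (a sketch reusing \texttt{P} and \texttt{Omega} from above; the two events are our own choices):

\begin{verbatim}
# Sketch: inclusion-exclusion for two events on the die space.
E1 <- c(1, 2, 3)   # "at most three"
E2 <- c(2, 4, 6)   # "even"
all.equal(P(union(E1, E2)),
          P(E1) + P(E2) - P(intersect(E1, E2)))   # TRUE
\end{verbatim}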
\section{Conditional Probabilities}

\begin{Definition}[Conditional probability]
The conditional probability of an event $ E_{i} $ given an event $ E_{j} $ with $ \mathbb{P}(E_{j}) > 0 $
is defined as $$ \mathbb{P}(E_{i}|E_{j}) := \dfrac{\mathbb{P}(E_{i} \cap E_{j})}{\mathbb{P}(E_{j})} $$
\label{condProb}
\end{Definition}

Before we get into the math of conditional probabilities, let us try to understand the meaning of this concept.
When we are computing the conditional probability of an event $ E_{i} $, we re-scale with the
probability of the conditioning event $ E_{j} $. If $ E_{j} \not = \Omega $, $ \mathbb{P}(E_{j}) $
might be smaller than 1. Thus, this rescaling assumes \textit{that $ E_{j} $ has already occurred}. In other
words, we are excluding all outcomes that are not in $ E_{j} $ from further consideration (even though they
may be in $ E_{i} $). The interpretation of conditional probabilities is that they are the probabilities
of events assuming that another event has already occurred.

Another interpretation is that when working with a conditional probability measure, we are in fact working
in a new probability space, where $ \Omega_{new} = E_{j} $, i.e.\ our new sample space is the conditioning event.
Notice that this also means that our probability measure will change and become the
measure from Definition~\ref{condProb}.

Here comes the cool part: although we have introduced a new concept, all the properties of probability
measures that we know by now will seamlessly carry over to conditional probabilities, if we can prove
that the conditional probability measure is a probability measure according to our axioms.

\begin{Exercise}
Use the axioms from Definition~\ref{def:probmeasure} to prove that $ \mathbb{P}(\cdot|E_{j}) $ is a probability measure.
\end{Exercise}

We will make use of conditional probabilities quite a lot in this course. We will later see a way in which
they help us to decompose joint probability distributions. For now, we are going to focus on the fact that
they are also related to the idea of independence of events.

\begin{Definition}[Independence]
Two events $ E_{1}, E_{2} $ are said to be independent if
$$ \mathbb{P}(E_{1} \cap E_{2}) = \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}) $$
Independence of two events is denoted as $ E_{1} \bot E_{2} $.
\end{Definition}

This definition relates to conditional probabilities in the following way: assume that $ E_{1} \bot E_{2} $
and that $ \mathbb{P}(E_{2}) > 0 $.
Then we get
\begin{equation}
\mathbb{P}(E_{1}|E_{2}) = \dfrac{\mathbb{P}(E_{1} \cap E_{2})}{\mathbb{P}(E_{2})}
= \dfrac{\mathbb{P}(E_{1}) \times
\mathbb{P}(E_{2})}{\mathbb{P}(E_{2})} = \mathbb{P}(E_{1}) \, .
\end{equation}
Hence, independence of two events $ E_{1} \bot E_{2}$ is equivalent to $\mathbb{P}(E_{1}|E_{2}) = \mathbb{P}(E_{1}) $ (provided that $ \mathbb{P}(E_{2}) > 0 $).

\begin{Exercise}
Prove that $E_1 \bot E_2$ is also equivalent to $\mathbb{P}(E_{2}|E_{1}) = \mathbb{P}(E_{2}) $.
\end{Exercise}

Independence will prove to be a useful concept in later chapters. More precisely, we will often
just \textit{assume} that two events (or random variables -- see the next chapter) are independent. Although
such an independence assumption might not always hold in practice, it will allow us to formulate much simpler probabilistic models.
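On the fair-die space from before, independence is just as easy to check numerically (a sketch reusing \texttt{P}; the events are again our own choices):

\begin{verbatim}
# Sketch: "even" and "at most four" are independent on the fair-die space.
E1 <- c(2, 4, 6)       # even
E2 <- c(1, 2, 3, 4)    # at most four
all.equal(P(intersect(E1, E2)),
          P(E1) * P(E2))   # TRUE: 1/3 == 1/2 * 2/3
\end{verbatim}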
\section{A Remark on the Interpretation \\ of Probabilities$^{*}$}

This concludes our introduction of axiomatic probability theory. We know that a probability is
a real number in $ [0,1] $. For all that we are going to do in this course (and in most follow-up courses)
this is fully sufficient. However, some of you may wonder what a ``natural'' interpretation of probabilities
would be. There are two dominant views on that. One postulates that if we were to take A LOT (read: almost
infinitely many) samples from a sample space, the probability of an event is its frequency amongst these
samples divided by the total number of samples taken. For those of you who know limits, this principle can be
formalized as $ \mathbb{P}(E) = \lim_{n \rightarrow \infty} \dfrac{\#E}{n} $. This view
is known as the \emph{frequentist view}.

The second view postulates that probabilities are an expression of degrees of belief. Basically,
if you assign $ \mathbb{P}(E) $ to an event $ E $, then $ \mathbb{P}(E) $ is the strength of your personal belief that
$ E $ will occur. This latter view is known as the \emph{Bayesian view}.

Which conception of probability you choose is a philosophical matter and does not really impact the math.
That is why we will not care about this issue in this course. However, it is useful to at least be aware
of these two views (if only to appear knowledgeable in a conversation you may have with your philosopher
friends).


\section{The Binomial Theorem}
The binomial theorem from Equation~\ref{binomTheorem} is actually not that hard to prove. We will do so by
induction. As a base case we choose $ m = 0 $. Then the equality is easy to see.
\begin{equation}
(p + q)^{0} = 1 = \binom{0}{0}p^{0}q^{0}
\end{equation}

Next, we assume that the theorem holds for $ m = n $. What we want to show is that it also holds for
$ m = n + 1 $. We achieve this by algebraic manipulation.

\begin{align}
(p + q)^{n+1} &= (p + q)^{n} \times (p + q) \\
&= (p+q)^{n}p + (p+q)^{n}q \\
&= p\underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n-i} + q\underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n-i} \label{inductiveHyp} \\
&= \underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i+1}q^{n-i} + \underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n+1-i} \\
&= \underset{j=1}{\overset{n+1}{\sum}} \binom{n}{j-1} p^{j}q^{n+1-j} + \underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n+1-i} \label{variableSwitch} \\
&= \binom{n}{n} p^{n+1}q^{(n+1)-(n+1)} + \underset{k=1}{\overset{n}{\sum}} \binom{n}{k-1} p^{k}q^{n+1-k} \nonumber \\
&\qquad + \binom{n}{0} p^{0}q^{n+1} + \underset{k=1}{\overset{n}{\sum}} \binom{n}{k} p^{k}q^{n+1-k} \label{pullOut} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \left(\binom{n}{k} + \binom{n}{k-1}\right) p^{k}q^{n+1-k} \label{collapseSums} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \left(\dfrac{n!}{k!(n-k)!} + \dfrac{n!}{(k-1)!(n-k+1)!}\right) p^{k}q^{n+1-k} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \left(\dfrac{n!(n+1-k)}{k!(n+1-k)!} + \dfrac{n!k}{k!(n-k+1)!}\right) p^{k}q^{n+1-k} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \dfrac{n!(n+1)}{k!(n+1-k)!} p^{k}q^{n+1-k} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \binom{n+1}{k} p^{k}q^{n+1-k} \\
&= \underset{k=0}{\overset{n+1}{\sum}} \binom{n+1}{k} p^{k}q^{n+1-k}
\end{align}

Let us clarify some parts of the proof. We use the induction hypothesis to expand the terms in Line~\ref{inductiveHyp}.
In Line~\ref{variableSwitch}, we switch the variable $ i $ in the first summand to $ j = i+1 $. The
reason why we do this is because we want to achieve congruence with the exponents of the second summand. In the following line we
uniformly name the variables $ k $. Since $ k $ has to run over a common range, we chop off the ends of both sums that stick out. In the first
sum of line \ref{variableSwitch} that is the summand that corresponds to $ j=n+1 $ and in the second sum it is the summand that corresponds
to $ i = 0 $. We pull out both of them in line \ref{pullOut} and then collapse the sums in line \ref{collapseSums}. The following lines
are basically just an exercise in manipulating fractions. The jump from the second-to-last to the last line is allowed because
$$ q^{n+1} = \binom{n+1}{0}p^{0}q^{n+1-0} $$ and $$ p^{n+1} = \binom{n+1}{n+1}p^{n+1}q^{(n+1)-(n+1)} $$
which are exactly the quantities that we need to add to make our sum reach from $ 0 $ to $ n+1 $. This completes the proof.
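For the sceptical reader, the identity is also easy to confirm numerically (a sketch; the values of \texttt{p}, \texttt{q} and \texttt{m} are arbitrary choices of ours):

\begin{verbatim}
# Sketch: numerical check of the binomial theorem.
p <- 0.3; q <- 1.7; m <- 5
lhs <- (p + q)^m
rhs <- sum(choose(m, 0:m) * p^(0:m) * q^(m - (0:m)))
all.equal(lhs, rhs)   # TRUE
\end{verbatim}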
\section*{Further Reading}
A very quick and dirty introduction to measure theory is provided by Maya Gupta and can be found
\href{https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2006-0008.pdf}{here}. If you are
looking for something more extensive that also motivates event spaces and the like you may want to
take a look at \href{http://www.stat.ncsu.edu/people/fuentes/courses/st778/lectures/ross}{this script}
by Ross Leadbetter and Stamatis Cambanis (which has also been
published as a book).




%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter2"
%%% End:
--------------------------------------------------------------------------------