├── chapter7 ├── cache │ ├── __packages │ ├── entropy_0768bb85c32cc280dca5323ae92ed8b1.rdb │ ├── entropy_0768bb85c32cc280dca5323ae92ed8b1.rdx │ ├── entropy_0768bb85c32cc280dca5323ae92ed8b1.RData │ ├── binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdb │ ├── binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdx │ └── binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.RData ├── chapter7.pdf ├── figure │ ├── entropy-1.pdf │ └── binaryEntropy-1.pdf ├── chapter7.Rnw ├── chapter7_forInclude.Rnw ├── chapter7_forInclude.tex └── chapter7.tex ├── chapter6 ├── cache │ ├── __packages │ ├── binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdb │ ├── binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdx │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdb │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdx │ ├── binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.RData │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.RData │ ├── binomPosteriors_caabe6c3a8215386b680b66a783c3a55.RData │ ├── binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdb │ ├── binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdx │ ├── mixturePosterior_7b51293945933662ab25491259906a02.rdb │ ├── mixturePosterior_7b51293945933662ab25491259906a02.rdx │ └── mixturePosterior_7b51293945933662ab25491259906a02.RData ├── chapter6.pdf ├── makePlots.R ├── makePlots.R~ ├── chapter6.Rnw └── tikzlibrarybayesnet.code.tex ├── fullscript ├── cache │ ├── __packages │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdb │ ├── computation_08c23b2f0871f6eae4e9010b10244f2a.rdx │ └── computation_08c23b2f0871f6eae4e9010b10244f2a.RData ├── BasicProbabilityAndStatistics.pdf ├── BasicProbabilityAndStatistics.Rnw └── tikzlibrarybayesnet.code.tex ├── multivariateGaussian ├── cache │ ├── __packages │ ├── 3dgauss_9c4da507c5196242e80b280153e7c995.rdb │ ├── 3dgauss_9c4da507c5196242e80b280153e7c995.rdx │ ├── 3dgauss_9c4da507c5196242e80b280153e7c995.RData │ ├── multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdb │ ├── multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdx │ └── multiGauss_cc3916828b40ec191d5b5bdee9808c87.RData ├── figures │ ├── 3dgauss-1.pdf │ ├── uniGauss-1.pdf │ └── multiGauss-1.pdf ├── multivariateGaussian.pdf ├── multivariateGaussian.Rnw └── multivariateGaussian_forInclude.tex ├── chapter3 ├── cdf.png ├── chapter3.pdf ├── scaledRV.png ├── histogram.png ├── distribution.png ├── chapter3.tex └── makePlots.R ├── chapter1 ├── chapter1.pdf ├── chapter1.tex └── chapter1_forInclude.tex ├── chapter2 ├── chapter2.pdf ├── chapter2.tex └── chapter2_forInclude.tex ├── chapter4 ├── chapter4.pdf ├── chapter4.tex └── chapter4_forInclude.tex ├── chapter5 ├── chapter5.pdf ├── dense_likelihood.png ├── sparse_likelihood.png ├── chapter5.tex ├── makePlots.R └── BernoulliData.txt ├── additionalMaterial ├── sufficient-statistics.pdf └── sufficient-statistics.tex ├── README.md ├── contributors └── contributors.tex └── .gitignore /chapter7/cache/__packages: -------------------------------------------------------------------------------- 1 | base 2 | -------------------------------------------------------------------------------- /chapter6/cache/__packages: -------------------------------------------------------------------------------- 1 | base 2 | knitr 3 | -------------------------------------------------------------------------------- /fullscript/cache/__packages: -------------------------------------------------------------------------------- 1 | base 2 | knitr 3 | -------------------------------------------------------------------------------- /multivariateGaussian/cache/__packages: 
-------------------------------------------------------------------------------- 1 | base 2 | knitr 3 | mvtnorm 4 | -------------------------------------------------------------------------------- /chapter3/cdf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/cdf.png -------------------------------------------------------------------------------- /chapter1/chapter1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter1/chapter1.pdf -------------------------------------------------------------------------------- /chapter2/chapter2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter2/chapter2.pdf -------------------------------------------------------------------------------- /chapter3/chapter3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/chapter3.pdf -------------------------------------------------------------------------------- /chapter3/scaledRV.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/scaledRV.png -------------------------------------------------------------------------------- /chapter4/chapter4.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter4/chapter4.pdf -------------------------------------------------------------------------------- /chapter5/chapter5.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter5/chapter5.pdf -------------------------------------------------------------------------------- /chapter6/chapter6.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/chapter6.pdf -------------------------------------------------------------------------------- /chapter7/chapter7.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/chapter7.pdf -------------------------------------------------------------------------------- /chapter3/histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/histogram.png -------------------------------------------------------------------------------- /chapter3/distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter3/distribution.png -------------------------------------------------------------------------------- /chapter5/dense_likelihood.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter5/dense_likelihood.png -------------------------------------------------------------------------------- 
/chapter7/figure/entropy-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/figure/entropy-1.pdf -------------------------------------------------------------------------------- /chapter5/sparse_likelihood.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter5/sparse_likelihood.png -------------------------------------------------------------------------------- /chapter7/figure/binaryEntropy-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/figure/binaryEntropy-1.pdf -------------------------------------------------------------------------------- /additionalMaterial/sufficient-statistics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/additionalMaterial/sufficient-statistics.pdf -------------------------------------------------------------------------------- /fullscript/BasicProbabilityAndStatistics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/BasicProbabilityAndStatistics.pdf -------------------------------------------------------------------------------- /multivariateGaussian/figures/3dgauss-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/figures/3dgauss-1.pdf -------------------------------------------------------------------------------- /multivariateGaussian/figures/uniGauss-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/figures/uniGauss-1.pdf -------------------------------------------------------------------------------- /multivariateGaussian/figures/multiGauss-1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/figures/multiGauss-1.pdf -------------------------------------------------------------------------------- /multivariateGaussian/multivariateGaussian.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/multivariateGaussian.pdf -------------------------------------------------------------------------------- /chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdb -------------------------------------------------------------------------------- /chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.rdx -------------------------------------------------------------------------------- 
/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/entropy_0768bb85c32cc280dca5323ae92ed8b1.RData -------------------------------------------------------------------------------- /chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdb -------------------------------------------------------------------------------- /chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.rdx -------------------------------------------------------------------------------- /chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb -------------------------------------------------------------------------------- /chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx -------------------------------------------------------------------------------- /chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomCounts_4279ad5f35d038d0bb6a2b1c58e7efae.RData -------------------------------------------------------------------------------- /chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData -------------------------------------------------------------------------------- /chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdb -------------------------------------------------------------------------------- /chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.rdx -------------------------------------------------------------------------------- /fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdb 
-------------------------------------------------------------------------------- /fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.rdx -------------------------------------------------------------------------------- /chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.RData -------------------------------------------------------------------------------- /chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdb -------------------------------------------------------------------------------- /chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/binomPosteriors_caabe6c3a8215386b680b66a783c3a55.rdx -------------------------------------------------------------------------------- /chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdb -------------------------------------------------------------------------------- /chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.rdx -------------------------------------------------------------------------------- /chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter7/cache/binaryEntropy_b3b89a0cd80be7f7ab2b002be60728e7.RData -------------------------------------------------------------------------------- /fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/fullscript/cache/computation_08c23b2f0871f6eae4e9010b10244f2a.RData -------------------------------------------------------------------------------- /chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/chapter6/cache/mixturePosterior_7b51293945933662ab25491259906a02.RData -------------------------------------------------------------------------------- /multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdb: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdb -------------------------------------------------------------------------------- /multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.rdx -------------------------------------------------------------------------------- /multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/3dgauss_9c4da507c5196242e80b280153e7c995.RData -------------------------------------------------------------------------------- /multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdb -------------------------------------------------------------------------------- /multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.rdx -------------------------------------------------------------------------------- /multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BasicProbability/LectureNotes/HEAD/multivariateGaussian/cache/multiGauss_cc3916828b40ec191d5b5bdee9808c87.RData -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LectureNotes 2 | Lecture Notes (with exercises) for the Basic Probability course at the University of Amsterdam 3 | 4 | written in Aug 2015 by Philip Schulz, ILLC, UvA 5 | minor editing by Christian Schaffner, ILLC, UvA 6 | 7 | -------------------------------------------------------------------------------- /chapter6/makePlots.R: -------------------------------------------------------------------------------- 1 | # Script for creating the plots of chapter 6 2 | # Author: Philip Schulz 3 | 4 | x = seq(0,1,0.001) 5 | entropy = -log2(x)*x-log2(1-x)*(1-x) 6 | 7 | png("binaryEntropy.png", width=8, height=8, units="in", res=300) 8 | plot(x,entropy,type="l", xlab=expression(Theta), ylab = "H(X)") 9 | dev.off() 10 | -------------------------------------------------------------------------------- /chapter6/makePlots.R~: -------------------------------------------------------------------------------- 1 | # Script for creating the plots of chapter 6 2 | # Author: Philip Schulz 3 | 4 | x = seq(0,1,0.001) 5 | entropy = -log2(x)*x-log2(1-x)*(1-x) 6 | 7 | png("binaryEntropy.png", width=8, height=8, units="in", res=300) 8 | plot(x,entropy,type="l", xlab=expression("Theta"), ylab = "H(X)") 9 | dev.off() 10 |
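A note on the entropy computation in the two chapter6 scripts above: in R, the expression -log2(x)*x - log2(1-x)*(1-x) evaluates to NaN at the endpoints x = 0 and x = 1 (log2(0) is -Inf, and multiplying by 0 gives NaN), so the plotted curve silently drops those two points. A minimal guarded sketch, adopting the usual convention 0 * log2(0) = 0; this is an illustration only, not a file in the repository, and the helper name binary_entropy is made up:

binary_entropy <- function(p) {
  # return 0 at p = 0 and p = 1 instead of NaN
  ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
binary_entropy(0.5)      # 1: the maximum of one bit is attained at p = 0.5
binary_entropy(c(0, 1))  # 0 0: deterministic outcomes carry no entropy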
-------------------------------------------------------------------------------- /chapter1/chapter1.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed} 4 | \usepackage{hyperref} 5 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 6 | 7 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 8 | \newmdtheoremenv{Definition}[Theorem]{Definition} 9 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 10 | 11 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 12 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 13 | 14 | 15 | \title{Basic Probability} 16 | \date{} 17 | 18 | \begin{document} 19 | 20 | \include{chapter1_forInclude} 21 | 22 | \end{document} -------------------------------------------------------------------------------- /chapter4/chapter4.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 10 | \newmdtheoremenv{Definition}[Theorem]{Definition} 11 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 12 | 13 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 14 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 15 | 16 | \newcommand{\supp}{\operatorname{supp}} 17 | \newcommand{\E}{\mathbb{E}} 18 | 19 | \title{Basic Probability} 20 | \date{} 21 | 22 | \begin{document} 23 | 24 | \include{chapter4_forInclude} 25 | 26 | \end{document} -------------------------------------------------------------------------------- /contributors/contributors.tex: -------------------------------------------------------------------------------- 1 | \section*{Contributors} 2 | While we strive to continuously update this script and keep it at an acceptable level of grammaticality and mathematical correctness, it is unavoidable that some 3 | mistakes creep in. We are therefore utterly grateful to our contributors who have helped improve the script and would like to acknowledge their contributions here. 4 | \begin{itemize} 5 | \item Philip Michgelsen has corrected a mistake in the definition of event spaces in chapter 1. 6 | \item Bas Cornelissen has spotted a mistake in the statement of Markov's inequality. 7 | \item Jonathan Sippel has spotted a mistake in our example calculation 8 | of binary entropy. 9 | \item Thijs Baaijen has spotted a typo in Formula~\eqref{weatherRV}.
10 | \item Julia Turska has spotted various typos throughout the lecture notes (see issue 33 in the GitHub repository). 11 | \end{itemize} -------------------------------------------------------------------------------- /chapter5/chapter5.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Definition}{Definition}[chapter] 10 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 11 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 12 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 13 | 14 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 15 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 16 | 17 | \newcommand{\supp}{\operatorname{supp}} 18 | \newcommand{\E}{\mathbb{E}} 19 | \newcommand{\eps}{\varepsilon} 20 | 21 | 22 | \title{Basic Probability} 23 | \date{} 24 | 25 | \begin{document} 26 | 27 | \setcounter{chapter}{4} 28 | \input{chapter5_forInclude} 29 | 30 | \end{document} -------------------------------------------------------------------------------- /chapter6/chapter6.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem, tikz, bbm} 4 | \usepackage{nicefrac} 5 | \usetikzlibrary{bayesnet} 6 | 7 | \usepackage{hyperref} 8 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 9 | 10 | \newmdtheoremenv{Definition}{Definition}[chapter] 11 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 12 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 13 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 14 | 15 | \newcommand{\supp}{\operatorname{supp}} 16 | \newcommand{\E}{\mathbb{E}} 17 | \newcommand{\eps}{\varepsilon} 18 | 19 | \newcommand{\id}[1]{\mathbbm{1}\left(#1\right)} 20 | 21 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 22 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 23 | 24 | \title{Basic Probability} 25 | \date{} 26 | 27 | <>= 28 | library(knitr) 29 | @ 30 | 31 | \begin{document} 32 | 33 | \setcounter{chapter}{5} 34 | <>= 35 | @ 36 | 37 | \end{document} -------------------------------------------------------------------------------- /chapter7/chapter7.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Definition}{Definition}[chapter] 10 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 11 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 12 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 13 | 14 | \newcommand{\supp}{\operatorname{supp}} 15 | \newcommand{\E}{\mathbb{E}} 16 | \newcommand{\eps}{\varepsilon} 17 | 18 | \DeclareSymbolFont{extraup}{U}{zavm}{m}{n} 19 | \DeclareMathSymbol{\varheart}{\mathalpha}{extraup}{86} 20 | \DeclareMathSymbol{\vardiamond}{\mathalpha}{extraup}{87} 21 | 22 | 23 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 24 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 25 | 26 | \title{Basic Probability} 27
| \date{} 28 | 29 | \begin{document} 30 | 31 | \setcounter{chapter}{6} 32 | <>= 33 | @ 34 | 35 | \end{document} 36 | -------------------------------------------------------------------------------- /multivariateGaussian/multivariateGaussian.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem} 4 | \usepackage{nicefrac} 5 | 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | 9 | \newmdtheoremenv{Definition}{Definition}[chapter] 10 | \newmdtheoremenv{Exercise}[Definition]{Exercise} 11 | \newmdtheoremenv{Theorem}[Definition]{Theorem} 12 | \newmdtheoremenv{Lemma}[Definition]{Lemma} 13 | 14 | \newcommand{\supp}{\operatorname{supp}} 15 | \newcommand{\E}{\mathbb{E}} 16 | \newcommand{\eps}{\varepsilon} 17 | 18 | \newcommand{\N}[2]{\mathcal{N}\left( #1, #2 \right)} 19 | 20 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 21 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 22 | 23 | \title{Basic Probability} 24 | \date{} 25 | 26 | %% Load R packages 27 | <>= 28 | library(knitr) 29 | # for multivariate Gaussian 30 | library(mvtnorm) 31 | @ 32 | 33 | \begin{document} 34 | 35 | \setcounter{chapter}{5} 36 | <>= 37 | @ 38 | 39 | \end{document} -------------------------------------------------------------------------------- /chapter2/chapter2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption} 4 | % for Python code 5 | \usepackage[procnames]{listings} 6 | \definecolor{keywords}{RGB}{255,0,90} 7 | \definecolor{comments}{RGB}{0,0,113} 8 | \definecolor{red}{RGB}{160,0,0} 9 | \definecolor{green}{RGB}{0,150,0} 10 | 11 | \lstset{language=Python, 12 | basicstyle=\tt\small, 13 | keywordstyle=\color{keywords}, 14 | commentstyle=\color{comments}, 15 | stringstyle=\color{red}, 16 | showstringspaces=false, 17 | identifierstyle=\color{green}, 18 | procnamekeys={def,class}} 19 | % 20 | \usepackage{venndiagram} 21 | \usepackage{hyperref} 22 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 23 | 24 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 25 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 26 | 27 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 28 | \newmdtheoremenv{Definition}[Theorem]{Definition} 29 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 30 | 31 | 32 | \title{Basic Probability} 33 | \date{} 34 | 35 | \begin{document} 36 | 37 | \setcounter{chapter}{1} 38 | \include{chapter2_forInclude} 39 | 40 | \end{document} -------------------------------------------------------------------------------- /chapter3/chapter3.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,11pt,leqno]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption} 4 | \usepackage{nicefrac} 5 | \usepackage{graphicx} 6 | % for Python code 7 | \usepackage[procnames]{listings} 8 | \definecolor{keywords}{RGB}{255,0,90} 9 | \definecolor{comments}{RGB}{0,0,113} 10 | \definecolor{red}{RGB}{160,0,0} 11 | \definecolor{green}{RGB}{0,150,0} 12 | 13 | \lstset{language=Python, 14 | basicstyle=\tt\small, 15 | keywordstyle=\color{keywords}, 16 | commentstyle=\color{comments}, 17 | stringstyle=\color{red}, 18 | showstringspaces=false, 19 | 
identifierstyle=\color{green}, 20 | procnamekeys={def,class}} 21 | % 22 | \usepackage{venndiagram} 23 | \usepackage{hyperref} 24 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 25 | 26 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 27 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 28 | 29 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 30 | \newmdtheoremenv{Definition}[Theorem]{Definition} 31 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 32 | 33 | 34 | % \DeclareMathOperator{\supp}{supp} 35 | \newcommand{\supp}{\operatorname{supp}} 36 | \newcommand{\E}{\mathbb{E}} 37 | \newcommand{\var}{\operatorname{var}} 38 | 39 | 40 | \title{Basic Probability} 41 | \date{} 42 | 43 | \begin{document} 44 | 45 | \setcounter{chapter}{2} 46 | \include{chapter3_forInclude} 47 | 48 | \end{document} -------------------------------------------------------------------------------- /chapter3/makePlots.R: -------------------------------------------------------------------------------- 1 | # R script for creating plots in chapter 3 2 | # Run as "Rscript makePlots.R" 3 | # Author: Philip Schulz 4 | 5 | # create vectors and compute mean 6 | x = 1:8 7 | y = c(0.09, .21, .28, .23, .12, .04, .02, .01) 8 | mu = sum(x*y) 9 | 10 | # open stream to file 11 | png("distribution.png", width=8, height=8, units="in", res=300) 12 | 13 | # plot y against x 14 | plot(x,y,yaxp=c(0,0.35,7),xlab="Z",ylab="P(Z=z)", cex=1.5) 15 | # connect points and x-axis 16 | segments(x0=x, y0=rep(0,8), y1=y, lwd=5) 17 | # insert red lines 18 | abline(v=2,col="red",lwd=2) 19 | abline(v=5,col="red",lwd=2) 20 | # put arrow underneath x-axis to indicate mean 21 | arrows(mu,-0.03,mu,-.001,xpd=T) 22 | # close stream and save to file 23 | dev.off() 24 | 25 | # compute cdf 26 | z = cumsum(y) 27 | 28 | # open stream to file 29 | png("cdf.png", width=8, height=8, units="in", res=300) 30 | 31 | # plot z against x 32 | plot(x,z,ylab="F(z)",xlab="Z") 33 | # draw the horizontal steps of the cdf 34 | for (i in 1:(length(x)-1)) { lines(c(x[i],x[i]+1),c(z[i],z[i])) } 35 | # close stream and save to file 36 | dev.off() 37 | 38 | # add constant and scale X 39 | additive_constant = 3 40 | scale_factor = 2 41 | 42 | # calculate new expectation 43 | new_x = x*scale_factor+additive_constant 44 | new_mu = sum(new_x*y) 45 | 46 | all_x = c(x,new_x) 47 | all_y = c(y,y) 48 | 49 | png("scaledRV.png", width=8, height=8, units="in", res=300) 50 | 51 | plot(all_x, all_y, xlab="Z/X", ylab="P(Z=z)/P(X=x)") 52 | segments(x0=x, y0=rep(0,8), y1=y, col="blue") 53 | segments(x0=new_x, y0=rep(0,8), y1=y, col="red") 54 | arrows(mu,-0.03,mu,-.001,xpd=T, col="blue") 55 | arrows(new_mu,-0.03,new_mu,-.001,xpd=T, col="red") 56 | dev.off() -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Core latex/pdflatex auxiliary files: 2 | *.aux 3 | *.lof 4 | *.log 5 | *.lot 6 | *.fls 7 | *.out 8 | *.toc 9 | 10 | ## Intermediate documents: 11 | *.dvi 12 | *-converted-to.* 13 | # these rules might exclude image files for figures etc.
14 | # *.ps 15 | # *.eps 16 | # *.pdf 17 | 18 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 19 | *.bbl 20 | *.bcf 21 | *.blg 22 | *-blx.aux 23 | *-blx.bib 24 | *.brf 25 | *.run.xml 26 | 27 | ## Build tool auxiliary files: 28 | *.fdb_latexmk 29 | *.synctex 30 | *.synctex.gz 31 | *.synctex.gz(busy) 32 | *.pdfsync 33 | 34 | ## Auxiliary and intermediate files from other packages: 35 | 36 | 37 | # algorithms 38 | *.alg 39 | *.loa 40 | 41 | # achemso 42 | acs-*.bib 43 | 44 | # amsthm 45 | *.thm 46 | 47 | # beamer 48 | *.nav 49 | *.snm 50 | *.vrb 51 | 52 | #(e)ledmac/(e)ledpar 53 | *.end 54 | *.[1-9] 55 | *.[1-9][0-9] 56 | *.[1-9][0-9][0-9] 57 | *.[1-9]R 58 | *.[1-9][0-9]R 59 | *.[1-9][0-9][0-9]R 60 | *.eledsec[1-9] 61 | *.eledsec[1-9]R 62 | *.eledsec[1-9][0-9] 63 | *.eledsec[1-9][0-9]R 64 | *.eledsec[1-9][0-9][0-9] 65 | *.eledsec[1-9][0-9][0-9]R 66 | 67 | # glossaries 68 | *.acn 69 | *.acr 70 | *.glg 71 | *.glo 72 | *.gls 73 | 74 | # gnuplottex 75 | *-gnuplottex-* 76 | 77 | # hyperref 78 | *.brf 79 | 80 | # knitr 81 | *-concordance.tex 82 | *.tikz 83 | *-tikzDictionary 84 | 85 | # listings 86 | *.lol 87 | 88 | # makeidx 89 | *.idx 90 | *.ilg 91 | *.ind 92 | *.ist 93 | 94 | # minitoc 95 | *.maf 96 | *.mtc 97 | *.mtc[0-9] 98 | *.mtc[1-9][0-9] 99 | 100 | # minted 101 | _minted* 102 | *.pyg 103 | 104 | # morewrites 105 | *.mw 106 | 107 | # mylatexformat 108 | *.fmt 109 | 110 | # nomencl 111 | *.nlo 112 | 113 | # sagetex 114 | *.sagetex.sage 115 | *.sagetex.py 116 | *.sagetex.scmd 117 | 118 | # sympy 119 | *.sout 120 | *.sympy 121 | sympy-plots-for-*.tex/ 122 | 123 | # TikZ & PGF 124 | *.dpth 125 | *.md5 126 | *.auxlock 127 | 128 | # todonotes 129 | *.tdo 130 | 131 | # xindy 132 | *.xdy 133 | 134 | # WinEdt 135 | *.bak 136 | *.sav 137 | 138 | chapter4/chapter4.synctex.gz 139 | 140 | chapter5/chapter5.synctex_conflict-20150923-103239.gz 141 | 142 | chapter3/chapter3.rel 143 | 144 | chapter5/chapter5.synctex_conflict-20150923-151940.gz 145 | 146 | **/cache/ 147 | chapter7/figure/binaryEntropy-1.pdf 148 | 149 | chapter7/figure/binaryEntropy-1.pdf 150 | 151 | fullscript/figure/binaryEntropy-1.pdf 152 | 153 | .DS_Store 154 | .texpadtmp 155 | -------------------------------------------------------------------------------- /fullscript/BasicProbabilityAndStatistics.Rnw: -------------------------------------------------------------------------------- 1 | \documentclass[11pt,leqno,a4paper]{report} 2 | 3 | \usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem, tikz, bbm} 4 | \usepackage{nicefrac} 5 | \usetikzlibrary{bayesnet} 6 | \usepackage{hyperref} 7 | \hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true} 8 | % for Python code 9 | \usepackage[procnames]{listings} 10 | \definecolor{keywords}{RGB}{255,0,90} 11 | \definecolor{comments}{RGB}{0,0,113} 12 | \definecolor{red}{RGB}{160,0,0} 13 | \definecolor{green}{RGB}{0,150,0} 14 | 15 | \lstset{language=Python, 16 | basicstyle=\tt\small, 17 | keywordstyle=\color{keywords}, 18 | commentstyle=\color{comments}, 19 | stringstyle=\color{red}, 20 | showstringspaces=false, 21 | identifierstyle=\color{green}, 22 | procnamekeys={def,class}} 23 | % 24 | \usepackage{venndiagram} 25 | 26 | \newmdtheoremenv{Theorem}{Theorem}[chapter] 27 | \newmdtheoremenv{Definition}[Theorem]{Definition} 28 | \newmdtheoremenv{Exercise}[Theorem]{Exercise} 29 | \newmdtheoremenv{Lemma}[Theorem]{Lemma} 30 | 31 | \newcommand{\supp}{\operatorname{supp}} 32 | \newcommand{\E}{\mathbb{E}} 33 | \newcommand{\var}{\operatorname{var}} 34 | 
\newcommand{\eps}{\varepsilon} 35 | \newcommand{\id}[1]{\mathbbm{1}\left(#1\right)} 36 | 37 | \DeclareSymbolFont{extraup}{U}{zavm}{m}{n} 38 | \DeclareMathSymbol{\varheart}{\mathalpha}{extraup}{86} 39 | \DeclareMathSymbol{\vardiamond}{\mathalpha}{extraup}{87} 40 | 41 | \newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}} 42 | \newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}} 43 | 44 | 45 | \author{Philip Schulz \\ Christian Schaffner} 46 | \title{Basic Probability and Statistics} 47 | \date{last modified: \today} 48 | 49 | \begin{document} 50 | 51 | <>= 52 | library(knitr) 53 | @ 54 | 55 | \begin{titlepage} 56 | \maketitle 57 | \end{titlepage} 58 | 59 | \pagenumbering{roman} 60 | \tableofcontents 61 | \graphicspath{{../chapter3/}{../chapter5/}{../chapter6/}} 62 | 63 | % insert preface 64 | \newpage 65 | \input{../contributors/contributors} 66 | \clearpage 67 | \setcounter{page}{1} 68 | \pagenumbering{arabic} 69 | \input{../chapter1/chapter1_forInclude} 70 | \input{../chapter2/chapter2_forInclude} 71 | \input{../chapter3/chapter3_forInclude} 72 | \input{../chapter4/chapter4_forInclude} 73 | \input{../chapter5/chapter5_forInclude} 74 | <>= 75 | @ 76 | <>= 77 | @ 78 | 79 | 80 | \end{document} 81 | -------------------------------------------------------------------------------- /chapter5/makePlots.R: -------------------------------------------------------------------------------- 1 | # Script to generate likelihood plots for chapter 5. 2 | # Plots are based on samples of bit-sequences of length 10 3 | # Author: Philip Schulz 4 | 5 | sequence_length = 10 6 | sparse_samples = 2 7 | dense_samples = 50 8 | theta = 0.7 9 | 10 | dense_data = rbinom(dense_samples, sequence_length, theta) 11 | sparse_data1 = rbinom(sparse_samples, sequence_length, theta) 12 | sparse_data2 = rbinom(sparse_samples, sequence_length, theta) 13 | sparse_data3 = rbinom(sparse_samples, sequence_length, theta) 14 | 15 | dense_likelihood = double() 16 | sparse_likelihood1 = double() 17 | sparse_likelihood2 = double() 18 | sparse_likelihood3 = double() 19 | 20 | params = seq(0,1,0.001) 21 | 22 | for (param in params) { dense_likelihood = c(dense_likelihood, prod(dbinom(dense_data, sequence_length, param))) } 23 | for (param in params) { sparse_likelihood1 = c(sparse_likelihood1, prod(dbinom(sparse_data1, sequence_length, param))) } 24 | for (param in params) { sparse_likelihood2 = c(sparse_likelihood2, prod(dbinom(sparse_data2, sequence_length, param))) } 25 | for (param in params) { sparse_likelihood3= c(sparse_likelihood3, prod(dbinom(sparse_data3, sequence_length, param))) } 26 | 27 | dense_mode = max(dense_likelihood) 28 | sparse_mode1 = max(sparse_likelihood1) 29 | sparse_mode2 = max(sparse_likelihood2) 30 | sparse_mode3 = max(sparse_likelihood3) 31 | highest_mode = max(c(sparse_mode1, sparse_mode2, sparse_mode3)) 32 | 33 | dense_mode_idx = match(dense_mode, dense_likelihood)/length(params) 34 | sparse_mode1_idx = match(sparse_mode1, sparse_likelihood1)/length(params) 35 | sparse_mode2_idx = match(sparse_mode2, sparse_likelihood2)/length(params) 36 | sparse_mode3_idx = match(sparse_mode3, sparse_likelihood3)/length(params) 37 | 38 | png("sparse_likelihood.png", width=8, height=8, units="in", res=300) 39 | plot(params, sparse_likelihood1, xlab=expression(Theta), ylab="Likelihood", type ="l", col="blue", ylim = c(0,highest_mode)) 40 | axis(1,at = seq(0,10,0.1)) 41 | lines(params, sparse_likelihood2, col="green") 42 | lines(params, sparse_likelihood3, col="red") 43 | segments(x0=sparse_mode1_idx, y0=0, 
x1=sparse_mode1_idx, sparse_mode1) 44 | segments(x0=sparse_mode2_idx, y0=0, x1=sparse_mode2_idx, sparse_mode2) 45 | segments(x0=sparse_mode3_idx, y0=0, x1=sparse_mode3_idx, sparse_mode3) 46 | dev.off() 47 | 48 | png("dense_likelihood.png", width=8, height=8, units="in", res=300) 49 | plot(params, dense_likelihood, xlab=expression(Theta), ylab="Likelihood", type ="l", col="red") 50 | axis(1,at = seq(0,10,0.1)) 51 | segments(x0=dense_mode_idx, y0=0, x1=dense_mode_idx, dense_mode) 52 | dev.off() 53 | 54 | -------------------------------------------------------------------------------- /chapter5/BernoulliData.txt: -------------------------------------------------------------------------------- 1 | 86 94 85 81 88 2 | 80 82 84 89 84 3 | 82 81 85 84 80 4 | 87 86 86 88 87 5 | 87 88 80 83 80 6 | 79 83 92 81 88 7 | 86 78 80 82 82 8 | 84 85 87 90 82 9 | 82 75 81 83 86 10 | 86 83 71 85 84 11 | 87 82 79 84 87 12 | 79 83 85 82 87 13 | 82 85 90 85 86 14 | 83 84 87 82 84 15 | 84 83 90 84 84 16 | 85 82 87 75 85 17 | 92 87 83 87 82 18 | 80 86 84 89 88 19 | 90 83 79 84 78 20 | 82 84 81 89 84 21 | 84 86 80 84 82 22 | 87 86 85 81 88 23 | 81 82 85 81 79 24 | 85 83 88 86 90 25 | 81 83 77 77 90 26 | 86 90 87 84 83 27 | 86 79 88 79 86 28 | 88 82 74 83 77 29 | 79 85 84 78 90 30 | 83 85 87 80 78 31 | 87 82 86 81 90 32 | 85 89 84 85 81 33 | 87 85 82 86 87 34 | 79 86 86 79 82 35 | 89 88 82 86 84 36 | 73 83 84 86 82 37 | 83 81 80 81 78 38 | 85 79 86 76 77 39 | 82 83 82 81 88 40 | 83 81 79 84 80 41 | 86 81 84 90 77 42 | 84 87 88 85 81 43 | 86 86 87 80 84 44 | 86 84 90 75 82 45 | 82 83 84 84 88 46 | 80 79 87 82 82 47 | 82 87 80 80 84 48 | 79 82 79 80 87 49 | 83 83 77 86 84 50 | 83 85 83 91 92 51 | 85 87 88 88 88 52 | 87 75 84 79 80 53 | 80 87 86 89 85 54 | 79 84 75 90 87 55 | 86 83 86 86 81 56 | 87 79 88 87 88 57 | 87 84 91 80 81 58 | 85 83 81 84 83 59 | 84 83 81 87 80 60 | 87 86 90 89 84 61 | 86 85 85 83 85 62 | 84 84 91 88 85 63 | 77 73 86 80 83 64 | 80 81 84 83 84 65 | 83 83 90 85 81 66 | 87 83 79 89 81 67 | 84 81 85 85 88 68 | 85 85 82 89 86 69 | 89 85 91 84 81 70 | 88 75 82 82 81 71 | 88 84 83 87 87 72 | 84 85 87 89 88 73 | 89 82 81 79 91 74 | 82 80 86 86 85 75 | 86 80 84 86 79 76 | 87 82 87 84 82 77 | 85 82 82 82 88 78 | 82 86 76 85 90 79 | 85 83 86 89 85 80 | 92 92 89 79 81 81 | 87 89 81 88 83 82 | 88 86 88 86 87 83 | 89 81 84 86 85 84 | 87 88 89 81 83 85 | 83 85 82 83 75 86 | 82 88 76 80 82 87 | 89 86 81 90 86 88 | 88 84 92 84 77 89 | 85 82 89 85 88 90 | 77 87 83 91 86 91 | 83 85 90 94 76 92 | 73 81 82 77 77 93 | 84 90 81 79 85 94 | 90 83 80 85 86 95 | 83 84 85 87 88 96 | 80 80 87 81 82 97 | 87 84 85 86 80 98 | 92 82 77 84 85 99 | 86 83 82 81 84 100 | 87 86 82 84 83 101 | 82 86 82 82 79 102 | 84 86 84 78 85 103 | 88 83 76 83 83 104 | 89 81 84 85 87 105 | 76 89 79 85 77 106 | 79 81 80 87 85 107 | 81 90 85 89 84 108 | 92 78 78 87 84 109 | 85 85 85 77 87 110 | 79 81 84 81 81 111 | 76 83 91 83 86 112 | 81 86 82 86 86 113 | 82 88 80 91 85 114 | 85 78 83 89 83 115 | 85 81 84 86 85 116 | 89 89 86 86 88 117 | 80 85 82 84 73 118 | 87 81 83 86 85 119 | 79 87 80 81 85 120 | 82 88 85 86 81 121 | 81 84 86 84 84 122 | 83 80 83 86 87 123 | 85 88 85 87 85 124 | 88 83 84 78 81 125 | 86 88 79 89 86 126 | 92 84 84 82 83 127 | 82 87 87 86 87 128 | 79 89 82 85 85 129 | 87 86 81 83 83 130 | 88 86 86 80 80 131 | 86 85 79 88 86 132 | 82 89 86 84 85 133 | 83 83 78 83 83 134 | 91 88 87 84 85 135 | 75 82 84 82 85 136 | 85 82 83 84 79 137 | 81 89 84 84 89 138 | 81 84 82 90 89 139 | 80 82 89 85 80 140 | 86 86 90 91 81 141 | 82 79 81 86 88 
142 | 94 80 87 86 85 143 | 82 87 83 81 83 144 | 83 83 77 89 82 145 | 82 82 81 84 91 146 | 75 90 87 79 88 147 | 83 89 82 83 85 148 | 79 86 86 85 89 149 | 88 81 81 82 85 150 | 83 90 81 72 78 151 | 86 84 85 76 86 152 | 89 78 80 82 87 153 | 82 83 84 87 80 154 | 83 82 86 90 87 155 | 83 84 85 80 88 156 | 77 84 84 86 87 157 | 81 89 84 84 80 158 | 80 82 82 83 92 159 | 82 80 84 85 80 160 | 79 78 80 78 86 161 | 87 82 85 85 77 162 | 83 84 88 92 86 163 | 87 83 84 84 83 164 | 84 82 84 88 90 165 | 80 84 76 81 75 166 | 88 87 90 86 89 167 | 82 87 85 85 88 168 | 82 76 86 79 82 169 | 87 89 92 76 78 170 | 85 81 89 84 80 171 | 81 80 85 82 81 172 | 90 89 84 85 78 173 | 84 78 80 85 89 174 | 72 80 84 88 79 175 | 85 84 75 87 79 176 | 82 75 91 81 85 177 | 88 87 83 84 82 178 | 89 84 86 83 81 179 | 87 90 84 86 86 180 | 85 89 82 83 91 181 | 85 81 83 84 80 182 | 86 92 79 84 87 183 | 80 83 83 77 88 184 | 87 83 90 80 85 185 | 82 84 84 77 86 186 | 84 93 86 86 80 187 | 78 86 85 86 81 188 | 82 81 81 84 84 189 | 83 87 81 83 79 190 | 83 83 83 84 84 191 | 76 80 85 83 79 192 | 80 78 82 86 81 193 | 84 78 76 82 81 194 | 82 88 84 81 83 195 | 80 83 81 88 81 196 | 90 77 88 86 82 197 | 86 87 88 84 88 198 | 79 79 84 88 86 199 | 86 92 79 86 82 200 | 81 88 85 78 82 201 | -------------------------------------------------------------------------------- /chapter6/tikzlibrarybayesnet.code.tex: -------------------------------------------------------------------------------- 1 | % tikzlibrary.code.tex 2 | % 3 | % Copyright 2010-2011 by Laura Dietz 4 | % Copyright 2012 by Jaakko Luttinen 5 | % 6 | % This file may be distributed and/or modified 7 | % 8 | % 1. under the LaTeX Project Public License and/or 9 | % 2. under the GNU General Public License. 10 | % 11 | % See the files LICENSE_LPPL and LICENSE_GPL for more details. 12 | 13 | % Load other libraries 14 | \usetikzlibrary{shapes} 15 | \usetikzlibrary{fit} 16 | \usetikzlibrary{chains} 17 | \usetikzlibrary{arrows} 18 | 19 | % Latent node 20 | \tikzstyle{latent} = [circle,fill=white,draw=black,inner sep=1pt, 21 | minimum size=20pt, font=\fontsize{10}{10}\selectfont, node distance=1] 22 | % Observed node 23 | \tikzstyle{obs} = [latent,fill=gray!25] 24 | % Constant node 25 | \tikzstyle{const} = [rectangle, inner sep=0pt, node distance=1] 26 | % Factor node 27 | \tikzstyle{factor} = [rectangle, fill=black,minimum size=5pt, inner 28 | sep=0pt, node distance=0.4] 29 | % Deterministic node 30 | \tikzstyle{det} = [latent, diamond] 31 | 32 | % Plate node 33 | \tikzstyle{plate} = [draw, rectangle, rounded corners, fit=#1] 34 | % Invisible wrapper node 35 | \tikzstyle{wrap} = [inner sep=0pt, fit=#1] 36 | % Gate 37 | \tikzstyle{gate} = [draw, rectangle, dashed, fit=#1] 38 | 39 | % Caption node 40 | \tikzstyle{caption} = [font=\footnotesize, node distance=0] % 41 | \tikzstyle{plate caption} = [caption, node distance=0, inner sep=0pt, 42 | below left=5pt and 0pt of #1.south east] % 43 | \tikzstyle{factor caption} = [caption] % 44 | \tikzstyle{every label} += [caption] % 45 | 46 | \tikzset{>={triangle 45}} 47 | 48 | %\pgfdeclarelayer{b} 49 | %\pgfdeclarelayer{f} 50 | %\pgfsetlayers{b,main,f} 51 | 52 | % \factoredge [options] {inputs} {factors} {outputs} 53 | \newcommand{\factoredge}[4][]{ % 54 | % Connect all nodes #2 to all nodes #4 via all factors #3. 
55 | \foreach \f in {#3} { % 56 | \foreach \x in {#2} { % 57 | \path (\x) edge[-,#1] (\f) ; % 58 | %\draw[-,#1] (\x) edge[-] (\f) ; % 59 | } ; 60 | \foreach \y in {#4} { % 61 | \path (\f) edge[->,#1] (\y) ; % 62 | %\draw[->,#1] (\f) -- (\y) ; % 63 | } ; 64 | } ; 65 | } 66 | 67 | % \edge [options] {inputs} {outputs} 68 | \newcommand{\edge}[3][]{ % 69 | % Connect all nodes #2 to all nodes #3. 70 | \foreach \x in {#2} { % 71 | \foreach \y in {#3} { % 72 | \path (\x) edge [->,#1] (\y) ;% 73 | %\draw[->,#1] (\x) -- (\y) ;% 74 | } ; 75 | } ; 76 | } 77 | 78 | % \factor [options] {name} {caption} {inputs} {outputs} 79 | \newcommand{\factor}[5][]{ % 80 | % Draw the factor node. Use alias to allow empty names. 81 | \node[factor, label={[name=#2-caption]#3}, name=#2, #1, 82 | alias=#2-alias] {} ; % 83 | % Connect all inputs to outputs via this factor 84 | \factoredge {#4} {#2-alias} {#5} ; % 85 | } 86 | 87 | % \plate [options] {name} {fitlist} {caption} 88 | \newcommand{\plate}[4][]{ % 89 | \node[wrap=#3] (#2-wrap) {}; % 90 | \node[plate caption=#2-wrap] (#2-caption) {#4}; % 91 | \node[plate=(#2-wrap)(#2-caption), #1] (#2) {}; % 92 | } 93 | 94 | % \gate [options] {name} {fitlist} {inputs} 95 | \newcommand{\gate}[4][]{ % 96 | \node[gate=#3, name=#2, #1, alias=#2-alias] {}; % 97 | \foreach \x in {#4} { % 98 | \draw [-*,thick] (\x) -- (#2-alias); % 99 | } ;% 100 | } 101 | 102 | % \vgate {name} {fitlist-left} {caption-left} {fitlist-right} 103 | % {caption-right} {inputs} 104 | \newcommand{\vgate}[6]{ % 105 | % Wrap the left and right parts 106 | \node[wrap=#2] (#1-left) {}; % 107 | \node[wrap=#4] (#1-right) {}; % 108 | % Draw the gate 109 | \node[gate=(#1-left)(#1-right)] (#1) {}; % 110 | % Add captions 111 | \node[caption, below left=of #1.north ] (#1-left-caption) 112 | {#3}; % 113 | \node[caption, below right=of #1.north ] (#1-right-caption) 114 | {#5}; % 115 | % Draw middle separation 116 | \draw [-, dashed] (#1.north) -- (#1.south); % 117 | % Draw inputs 118 | \foreach \x in {#6} { % 119 | \draw [-*,thick] (\x) -- (#1); % 120 | } ;% 121 | } 122 | 123 | % \hgate {name} {fitlist-top} {caption-top} {fitlist-bottom} 124 | % {caption-bottom} {inputs} 125 | \newcommand{\hgate}[6]{ % 126 | % Wrap the left and right parts 127 | \node[wrap=#2] (#1-top) {}; % 128 | \node[wrap=#4] (#1-bottom) {}; % 129 | % Draw the gate 130 | \node[gate=(#1-top)(#1-bottom)] (#1) {}; % 131 | % Add captions 132 | \node[caption, above right=of #1.west ] (#1-top-caption) 133 | {#3}; % 134 | \node[caption, below right=of #1.west ] (#1-bottom-caption) 135 | {#5}; % 136 | % Draw middle separation 137 | \draw [-, dashed] (#1.west) -- (#1.east); % 138 | % Draw inputs 139 | \foreach \x in {#6} { % 140 | \draw [-*,thick] (\x) -- (#1); % 141 | } ;% 142 | } 143 | 144 | -------------------------------------------------------------------------------- /fullscript/tikzlibrarybayesnet.code.tex: -------------------------------------------------------------------------------- 1 | % tikzlibrary.code.tex 2 | % 3 | % Copyright 2010-2011 by Laura Dietz 4 | % Copyright 2012 by Jaakko Luttinen 5 | % 6 | % This file may be distributed and/or modified 7 | % 8 | % 1. under the LaTeX Project Public License and/or 9 | % 2. under the GNU General Public License. 10 | % 11 | % See the files LICENSE_LPPL and LICENSE_GPL for more details. 
12 | 13 | % Load other libraries 14 | \usetikzlibrary{shapes} 15 | \usetikzlibrary{fit} 16 | \usetikzlibrary{chains} 17 | \usetikzlibrary{arrows} 18 | 19 | % Latent node 20 | \tikzstyle{latent} = [circle,fill=white,draw=black,inner sep=1pt, 21 | minimum size=20pt, font=\fontsize{10}{10}\selectfont, node distance=1] 22 | % Observed node 23 | \tikzstyle{obs} = [latent,fill=gray!25] 24 | % Constant node 25 | \tikzstyle{const} = [rectangle, inner sep=0pt, node distance=1] 26 | % Factor node 27 | \tikzstyle{factor} = [rectangle, fill=black,minimum size=5pt, inner 28 | sep=0pt, node distance=0.4] 29 | % Deterministic node 30 | \tikzstyle{det} = [latent, diamond] 31 | 32 | % Plate node 33 | \tikzstyle{plate} = [draw, rectangle, rounded corners, fit=#1] 34 | % Invisible wrapper node 35 | \tikzstyle{wrap} = [inner sep=0pt, fit=#1] 36 | % Gate 37 | \tikzstyle{gate} = [draw, rectangle, dashed, fit=#1] 38 | 39 | % Caption node 40 | \tikzstyle{caption} = [font=\footnotesize, node distance=0] % 41 | \tikzstyle{plate caption} = [caption, node distance=0, inner sep=0pt, 42 | below left=5pt and 0pt of #1.south east] % 43 | \tikzstyle{factor caption} = [caption] % 44 | \tikzstyle{every label} += [caption] % 45 | 46 | \tikzset{>={triangle 45}} 47 | 48 | %\pgfdeclarelayer{b} 49 | %\pgfdeclarelayer{f} 50 | %\pgfsetlayers{b,main,f} 51 | 52 | % \factoredge [options] {inputs} {factors} {outputs} 53 | \newcommand{\factoredge}[4][]{ % 54 | % Connect all nodes #2 to all nodes #4 via all factors #3. 55 | \foreach \f in {#3} { % 56 | \foreach \x in {#2} { % 57 | \path (\x) edge[-,#1] (\f) ; % 58 | %\draw[-,#1] (\x) edge[-] (\f) ; % 59 | } ; 60 | \foreach \y in {#4} { % 61 | \path (\f) edge[->,#1] (\y) ; % 62 | %\draw[->,#1] (\f) -- (\y) ; % 63 | } ; 64 | } ; 65 | } 66 | 67 | % \edge [options] {inputs} {outputs} 68 | \newcommand{\edge}[3][]{ % 69 | % Connect all nodes #2 to all nodes #3. 70 | \foreach \x in {#2} { % 71 | \foreach \y in {#3} { % 72 | \path (\x) edge [->,#1] (\y) ;% 73 | %\draw[->,#1] (\x) -- (\y) ;% 74 | } ; 75 | } ; 76 | } 77 | 78 | % \factor [options] {name} {caption} {inputs} {outputs} 79 | \newcommand{\factor}[5][]{ % 80 | % Draw the factor node. Use alias to allow empty names. 
81 | \node[factor, label={[name=#2-caption]#3}, name=#2, #1, 82 | alias=#2-alias] {} ; % 83 | % Connect all inputs to outputs via this factor 84 | \factoredge {#4} {#2-alias} {#5} ; % 85 | } 86 | 87 | % \plate [options] {name} {fitlist} {caption} 88 | \newcommand{\plate}[4][]{ % 89 | \node[wrap=#3] (#2-wrap) {}; % 90 | \node[plate caption=#2-wrap] (#2-caption) {#4}; % 91 | \node[plate=(#2-wrap)(#2-caption), #1] (#2) {}; % 92 | } 93 | 94 | % \gate [options] {name} {fitlist} {inputs} 95 | \newcommand{\gate}[4][]{ % 96 | \node[gate=#3, name=#2, #1, alias=#2-alias] {}; % 97 | \foreach \x in {#4} { % 98 | \draw [-*,thick] (\x) -- (#2-alias); % 99 | } ;% 100 | } 101 | 102 | % \vgate {name} {fitlist-left} {caption-left} {fitlist-right} 103 | % {caption-right} {inputs} 104 | \newcommand{\vgate}[6]{ % 105 | % Wrap the left and right parts 106 | \node[wrap=#2] (#1-left) {}; % 107 | \node[wrap=#4] (#1-right) {}; % 108 | % Draw the gate 109 | \node[gate=(#1-left)(#1-right)] (#1) {}; % 110 | % Add captions 111 | \node[caption, below left=of #1.north ] (#1-left-caption) 112 | {#3}; % 113 | \node[caption, below right=of #1.north ] (#1-right-caption) 114 | {#5}; % 115 | % Draw middle separation 116 | \draw [-, dashed] (#1.north) -- (#1.south); % 117 | % Draw inputs 118 | \foreach \x in {#6} { % 119 | \draw [-*,thick] (\x) -- (#1); % 120 | } ;% 121 | } 122 | 123 | % \hgate {name} {fitlist-top} {caption-top} {fitlist-bottom} 124 | % {caption-bottom} {inputs} 125 | \newcommand{\hgate}[6]{ % 126 | % Wrap the left and right parts 127 | \node[wrap=#2] (#1-top) {}; % 128 | \node[wrap=#4] (#1-bottom) {}; % 129 | % Draw the gate 130 | \node[gate=(#1-top)(#1-bottom)] (#1) {}; % 131 | % Add captions 132 | \node[caption, above right=of #1.west ] (#1-top-caption) 133 | {#3}; % 134 | \node[caption, below right=of #1.west ] (#1-bottom-caption) 135 | {#5}; % 136 | % Draw middle separation 137 | \draw [-, dashed] (#1.west) -- (#1.east); % 138 | % Draw inputs 139 | \foreach \x in {#6} { % 140 | \draw [-*,thick] (\x) -- (#1); % 141 | } ;% 142 | } 143 | 144 | -------------------------------------------------------------------------------- /additionalMaterial/sufficient-statistics.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,10pt,landscape,twocolumn]{scrartcl} 2 | 3 | %% Settings 4 | \newcommand\problemset{4} 5 | \newcommand\deadline{Wednesday September 28th, 22:00h} 6 | \newif\ifcomments 7 | \commentsfalse % hide comments 8 | %\commentstrue % show comments 9 | 10 | % Packages 11 | \usepackage{enumitem} 12 | \usepackage[usenames,dvipsnames]{color} 13 | \usepackage{multicol} 14 | 15 | \usepackage{amsmath,amsthm,amssymb} 16 | \usepackage[empty]{fullpage} 17 | \usepackage{comment} 18 | 19 | % Styling 20 | \usepackage{tgpagella} 21 | \usepackage{AlegreyaSans} 22 | \setkomafont{section}{\Large\textsc} 23 | \RedeclareSectionCommand[afterskip=.3\baselineskip]{subsection} 24 | \setlength{\columnsep}{7em} 25 | \definecolor{gray}{gray}{.4} 26 | \definecolor{RED}{rgb}{.5,0,0} 27 | \renewcommand*{\pagemark}{} 28 | 29 | \usepackage{hyperref} 30 | \DeclareMathOperator{\Cov}{Cov} 31 | \DeclareMathOperator{\Cor}{Cor} 32 | \DeclareMathOperator{\Var}{Var} 33 | 34 | \begin{document} 35 | {\sffamily\flushleft\color{gray} 36 | \textsc{\bfseries basic probability: theory}\\ 37 | Master of Logic, University of Amsterdam, 2016\\ 38 | \textsc{teachers} Christian Schaffner and Philip Schulz 39 | \textsc{ta} Bas Cornelissen% 40 | } 41 | {\sffamily\flushleft\huge\bfseries 42 | Some 
-------------------------------------------------------------------------------- /additionalMaterial/sufficient-statistics.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper,10pt,landscape,twocolumn]{scrartcl} 2 | 3 | %% Settings 4 | \newcommand\problemset{4} 5 | \newcommand\deadline{Wednesday September 28th, 22:00h} 6 | \newif\ifcomments 7 | \commentsfalse % hide comments 8 | %\commentstrue % show comments 9 | 10 | % Packages 11 | \usepackage{enumitem} 12 | \usepackage[usenames,dvipsnames]{color} 13 | \usepackage{multicol} 14 | 15 | \usepackage{amsmath,amsthm,amssymb} 16 | \usepackage[empty]{fullpage} 17 | \usepackage{comment} 18 | 19 | % Styling 20 | \usepackage{tgpagella} 21 | \usepackage{AlegreyaSans} 22 | \setkomafont{section}{\Large\textsc} 23 | \RedeclareSectionCommand[afterskip=.3\baselineskip]{subsection} 24 | \setlength{\columnsep}{7em} 25 | \definecolor{gray}{gray}{.4} 26 | \definecolor{RED}{rgb}{.5,0,0} 27 | \renewcommand*{\pagemark}{} 28 | 29 | \usepackage{hyperref} 30 | \DeclareMathOperator{\Cov}{Cov} 31 | \DeclareMathOperator{\Cor}{Cor} 32 | \DeclareMathOperator{\Var}{Var} 33 | 34 | \begin{document} 35 | {\sffamily\flushleft\color{gray} 36 | \textsc{\bfseries basic probability: theory}\\ 37 | Master of Logic, University of Amsterdam, 2016\\ 38 | \textsc{teachers} Christian Schaffner and Philip Schulz\\ 39 | \textsc{ta} Bas Cornelissen% 40 | } 41 | {\sffamily\flushleft\huge\bfseries 42 | Some notes on sufficient statistics 43 | }\\[1em]% 44 | 45 | \noindent 46 | \paragraph{The exercise} In exercise 1 of this week's board questions, you were given a set $x_1^n = (x_1, \dots, x_n)$ of $n$ i.i.d.\ observations that were all geometrically distributed. So they are observations of RVs $X_1, \dots, X_n$ where 47 | \[ 48 | P(X_i = x_i \mid \Theta = \theta) = \text{Geom}(x_i \mid \theta) = (1-\theta)^{x_i} \theta. 49 | \] 50 | We had to show that $t := T(x_1^n) = \sum_{i=1}^n x_i$ is a sufficient statistic. 51 | 52 | What does that even mean? By the Factorization Theorem, it suffices to find two functions $g(\theta, t)$ and $h(x, t)$ such that 53 | \begin{align} 54 | P(X_1^n = x_1^n \mid \Theta = \theta) = g(\theta, t) \cdot h(x_1^n, t). 55 | \end{align} 56 | So what is our joint distribution? For legibility, we'll drop the random variables and write e.g. $P(x_1^n \mid \theta) := P(X_1^n = x_1^n \mid \Theta = \theta )$. By independence this is: 57 | \begin{align} 58 | P(x_1^n \mid \theta) = \prod_{i=1}^n (1-\theta)^{x_i} \theta = (1-\theta)^{\sum_{i=1}^n x_i} \cdot \theta^n. 59 | \end{align} 60 | 61 | \paragraph{The answer} 62 | Now observe that this is simply $(1-\theta)^t \theta^n$, so when we choose $g(\theta, t) := (1-\theta)^t \theta^n$ and $h(x, t) := 1$ we have found a factorization of the joint. By the Factorization Theorem, $t$ is thus a sufficient statistic. 63 | 64 | \paragraph{But why?} 65 | True as that may be, this feels a bit unsatisfactory. After all, the idea was that given the value of the sufficient statistic, it should be possible to write the PMF without using the parameter. The Factorization Theorem tells you \emph{that} this is possible, but it doesn't tell you \emph{how} to do it. 66 | 67 | Or does it? In fact, the proof does. We essentially have to expand the conditional distribution of $x_1^n$ given $t$ and $\theta$: 68 | \begin{align}\label{eq:blabla} 69 | P(x_1^n \mid t, \theta) 70 | &= \frac{p(x_1^n, t \mid \theta)}{p(t \mid \theta)} 71 | = \frac{p(x_1^n \mid \theta)}{p(t\mid \theta)} 72 | = \frac{p(x_1^n \mid \theta)}{\sum_{z_1^n: T(z_1^n) = t} p(z_1^n, t \mid \theta)}. 73 | \end{align} 74 | In the second equality we used the fact that $t$ is a deterministic function of $x_1^n$, so the probability of $x_1^n$ and $t$ is exactly the same as the probability of $x_1^n$. In the third equality we used a little trick, writing a marginal as a marginalized joint. 75 | 76 | Recall that we actually had a factorization of $p(x_1^n \mid \theta)$, which we can now substitute into \eqref{eq:blabla} to get 77 | \begin{align}\label{eq:blabla2} 78 | P(x_1^n \mid t, \theta) 79 | = \frac{g(\theta, t) \cdot h(x_1^n, t)}{\sum_{z_1^n: T(z_1^n) = t} g(\theta, t) \cdot h(z_1^n, t)} 80 | = \frac{h(x_1^n,t)}{\sum_{z_1^n: T(z_1^n) = t} h(z_1^n, t)} 81 | \end{align} 82 | And since we know $h(x_1^n, t) = 1$, we can actually calculate this as 83 | \begin{align}\label{eq:blabla3} 84 | P(x_1^n \mid t, \theta) 85 | = \frac{1}{\sum_{z_1^n: T(z_1^n) = t} 1} 86 | = \frac{1}{|\{z_1^n: T(z_1^n) = t\}|} 87 | \end{align} 88 | --- if you manage to count the set in the denominator, that is. (In our geometric example this is possible: the number of tuples of $n$ non-negative integers summing to $t$ is $\binom{t+n-1}{n-1}$, by a stars-and-bars argument.) 89 | 90 | \paragraph{The lesson} 91 | Taking a step back, consider the conditional probability of $x_1^n$ given $t$, as expressed in the first equality of \eqref{eq:blabla}. That is the thing we want to write without using $\theta$, and we can do so if we somehow manage to cancel out the $\theta$ in the numerator against the $\theta$'s in the denominator. This is precisely what happened in the last step of \eqref{eq:blabla2}.
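To see the cancellation happen numerically, here is a small R sketch (our addition, not part of the original notes; the values $n=3$ and $t=4$ are arbitrary). It enumerates all $z_1^n$ with $T(z_1^n) = t$ and confirms that the conditional distribution is the same uniform one for two different values of $\theta$:
\begin{verbatim}
n <- 3; t <- 4
# all n-tuples of non-negative integers (entries at most t) summing to t
grid <- expand.grid(rep(list(0:t), n))
tuples <- grid[rowSums(grid) == t, ]
joint <- function(z, theta) prod((1 - theta)^z * theta)  # product of Geom pmfs
for (theta in c(0.2, 0.7)) {
  probs <- apply(tuples, 1, joint, theta = theta)
  print(unique(round(probs / sum(probs), 12)))  # one value: the uniform weight
}
1 / choose(t + n - 1, n - 1)  # the stars-and-bars count gives the same value
\end{verbatim}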
When working with actual distributions, however, this might be very difficult. Also, finding the actual distribution \emph{without} the $\theta$ need not be easy: you have to deal with the sum in \eqref{eq:blabla2}. 92 | 93 | What else should now be clear? For example: if we have data $x_1^n$ and $y_1^n$ with the same sufficient statistic $T(x_1^n) = T(y_1^n) = t$, drawn from two distributions, with parameters $\theta$ and $\theta'$, then by \eqref{eq:blabla2} 94 | \[ 95 | P(x_1^n \mid t, \theta) = P(y_1^n \mid t, \theta'). 96 | \] 97 | 98 | We can also say something about the original distributions, not conditioned on $t$. The distributions of $x_1^n$ and $y_1^n$ differ from one another only in the normalizing constant. We can make that more explicit as follows: 99 | \begin{align} 100 | P(x_1^n \mid \theta) 101 | &= P(t \mid \theta) \cdot P(x_1^n \mid t, \theta)\\ 102 | &= P(t \mid \theta) \cdot P(y_1^n \mid t, \theta') \\ 103 | &=\frac{P(t \mid \theta)}{P(t \mid \theta')} \cdot P(t \mid \theta') \cdot P(y_1^n \mid t, \theta')\\ 104 | &= \frac{P(t \mid \theta)}{P(t \mid \theta')} \cdot P(y_1^n \mid \theta') 105 | \end{align} 106 | 107 | 108 | 109 | 110 | \end{document} 111 | -------------------------------------------------------------------------------- /chapter4/chapter4_forInclude.tex: -------------------------------------------------------------------------------- 1 | 2 | \setcounter{chapter}{3} 3 | \chapter{Bayes' rule and its applications} 4 | 5 | \section{The chain rule} 6 | 7 | This chapter is going to focus on how to re-write joint and conditional probabilities. When we turn to statistics later on, it will 8 | turn out that it is often hard to define a joint distribution over many variables. Likewise, it can be hard to calculate 9 | the probability distribution of a RV $ X $ conditioned on a RV $ Y $, but it may be much easier to find the distribution of $ Y $ 10 | conditioned on $ X $. In this chapter we are essentially trying to find simpler expressions for distributions that may be hard to 11 | compute. 12 | 13 | The first general method for simplifying a joint distribution is known as the \textbf{chain rule}. For completeness' sake, we are going to formulate the chain rule first for events and then for random variables. 14 | 15 | \begin{Theorem}{\textbf{(Chain rule)}} \label{thm:chain} 16 | The joint probability of events $ E_{1}, \ldots, E_{n} $ can be factorised as 17 | $$ \mathbb{P}(E_{1}, \ldots, E_{n}) = \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}|E_{1}) \times \ldots \times \mathbb{P}(E_{n}|E_{1}, \ldots, E_{n-1}) $$ 18 | \end{Theorem} 19 | Recall from Definition~\ref{def:jointprob} the notation 20 | $\mathbb{P}(E_1,E_2) = \mathbb{P}(E_1 \cap E_2)$ for denoting the 21 | probability that both events $E_1$ and $E_2$ occur. Also remember that 22 | we use the abbreviation $E_1^n := E_1, \ldots, E_n$; so for the case 23 | of events, we have $\mathbb{P}(E_1^n) = \mathbb{P}(\bigcap_{i=1}^n E_i)$. There are a couple of things to note about the chain rule: First of all, the numbering of the events is arbitrary. That means that it does not matter in which 24 | order we decompose the joint probability. We could just as well start with any $ E_{i} $ for $ 1 \leq i \leq n $. Second, we used the 25 | word \textit{factorise}. This simply means that we decompose any expression (in this case a joint probability) into a product. Products are 26 | nice in that we can arrange them in any order that we like (i.e.\ they commute). Moreover, products make a lot of calculations easier, as we will
Moreover, products make a lot of calculations easier, as we will 27 | see later. 28 | 29 | Let us go ahead and actually prove the chain rule. 30 | \paragraph{Proof of Theorem~\ref{thm:chain}} We are going to do so inductively and choose $ \mathbb{P}(E_{1}, E_{2}) $ as our 31 | base case. Then we simply employ the definition of conditional probability to get 32 | \begin{equation} 33 | \mathbb{P}(E_{1}, E_{2}) = \mathbb{P}(E_{1}) \times \dfrac{\mathbb{P}(E_{1}, E_{2})}{\mathbb{P}(E_{1})} = \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}|E_{1}) 34 | \end{equation} 35 | 36 | Let us assume that the chain rule holds for events $ E_{1}, \ldots, E_{n-1} $. We will abbreviate them as $ E_{1}^{n-1} $. Then we get 37 | \begin{equation} 38 | \mathbb{P}(E_{1}^{n-1}, E_{n}) = \mathbb{P}(E_{1}^{n-1}) \times \dfrac{\mathbb{P}(E_{1}^{n-1}, E_{n})}{\mathbb{P}(E_{1}^{n-1})} 39 | = \mathbb{P}(E_{1}^{n-1}) \times \mathbb{P}(E_{n}|E_{1}^{n-1}) 40 | \end{equation} 41 | 42 | Since $ \mathbb{P}(E_{1}^{n-1}) $ factorises according to the chain 43 | rule by our induction hypothesis, we have completed the proof. 44 | $ \square $\bigskip 45 | 46 | The chain rule can make our lives even simpler if we have independent events. Assume we want to compute the joint probability of 3 events 47 | $ E_{1},E_{2},E_{3} $ and we also know that $ E_{1} \bot E_{2} $. In this case our factorisation becomes \eqref{simpleFactor} where 48 | the first equality follows from the chain rule and the second equality follows from independence between $ E_{1} $ and $ E_{2} $. 49 | \begin{align} \label{simpleFactor} 50 | \mathbb{P}(E_{1}, E_{2}, E_{3}) &= \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}|E_{1}) \times \mathbb{P}(E_{3}|E_{1},E_{2}) \\ 51 | &= \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}) \times \mathbb{P}(E_{3}|E_{1},E_{2}) \nonumber 52 | \end{align} 53 | 54 | We can now state the chain rule for random variables. There are two ways you can go about proving it. Either you 55 | calculate the probability of a specific setting of the variables or you just do the proof based on the distributions of the RVs. 56 | So in the first case you would have to prove that 57 | \begin{align*} 58 | \forall x_1,\ldots,x_n: &P(X_{1} = x_{1}, \ldots, X_{n} = x_{n}) \\ 59 | &= P(X_{1} = x_{1}) \times \ldots \times P(X_{n} = x_{n}|X_{1}=x_{1}, \ldots, X_{n-1} = x_{n-1}) 60 | \end{align*} 61 | whereas in the second case you would simply prove that 62 | \begin{align*} 63 | P_{X_{1}^{n}} = \overset{n}{\underset{i=1}{\sum}}P_{X_{i}|X_{1}^{i-1}} 64 | \end{align*} 65 | 66 | Incidentally, we also introduce a very short notation for the chain rule above. Note that it is not quite correct, since if 67 | $ i = 1 $ we would be conditioning on $ X_{0} $. That is not to bad however, since we can always define ourselves a constant variable $ X_{0} $ that does not affect the distribution. Moreover, this notation is really just meant to be convenient, so you should just accept it as is when you encounter it in papers. 68 | 69 | \begin{Exercise} 70 | Prove the chain rule for random variables. The proof is totally analogous to the one given for events. 71 | \end{Exercise} 72 | 73 | \begin{Exercise} 74 | Let $X_0$ be a constant RV, i.e.\ there exists $c \in \mathbb{R}$ such that $P(X_0 = c)=1$. 75 | Prove that $X_0$ is independent of any set of other random variables $X_1,\ldots,X_n$. 76 | \end{Exercise} 77 | 78 | \section{Bayes' rule} 79 | 80 | In this section we are going to prove \textbf{Bayes' rule}. The rule follows directly from the chain rule. 
78 | \section{Bayes' rule} 79 | 80 | In this section we are going to prove \textbf{Bayes' rule}. The rule follows directly from the chain rule. 81 | The proof is really simple and thus of no great interest in and of itself. The consequences of Bayes' rule 82 | are huge, however. It will basically allow us to invert a conditional probability distribution. You may rightfully 83 | ask: what's the deal? Well, as we said in the beginning, it may be hard to compute a conditional distribution in one 84 | direction but much easier to compute it in the other direction. On top of that, Bayes' rule opens up a whole range of new possibilities. We will discuss those as we proceed in this chapter. 85 | \begin{Theorem}{\textbf{(Bayes' rule)}} 86 | The probability distribution of a random variable $ X $ given a random variable $ Y $ can be computed as 87 | $$ P_{X|Y} = \dfrac{P_{Y|X}P_{X}}{P_{Y}} $$ 88 | \end{Theorem} 89 | 90 | And here comes the proof: 91 | \begin{equation} 92 | P_{X|Y} = \dfrac{P_{XY}}{P_{Y}} = \dfrac{P_{Y|X}P_{X}}{P_{Y}} \, . \qquad \square 93 | \end{equation} 94 | 95 | That was the proof! Considering how simple it was, it will be surprising to see what kind of benefits we can get out 96 | of Bayes' rule. To get us started, let us introduce some terminology. In particular, each of the terms 97 | in Bayes' rule has a specific name. You should really learn these names by heart as they crop up all over the place. 98 | 99 | $$ \mathit{posterior} = \dfrac{\mathit{likelihood} \times \mathit{prior}}{\mathit{marginal~likelihood}} $$ 100 | 101 | The posterior is what we get after we have completed the computation. However, its name is related to the prior. 102 | The prior is just the probability $ P(X=x) $ that we would assign \textit{a priori}. Therefore $ P_{X} $ 103 | is also known as the prior distribution. When we divide the product of likelihood and prior by the 104 | marginal likelihood we get a new distribution 105 | over $ X $ that is conditioned on $ Y $. This is the distribution that we place on $ X $ \textit{a posteriori}, i.e. 106 | after having taken into account information about $ X $ that we may get from knowing the value of $ Y $. The marginal 107 | likelihood of $ Y $ is simply needed to normalize the expression to a probability distribution (i.e. to make sure that 108 | it sums to one). Why is it called marginal likelihood? The reason for this is how you can compute it. Recall that when 109 | we are given a joint distribution $ P_{XY} $, we can obtain the distribution 110 | $ P_{Y} $ by simply marginalizing over $ X $. 111 | \begin{equation} 112 | P(Y=y) = \sum_{x \in \supp(X)}P(X=x, Y=y) 113 | \end{equation} 114 | 115 | In addition to that, the chain rule allows us to factorise the joint probability. Thus we get 116 | \begin{equation} 117 | P(Y=y) = \sum_{x \in \supp(X)} P(Y=y|X=x) \times P(X=x) 118 | \end{equation} 119 | 120 | If you think that this looks an awful lot like the numerator of Bayes' rule then you are exactly on the right track. 121 | Essentially, we are just summing over all possible numerators (with respect to $ X $). Let us make this more 122 | concrete with an example. Assume that we are given two coins. One of them is fair, meaning that it is equally probable 123 | to come up heads or tails. The other coin is biased towards tails and we happen to know that its probability to come up 124 | heads is only $ 0.3 $. Which coin is flipped is captured by a random variable $ X $ that takes on the value 0 if the 125 | fair coin is used and the value 1 if the biased coin is used. We have no idea which coin is going to be tossed, it could 126 | be either one.
Therefore we set our prior to $ P(X=0) = P(X=1) = 0.5 $. 127 | 128 | We flip the chosen coin 10 times and obtain 8 heads. The number of heads obtained during the 10 tosses is going 129 | to be encoded by $ Y $. Since all tosses are independent of each other, $ Y $ will 130 | follow a binomial distribution. For each of the two coins we also know the parameter of the binomial distribution. 131 | For the fair coin it is $ \theta = 0.5 $ and for the biased coin it is $ \theta = 0.3 $. Let us compute each of the 132 | numerators separately. 133 | \begin{align} 134 | P(Y=8|X=0) \times P(X=0) &= \binom{10}{8} 0.5^8 (1-0.5)^2 \times 0.5 = 0.02195 \label{bayes1}\\ 135 | P(Y=8|X=1) \times P(X=1) &= \binom{10}{8} 0.3^8 (1-0.3)^2 \times 0.5 = 0.0007 \label{bayes2} 136 | \end{align} 137 | Remember that $ Y \sim binom(10,\theta) $ and that $ \theta=0.5 $ if $ X=0 $ and $ \theta=0.3 $ if $ X=1 $. 138 | 139 | All that is left to do is to compute the marginal likelihood of $ Y $. Luckily for us, $ X $ only assumes two 140 | values, so we only need to add up \eqref{bayes1} and \eqref{bayes2}. 141 | \begin{align} 142 | P(Y=8) = &P(Y=8|X=0) \times P(X=0) \\ 143 | &+ P(Y=8|X=1) \times P(X=1) = 0.02265 \nonumber 144 | \end{align} 145 | 146 | And finally we can apply Bayes' rule to compute the posterior probabilities of $ X $. 147 | \begin{align} 148 | P(X=0|Y=8) &= \dfrac{P(Y=8|X=0) \times P(X=0)}{P(Y=8)} \\ 149 | &= \dfrac{0.02195}{0.02265} = 0.969 \nonumber \\ 150 | P(X=1|Y=8) &= \dfrac{P(Y=8|X=1) \times P(X=1)}{P(Y=8)} \\ 151 | &= \dfrac{0.0007}{0.02265} = 0.031 \nonumber 152 | \end{align} 153 | 154 | There is a probability of $ 0.969 $ that the fair coin has been tossed when a sequence with eight heads is 155 | generated and only a probability of $ 0.031 $ that the biased coin was tossed. Obviously, the probability of the fair 156 | coin is much higher. But how much higher? We can take the ratio of the two probabilities. This gives us 157 | $ \nicefrac{0.969}{0.031} \approx 31 $. We can conclude that the fair coin is roughly 31 times more likely to have generated the sequence with 158 | 8 heads than the biased coin. But wait a second, can we maybe find this ratio somewhere else? It turns out that 159 | the ratio of the likelihoods is the same! That is $ \nicefrac{0.0439}{0.0014} \approx 31 $. 160 | 161 | We started out by assuming that both coins were equally likely to be used. However, we then observed a sequence of 10 tosses, 8 of 162 | which were heads and that made it roughly 31 times more likely that the fair coin was used. What if the priors had not been equal? 163 | Actually, there is a more general story: While calculating the actual probabilities involves a lot of number crunching, just telling whether or not an observation will make one or the other event more likely is not too hard. [For the rest of this chapter, we assume that we only condition on events with non-zero probabilities such as $P(Y=y)>0$ so that we are never dividing by 0]. 164 | \begin{align*} 165 | \frac{P(X=x_{1}|Y=y)}{P(X=x_{2}|Y=y)} &= \frac{\dfrac{P(Y=y|X=x_{1})P(X=x_{1})}{P(Y=y)}}{\dfrac{P(Y=y|X=x_{2})P(X=x_{2})}{P(Y=y)}} \\[1em] 166 | &= \frac{P(Y=y|X=x_{1})P(X=x_{1})}{P(Y=y|X=x_{2})P(X=x_{2})} 167 | \end{align*} 168 | 169 | From the above equalities, we see that the ratio of the posterior probabilities is determined by the ratio of the likelihood times the 170 | prior. In our coin example, the priors were the same so it was only the likelihood that mattered.
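To reproduce these numbers without the intermediate rounding, here is a short R sketch of the whole computation (our addition):
\begin{verbatim}
theta <- c(0.5, 0.3)                 # fair coin, biased coin
prior <- c(0.5, 0.5)
lik   <- dbinom(8, size = 10, prob = theta)   # about 0.0439 and 0.0014
marginal  <- sum(lik * prior)                 # P(Y = 8), about 0.0227
posterior <- lik * prior / marginal           # Bayes' rule
posterior                            # about 0.97 and 0.03
posterior[1] / posterior[2]          # about 30.4 exactly; the 31 above
                                     # comes from the rounded values
\end{verbatim}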
If the ratio of any of the above 171 | terms is greater than 1, the posterior will change in favour of $ X=x_{1} $. If the ratio is smaller than 1 the posterior changes 172 | in favour of $ X=x_{2} $. If the ratio is exactly 1, the posterior stays unchanged. 173 | 174 | Notice that in general, although our observations may shift the posterior in favour of $ X=x_{2} $, say, this shift does not necessarily imply that 175 | $ P(X=x_{2}|Y=y) $ will be greater than $ P(X=x_{1}|Y=y) $. The condition that $ P(X=x_{2}|Y=y) $ is bigger than $ P(X=x_{1}|Y=y) $ can be rewritten as follows 176 | \begin{align*} 177 | P(X=x_{1}|Y=y) &< P(X=x_{2}|Y=y) &\Leftrightarrow \\ 178 | \dfrac{P(Y=y|X=x_{1})P(X=x_{1})}{P(Y=y)} &< \dfrac{P(Y=y|X=x_{2})P(X=x_{2})}{P(Y=y)} &\Leftrightarrow \\ 179 | P(Y=y|X=x_{1})P(X=x_{1}) &< P(Y=y|X=x_{2})P(X=x_{2}) &\Leftrightarrow \\ 180 | \dfrac{P(Y=y|X=x_{1})}{P(Y=y|X=x_{2})} &< \dfrac{P(X=x_{2})}{P(X=x_{1})} 181 | \end{align*} 182 | 183 | The last line is of particular interest as it elucidates the relationship between the prior and the likelihood. Only if the likelihood 184 | ratio for $ X=x_{1} $ over $ X=x_{2} $ is smaller than the reversed prior ratio will the posterior probability of $ X=x_{2} $ 185 | be greater than that of $ X=x_{1} $. This means that if we have strongly asymmetric priors (like $ P(X=x_{1}) = 0.9 $ 186 | and $ P(X=x_{2}) = 0.1 $), the likelihood needs to discriminate very well between the two cases in order to tip the scale in 187 | favour of $ X=x_{2} $. In that sense the prior and the likelihood can be seen as battling forces whose equilibrium gives us 188 | the posterior. 189 | 190 | But enough theory about Bayes' rule, it is about time you apply it! To that end, we present you an exercise that is, in some variation, 191 | contained in virtually every textbook on probability theory, statistics or machine learning. Have fun with it! 192 | 193 | \begin{Exercise} 194 | A random person walks into the doctor's office to be tested for a particular disease. The disease can be fatal if not treated. However, 195 | successful treatment is possible if the disease is discovered early enough. It is commonly known that the disease occurs in 1 out 196 | of 1000 people of the country's population. The doctor will administer a test that with a probability of 99\% returns a positive results 197 | if the patient does indeed have the disease. At the same time, the test also returns a positive result in 5\% of the cases where the 198 | patient does not have the disease. After the test has been administered to the patient in question, it returns a positive result. 199 | What is the probability that the patient is infected with the disease? 200 | \\ 201 | Proceed as follows: 202 | \begin{enumerate} 203 | \item Write down a guess for what you think the probability might be (do not consider any math at this point). 204 | \item Calculate that probability. 205 | \item Check whether there is a considerable difference between your initial guess and the calculated probability. Go on to examine 206 | how the different factors have influenced the probability of the patient having the disease. 207 | \end{enumerate} 208 | \end{Exercise} 209 | 210 | Let us finish up this section with some more notation. In many applications of Bayes' rule we only want to know which outcome is 211 | the most likely, without worrying too much about the actual probabilities. 
Likewise, there is a range of situations where we 212 | just want to assign a score to outcomes and do not demand this score to be a probability. Throughout this chapter, 213 | we have repeatedly encountered the following phenomenon: In order to rank the values of an RV according to their probabilities, we do not necessarily need to compute the marginal likelihood since it cancels in all these comparisons anyway. Therefore, you will often see authors stating that 214 | \begin{equation} \label{proportionality} 215 | P(X=x|Y=y) \propto P(Y=y|X=x)P(X=x) 216 | \end{equation} 217 | 218 | This equation reads as ``the posterior is proportional to the product of the likelihood and the prior''. In general, if we have two quantities 219 | $ a $ and $ b $, then by $ a \propto b $ we mean that there is some constant 220 | $ C \in \mathbb{R} \setminus \{0\} $ such that $ a = Cb $. Notice 221 | that the probability distribution is a function and hence we require $ C $ to be the same across the domain of that function (that 222 | is $ C $ should be the same for all values of $ X $). 223 | 224 | \begin{Exercise} 225 | What is the value of $ C $ in Equation~\eqref{proportionality}? 226 | \end{Exercise} 227 | 228 | 229 | 230 | \section{Na\"ive Bayes} 231 | In this section, we introduce a rather crude application of Bayes's rule which is surprisingly successful nonetheless. 232 | Assume that instead of one random variable we are observing a sequence of random variables. Thus our problem is the following: 233 | \begin{equation} 234 | P(Y=y|X_{1}^{n}=x_{1}^{n}) \propto P(X_{1}^{n}=x_{1}^{n}|Y=y) \times P(Y=y) 235 | \end{equation} 236 | 237 | By the chain rule we can decompose the right-hand side into 238 | \begin{align} 239 | P(Y=y|X_{1}^{n}=x_{1}^{n}) 240 | \propto &P(X_{1}=x_{1}|Y=y) \times \ldots \nonumber \\ 241 | &\times P(X_{n}=x_{n}|Y=y,X_{1}^{n-1}=x_1^{n-1}) \times P(Y=y) \nonumber 242 | \end{align} 243 | 244 | We are now going to introduce the aforementioned crudeness into the model by assuming that all $ X_1,\ldots,X_n$ are conditionally independent given $ Y $. Notice that 245 | this is just an assumption that we are making without justification. In fact, it is very likely wrong. However, it makes our 246 | lives much easier because we only have to deal with very simple terms of the form $ P(X_{i}=x_{i}|Y=y) $. Because of the 247 | crudeness of our assumptions, this probabilistic model is known as \textbf{na\"ive Bayes} (sometimes also 248 | stupid Bayes). 249 | 250 | \begin{Definition} 251 | A na\"ive Bayes model is a probabilistic model that assumes 252 | $$ P_{Y|X_{1}^{n}} \propto P_Y P_{X_{1}|Y} P_{X_{2}|Y} \cdots P_{X_{n}|Y} $$ 253 | \end{Definition} 254 | Once we know all the component distributions $P_{X_i|Y}$, calculating the result is pretty straightforward. 255 | 256 | In order to illustrate how na\"ive Bayes works we are going to employ one of its showcase applications where it indeed had 257 | a lot of success in real life. The application we are talking about is text classification. The task is the following: you 258 | are given some documents and for each of the documents you have to assign a label signifying its class. What you consider 259 | a class depends on your actual application setting, but usually classes are broad categories, such as legal texts, medical 260 | texts etc. If you manage to succeed at this task, you can accomplish a lot of things automatically that required humans before. 
For example, you could tag online news with their relevant categories and people who are interested in 261 | a particular category will then have an easier time finding the news related to that category. Crucially, since you will 262 | write a computer program that does the classification for you, you will not need to read any of the texts yourself. This automation will obviously allow you to classify huge quantities of text in a very short amount of time. 263 | 264 | \begin{Exercise} 265 | A collection of text (or any other kind of data for that matter) is often called a \textbf{corpus}. Here we are going to 266 | use a toy corpus. The corpus just consists of two sentences and we assume that each sentence constitutes 267 | a document. 268 | The categories that you can label the documents with are 269 | finance (0), medicine (1) or law (2). You can find the corpus and the pmfs of the distributions below. For simplicity, we are not going to distinguish between lower and upper case words (this is actually common practice). For better 270 | readability, we are also using the actual words instead of their numerical encodings as values for the random 271 | variables. Just remember that those words could also be represented as values of real-valued random variables. To shorten notation, we 272 | will use pmfs. If the probability of a word given a category is not specified, take it to be 0. 273 | 274 | 275 | Your task is to classify these two documents correctly using a Na\"ive Bayes Model that conditions each 276 | word's probability on the document class. Please also report the posterior probability for the correct label. (A short R sketch for checking your calculation follows the distributions below.) 277 | \end{Exercise} 278 | 279 | \newpage 280 | \textbf{The corpus:} 281 | \begin{itemize} 282 | \item a fact has been revealed 283 | \item the doctor's judgement has not been reliable 284 | \end{itemize} 285 | 286 | \textbf{The document category pmfs:} 287 | \begin{itemize} 288 | \item $ p(0) = 0.3 $ 289 | \item $ p(1) = 0.2 $ 290 | \item $ p(2) = 0.5 $ 291 | \end{itemize} 292 | 293 | \textbf{The lexical distribution for document category finance (0):} 294 | \begin{align*} 295 | &p(\mathit{a}|0) = 0.19~~p(\mathit{fact}|0)= 0.14~~p(\mathit{has}|0)=0.13~~p(\mathit{been}|0)=0.12 \\ 296 | &p(\mathit{revealed}|0)=0.04~~p(\mathit{the}|0)=0.21~~p(\mathit{doctor's}|0)=0.03 \\ 297 | &p(\mathit{judgement}|0)=0~~p(\mathit{not}|0)=0.11~~p(\mathit{reliable}|0)=0.03 298 | \end{align*} 299 | 300 | \textbf{The lexical distribution for document category medicine (1):} 301 | \begin{align*} 302 | &p(\mathit{a}|1) = 0.02~~p(\mathit{fact}|1)= 0.08~~p(\mathit{has}|1)=0.13~~p(\mathit{been}|1)=0.13 \\ 303 | &p(\mathit{revealed}|1)=0.01~~p(\mathit{the}|1)=0.18~~p(\mathit{doctor's}|1)=0.06 \\ 304 | &p(\mathit{judgement}|1)=0.14~~p(\mathit{not}|1)=0.20~~p(\mathit{reliable}|1)=0.05 305 | \end{align*} 306 | 307 | \textbf{The lexical distribution for document category law (2):} 308 | \begin{align*} 309 | &p(\mathit{a}|2) = 0.18~~p(\mathit{fact}|2)= 0.03~~p(\mathit{has}|2)=0.05~~p(\mathit{been}|2)=0.13 \\ 310 | &p(\mathit{revealed}|2)=0.10~~p(\mathit{the}|2)=0.14~~p(\mathit{doctor's}|2)=0.06 \\ 311 | &p(\mathit{judgement}|2)=0.07~~p(\mathit{not}|2)=0.08~~p(\mathit{reliable}|2)=0.16 312 | \end{align*} 313 |
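Here is the promised R sketch for checking your hand computation (our addition; the lexical pmfs are simply typed in from the tables above):
\begin{verbatim}
priors <- c(0.3, 0.2, 0.5)   # finance (0), medicine (1), law (2)
words  <- c("a", "fact", "has", "been", "revealed",
            "the", "doctor's", "judgement", "not", "reliable")
lex <- rbind(
  c(0.19, 0.14, 0.13, 0.12, 0.04, 0.21, 0.03, 0.00, 0.11, 0.03),  # finance
  c(0.02, 0.08, 0.13, 0.13, 0.01, 0.18, 0.06, 0.14, 0.20, 0.05),  # medicine
  c(0.18, 0.03, 0.05, 0.13, 0.10, 0.14, 0.06, 0.07, 0.08, 0.16))  # law
colnames(lex) <- words
posterior <- function(doc) {
  scores <- priors * apply(lex[, doc, drop = FALSE], 1, prod)
  scores / sum(scores)   # normalise: the marginal likelihood cancels
}
posterior(c("a", "fact", "has", "been", "revealed"))
posterior(c("the", "doctor's", "judgement", "has", "not", "been", "reliable"))
\end{verbatim}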
314 | \section*{Further Reading} 315 | Here, we have only scratched the surface of what Bayes' rule allows us to do. To get a wider outlook on what else is possible, 316 | you can consult \href{http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html}{Kevin Murphy's webpage}. 317 | 318 | %%% Local Variables: 319 | %%% mode: latex 320 | %%% TeX-master: "chapter4" 321 | %%% End: 322 | -------------------------------------------------------------------------------- /chapter7/chapter7_forInclude.Rnw: -------------------------------------------------------------------------------- 1 | \chapter{Basics of Information Theory} 2 | 3 | When we talk about \textit{information}, we often use the term in a qualitative sense. We say things like 4 | \textit{This is valuable information} or 5 | \textit{We have a lack of information}. We can also make statements about some information being more helpful than other information. For a long time, however, 6 | people were unable to quantify information. The person who succeeded in this endeavour was \href{https://en.wikipedia.org/wiki/Claude_Shannon}{Claude E. Shannon}, 7 | who with his famous 1948 article \textit{A Mathematical Theory of Communication} single-handedly created a new discipline: Information Theory! He also revolutionised 8 | digital communication and can be seen as one of the main contributors to our modern communication systems such as the telephone and the internet. 9 | 10 | The beauty of information theory is that it is based on probability theory, and many results from probability theory seamlessly carry over to information theory. 11 | In this chapter, we are going to discuss the bare basics of information theory. These basics are often enough to understand many information-theoretic arguments 12 | that researchers make in fields like computer science, psychology and linguistics. 13 | 14 | \section{Surprisal and Entropy} 15 | Shannon's idea of information is as simple as it is compelling. The amount of \emph{surprisal} of an event $E$ is based on the inverse probability $1/P(E)$. Intuitively, rare events (where $P(E)$ is small) are more surprising than those occurring with high probability (where $P(E)$ is high). If we are observing a realisation of a random variable, this realisation is surprising if it is unlikely to occur according to the distribution of that random variable. However, if the probability for the realisation is very low, then on average it does not occur very often, meaning that if we sample from the RV repeatedly, we are not surprised very often. We are not surprised when the probability mass of the distribution is concentrated on only a small subset of its support. 16 | 17 | On the other hand, we quite often are surprised if we cannot predict what the outcome of our next draw from the RV might be. We are surprised when the distribution over values of the RV is (close to) uniform. Thus, we are going to be most surprised on average if we are observing realisations of a uniformly distributed RV. 18 | 19 | Shannon's idea was that observing RVs that cause a lot of surprises is informative because we cannot predict the outcomes and with each new outcome we have effectively learned something (namely that the $ i^{th} $ outcome took on the value that it did). Observing RVs with very concentrated distributions is not very informative under this conception because by just choosing the most probable outcome we can correctly predict most actually observed outcomes. Obviously, if I manage to predict an outcome beforehand, its occurrence is not teaching me anything. 20 | 21 | The goal of Shannon was to find a function that captures this intuitive idea. He eventually found it and showed that it is the only function to have properties that encompass the intuition.
This function is called the \textbf{entropy} of a RV and it is simply the expected \textbf{surprisal} value, expressed in bits. 22 | 23 | \begin{Definition}[Surprisal] 24 | The surprisal (value) of an outcome $ x \in \supp(X) $ of some RV $ X 25 | $ is defined as $ -\log_{2}(P(X=x)) = \log_2(\frac{1}{P(X=x)})$. 26 | \end{Definition} 27 | 28 | Notice that we are using the logarithm of base 2 here. This is because surprisal and entropy are standardly measured in bits. Intuitively, the surprisal measures how many bits one needs to encode an observed outcome given that one knows the distribution underlying that outcome. Check \href{http://www.umsl.edu/~fraundorfp/egsurpriNOLOGS.html}{this website} to get a feeling for surprisal values measured in bits. 29 | 30 | \begin{Definition}[Entropy] 31 | The entropy $H(P_X)$ of a RV $ X $ with distribution $P_X$ is defined as 32 | $$H(P_X) := \E[-\log_{2}(P(X=x))] = - \!\! \sum_{x \in \supp(X)} P(X=x) \log_2(P(X=x)) \, .$$ 33 | For the ease of notation, we often write $H(X)$ instead of $H(P_X)$. 34 | \end{Definition} 35 | 36 | The notational convenience of writing $H(X)$ instead of $H(P_X)$ can be confusing, because entropy is really assigning a (non-negative) real number to a distribution, i.e.\ $H(X)$ is {\bf not a function} of the random variable $X$ and it is {\bf not a random variable} either! Formally, for any random variable $X$ with distribution $P_X$ over the set $\mathcal{X}=\supp(X)$ (which might be categorical, i.e.\ $X$ could for instance take on values ``blue'', ``red'' and ``green''), we consider the surprisal function (in bits) $f(x) := -\log_2(P(X=x))$ mapping elements $x \in \mathcal{X}$ to real numbers $f(x) \in \mathbb{R}$. In that case, the surprisal $f(X)$ is a random variable over the reals and its expected value is well defined and called entropy $H(X) = H(P_X) := \E_X[f(X)]$. 37 | 38 | As an example, we consider the categorical random variable $X$ with distribution $P(X=\varheart)=P(X=\clubsuit)=1/4, P(X=\spadesuit)=1/2$. In that case, $\supp(X) = \{\varheart, \clubsuit, \spadesuit \}$ and surprisal values in bits are $f(\varheart)=f(\clubsuit)=\log_2(4)=2, f(\spadesuit)=\log_2(2)=1$. The entropy is the expected surprisal value, i.e.\ the individual surprisal values weighted with their corresponding probabilities of occurring: $H(X) = \E_X[f(X)] = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 + \frac{1}{2} \cdot 1 = 3/2$. 39 | 40 | The entropy ``does not care'' about the actual outcomes or labels of a random variable, but only about the distribution! In fact, not even the order of the actual probabilities matters, as we are taking an expected value and the additive terms commute. You can verify that the calculation of $H(X)=3/2$ in the example above applies to all random variables $X$ with distribution $(1/2, 1/4, 1/4)$, no matter what the actual outcomes are. 41 | 42 | \begin{Exercise} 43 | Compute the entropy of $Y \sim Binomial(n=2,p=1/2)$. 44 | \end{Exercise} 45 | 46 | The simplest and simultaneously most important example of entropy is given in Figure~\ref{fig:binaryEntropy} which shows the entropy of the Bernoulli distribution as a function of the parameter $ \theta \in [0,1]$. The entropy function of the Bernoulli is often called the \textbf{binary entropy} $h(\theta) := -\theta \cdot \log_2(\theta) - (1-\theta) \log_2(1-\theta)$. It measures the information of a binary decision, like a coin flip or an answer to a yes/no-question.
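As a small computational aside (a chunk we add here; the helper name \texttt{entropy} is of course arbitrary), computing the entropy of any finite distribution in R is a one-liner:
<<entropySketch, eval=FALSE>>=
# entropy (in bits) of a finite pmf, with the convention 0 * log(0) = 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}
entropy(c(1/2, 1/4, 1/4))  # 1.5, the card-suit example above
entropy(rep(1/8, 8))       # log2(8) = 3 bits for a uniform distribution
@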
47 | The entropy of the Bernoulli attains its maximum of 1 bit when the distribution is uniform, i.e.\ when both choices are equally 48 | probable. The entropy is 0 if and only if the coin is fully biased towards heads or tails. As explained above, the entropy of the distributions $(\theta, 1-\theta)$ and $(1-\theta,\theta)$ is the same and therefore $h(\theta)=h(1-\theta)$ and the graph is symmetric around $1/2$. 49 | 50 | <<binaryEntropy, echo=FALSE, fig.cap="Binary entropy function", fig.pos="t!">>= 51 | x = seq(0,1,.001) 52 | y = -(x*log2(x)+(1-x)*log2(1-x)) 53 | plot(x,y,ylab=expression(h(theta)), xlab=expression(theta),type="l") 54 | @ 55 | 56 | \medskip 57 | From the plot it is also easy to see that entropy is never negative. It holds in general that entropy is non-negative, 58 | because entropy is defined as the expectation of surprisal and surprisal is the negative logarithm of probabilities. 59 | Because $ \log(x) \leq 0 $ for $ x \in (0,1] $, it is clear that $ -\log(x) \geq 0 $ for $ x $ in the same 60 | interval. Notice that from here on we drop the subscript and by convention let $ \log = \log_{2} $. 61 | 62 | A standard interpretation of the entropy is that it quantifies uncertainty. As we have pointed out before, a uniform distribution means that you are most uncertain and indeed the uniform distribution maximizes the entropy. However, the more choices you have to pick from uniformly, the more uncertain you are going to be. The entropy function also captures this intuition. Notice that if a discrete distribution is uniform, all probabilities are $ \frac{1}{|\supp(X)|} $. Clearly, as we increase $ |\supp(X)| $, we decrease the probabilities. By decreasing the probabilities, we increase their negative logarithms, and hence their average surprisal. Let us make this intuition more formal. 63 | 64 | \begin{Theorem} 65 | A discrete RV $ X $ with uniform distribution and support of size $ n $ has entropy 66 | $ H(X) = \log(n) $. 67 | \end{Theorem} 68 | 69 | \paragraph{Proof:} 70 | \begin{align} 71 | H(X) &= \underset{x \in \supp(X)}{\sum}-\log(P(X=x))P(X=x) \\ 72 | &= \underset{x \in \supp(X)}{\sum} -\log(\frac{1}{|\supp(X)|})P(X=x) \\ 73 | &= \underset{x \in \supp(X)}{\sum}\log(n)P(X=x) = \log(n) \, . 74 | \hspace{1cm} \square 75 | \end{align} 76 | 77 | \begin{Exercise} 78 | You are trying to learn chess and you start by studying where chess grandmasters move their king when it 79 | is positioned in one of the middle fields of the board. The king can move to any of the 8 adjoining fields. Since 80 | you do not know a thing about chess yet, you assume that each move is equally probable. In this situation, 81 | what is the entropy of moving the king? 82 | \end{Exercise} 83 | 84 | One of the first important results in information theory is \href{https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem}{Shannon's source-coding theorem} which states that the entropy $H(X)$ of a random variable $X$ measures how many bits one will need on average to encode an outcome that is generated by the distribution $ P_{X} $. 85 | This result applies to the real-world problem of data compression. Assume that $N$ data points are drawn iid from the distribution $P_X$. In that case, the source-coding theorem tells us that on average, we will need $N \cdot H(X)$ bits to store the (optimally compressed) data. For example, let $P_X$ be the $Bernoulli(\theta)$ distribution over bits. In the case $\theta=1/2$, we have $N$ perfectly random bits which cannot be compressed, and hence we need $N \cdot H(X) = N \cdot h(\theta) = N \cdot h(1/2) = N$ bits of storage. For the general case $\theta \neq 1/2$ when the individual bits are biased, the graph of the binary entropy $h(\theta)$ in Figure~\ref{fig:binaryEntropy} tells us exactly what the compression ratio will be. We will not cover the proof of the source-coding theorem here, but refer to the literature instead. 86 | 87 |
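As a quick numerical illustration (a sketch we add, with an arbitrary choice of $N$):
<<compressionSketch, eval=FALSE>>=
h <- function(t) -t * log2(t) - (1 - t) * log2(1 - t)  # binary entropy
N <- 1e6                 # one million iid bits
N * h(0.5)               # fair bits: 1,000,000 bits, no compression possible
N * h(0.1)               # biased bits: about 469,000 bits, a ratio of ~0.47
@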
88 | \section{Conditional Entropy} 89 | At the outset of this chapter we promised you that you could easily transfer results from probability 90 | theory to information theory. We will not be able to show any kind of linearity for entropy because it contains 91 | log-terms and the logarithm is not linear. We can however find alternative expressions for joint entropy (where 92 | the joint entropy is simply the entropy of a joint RV). Before we do so, let us also define the notion of 93 | conditional entropy. We have seen in Section~\ref{sec:jointconditionaldistributions} that $P_{X|Y=y}$ is a valid probability distribution for any $y \in \supp(Y)$ such that $P(Y=y)>0$. Hence, we can also define its conditional entropy. 94 | 95 | \begin{Definition}[Conditional Entropy] 96 | For two jointly distributed RVs $ X,Y $ and $y \in \supp(Y)$ such that $P(Y=y)>0$, the conditional entropy of $ X $ given that $ Y=y $ is defined as 97 | \begin{align*} 98 | H(X | Y=y) &:= \E_X[-\log_{2}(P(X=x | Y=y))] \\ 99 | &= - \!\! \sum_{x \in \supp(X)} P(X=x | Y=y) \log_2(P(X=x | Y=y))\, . 100 | \end{align*} 101 | The conditional entropy of $X$ given $Y$ is defined as 102 | $$ H(X | Y) := \E_Y[ H(X | Y=y) ] = \sum_{y \in \supp(Y)} P(Y=y) H(X | Y=y) \, .$$ 103 | \end{Definition} 104 | 105 | Intuitively, $H(X | Y)$ is the (average) uncertainty of $X$ after learning $Y$. Moreover, learning $Y$ (and in fact any information) cannot increase your uncertainty about $X$. Formally, one can prove the following 106 | \begin{Lemma}[see e.g.\ Proposition~4 of \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/2016/notes/CramerFehr.pdf}{this script}] \label{lemma:noincrease} 107 | For any two random variables $X,Y$ with joint distribution $P_{XY}$, it holds that $H(X | Y) \leq H(X)$. 108 | \end{Lemma} 109 | Note, however, that this non-increase of uncertainty only holds on average, as illustrated by the following example: 110 | 111 | \paragraph{Example} 112 | Consider the binary random variables $X$ and $Y$, with joint distribution 113 | \begin{align*} 114 | &P(X=0,Y=0) = \frac{1}{2}, \quad P(X=0,Y=1) = \frac{1}{4}\\ 115 | &P(X=1,Y=0) = 0, \quad P(X=1,Y=1) = \frac{1}{4}. 116 | \end{align*} 117 | By marginalization, we find that $P(X=0) = \frac{3}{4}$ and $P(X=1) = \frac{1}{4}$, while $P(Y=0) = P(Y=1) = \frac{1}{2}$. This allows us to make the following computations: 118 | \begin{align*} 119 | H(X,Y) &= \frac{1}{2}\log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2}\\ 120 | H(X) &= h\left(\frac{1}{4}\right) = h\left(\frac{3}{4}\right) \approx 0.81\\ 121 | H(Y) &= h\left(\frac{1}{2}\right) = 1\\ 122 | H(X|Y) &= P(Y=0) \cdot H(X | Y=0) + P(Y=1) \cdot H(X | Y=1)\\ 123 | &= \frac{1}{2} \cdot 0 + \frac12 \cdot 1 = \frac12 \\ 124 | H(Y|X) &= P(X=0) \cdot H(Y | X=0) + P(X=1) \cdot H(Y | X=1)\\ 125 | &= \frac{3}{4} \cdot h\left(\frac{1}{3} \right) + \frac{1}{4} \cdot 0 \approx 0.69 126 | \end{align*} 127 | % We also could have computed $H(X|Y)$ and $H(Y|X)$ directly through the definition of conditional entropy. 128 | Note that for this specific distribution, learning the outcome $Y=1$ increases the uncertainty about $X$, $H(X|Y=1) > H(X)$, but on average, we always have $H(X|Y) \leq H(X)$.
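If you want to verify these numbers mechanically, here is a small R sketch (our addition):
<<condEntropySketch, eval=FALSE>>=
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
pxy <- matrix(c(1/2, 1/4,
                0,   1/4), nrow = 2, byrow = TRUE)  # rows: X = 0,1; cols: Y = 0,1
entropy(as.vector(pxy))  # H(X,Y) = 1.5
entropy(rowSums(pxy))    # H(X), about 0.81
entropy(colSums(pxy))    # H(Y) = 1
py <- colSums(pxy)
sum(py * apply(pxy, 2, function(col) entropy(col / sum(col))))  # H(X|Y) = 0.5
@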
It is important to remember that Lemma~\ref{lemma:noincrease} only holds on average, not for specific values of $Y$. Note also that in this example, $H(X|Y) \neq H(Y|X)$. 129 | 130 | It is not a coincidence that the joint entropy $H(X,Y)$ in the example above is equal to $H(X|Y)+H(Y)$ and $H(Y|X)+H(X)$. One can prove this chain rule in general: 131 | 132 | \begin{align*} 133 | H(X,Y) &= \underset{\substack{x \in \supp(X)\\y \in \supp(Y)}}{\sum} -\log(P(X=x,Y=y)) \times P(X=x, Y=y) \\ 134 | \begin{split} 135 | &= \underset{\substack{x \in \supp(X)\\ y \in \supp(Y)}}{\sum} -\log(P(X=x \mid Y=y)) \times P(X=x,Y=y) \\ 136 | &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times \sum_{x \in \supp(X)} P(X=x,Y=y) 137 | \end{split} \\ 138 | \begin{split} 139 | &=\sum_{y \in \supp(Y)} P(Y=y) \times \sum_{x \in \supp(X)} -\log(P(X=x \mid Y=y)) \times P(X=x \mid Y=y) \\ &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times P(Y=y) 140 | \end{split} \\ 141 | &= H(X | Y) + H(Y) \; . 142 | \end{align*} 143 | 144 | \begin{Exercise} 145 | Prove that $ H(X,Y | Z) = H(X | Z) + H(Y | Z) $ if $ X \bot Y \mid Z $. 146 | \end{Exercise} 147 | As a corollary, we get that $H(X,Y)=H(X)+H(Y)$ for independent random variables $X$ and $Y$. More generally, the entropy of $n$ independent random variables is $H(X_1^n) = \sum_{i=1}^n H(X_i)$. 148 | 149 | 150 | \section{An Information-Theoretic View on EM} 151 | Now that we have seen some information-theoretic concepts, you may be happy to hear that there is an information-theoretic interpretation 152 | of EM. This interpretation helps us to get a better intuition for the algorithm. To formulate that interpretation we need 153 | one more concept, however. 154 | 155 | \begin{Definition}[Relative Entropy] 156 | The relative entropy of RVs \\ $ X,Y $ with distributions $P_X, P_Y$ and $\supp(X) \subseteq \supp(Y) $ is defined as 157 | $$ D(P_X||P_Y) := \sum_{x \in \supp(X)} P(X=x) \log \frac{P(X=x)}{P(Y=x)} \ . $$ 158 | If $ P(Y=x) = 0 $ for any $ x \in \supp(X) $ we define $ D(P_X||P_Y) = \infty $. As with entropy, we often abbreviate $D(P_X||P_Y)$ with $D(X||Y)$. 159 | \end{Definition} 160 | 161 | The relative entropy is commonly known as \textbf{Kullback-Leibler (KL)} divergence. It compares the distribution of $ X $ to that of $ Y $. Intuitively, 162 | it gives a measure of how ``far away'' $ P_{X} $ is from $ P_{Y} $. To 163 | understand ``far away'', recall that entropy is a measure of 164 | uncertainty. 165 | % The 166 | % relative entropy measure the uncertainty that you have about $ P_{X} $ if you know $ P_{Y} $\chris{hard to see why at this point}. 167 | This uncertainty is low if both distributions place most 168 | of their mass on the same outcomes. Since $ \log(1) = 0 $ the relative entropy is 0 if $ P_{X} = P_{Y} $. 169 | 170 | It is worthwhile to point out the difference between relative and conditional entropy. Conditional entropy is the average entropy of $ X $ given that you 171 | know what value $ Y $ takes on. In the case of relative entropy you do not know the value of $ Y $, only its distribution. 172 | 173 | \begin{Exercise} 174 | Show that $ D(X,Y||Y) = -H(X | Y) $. Furthermore show that $ D(X,Y||Y) = -H(X) $ if $ X\bot Y $. 175 | \end{Exercise} 176 | 177 |
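Here is a small R sketch of the definition (our addition; the two distributions are arbitrary examples):
<<klSketch, eval=FALSE>>=
kl <- function(px, py) {
  if (any(py == 0 & px > 0)) return(Inf)  # the convention from the definition
  s <- px > 0
  sum(px[s] * log2(px[s] / py[s]))
}
kl(c(1/2, 1/4, 1/4), c(1/3, 1/3, 1/3))  # positive: the distributions differ
kl(c(1/2, 1/4, 1/4), c(1/2, 1/4, 1/4))  # zero, since P_X = P_Y
@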
178 | Let us start by remembering why we need EM. We have a model that defines a joint distribution 179 | over observed ($ x $) and latent data ($ y $). Such a model generally looks as follows: 180 | \begin{equation} 181 | P(X=x, Y=y \mid \Theta = \theta) = P(X=x \mid Y=y, \Theta=\theta) P(Y=y \mid \Theta = \theta) 182 | \end{equation} 183 | where we have chosen a factorization that provides a separate term for a distribution over only the 184 | latent data. 185 | 186 | Recall that the goal of the EM algorithm is to iteratively increase the likelihood through consecutive 187 | updates of parameter estimates. These updates are achieved through maximum-likelihood estimation based 188 | on expected sufficient statistics. We are now going to show a) that EM computes a lower bound on the 189 | marginal log-likelihood of the data in each iteration and b) that this lower bound becomes tight when the 190 | expected sufficient statistics are taken with respect to the model posterior. The latter implies that 191 | EM performs the optimal update in each iteration. 192 | 193 | Let us start by expanding the data log-likelihood and then lower-bounding it. 194 | \begin{align} 195 | &\log(P(X=x \mid \Theta=\theta)) = \log(\sum_y P(X=x, Y=y \mid \Theta = \theta)) \\ 196 | &= \log\left(\sum_{y} Q(Y=y \mid \Phi=\phi)\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\ 197 | &\geq \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) 198 | \label{eq:ELBO1} 199 | \end{align} 200 | Here, we have used \href{https://en.wikipedia.org/wiki/Jensen\%27s_inequality}{Jensen's Inequality} to 201 | derive the lower bound. Observe that the log is indeed a concave function. 202 | 203 | We have also introduced 204 | an auxiliary distribution $ Q $ over the latent variables with parameters $ \phi $. 205 | For reasons that we will explain shortly, 206 | this distribution is often called the \textbf{variational distribution} and its parameters the 207 | \textbf{variational parameters}. The letter $ Q $ is slightly non-standard to denote distributions but 208 | we are following conventions from the field of \textbf{variational inference} here. 209 | 210 | In the next step, we factorise the model distribution in order to recover a KL divergence term between 211 | the variational distribution and the model posterior over latent variables. 212 | \begin{align} 213 | &\sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\ 214 | &= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)P(X=x \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\ 215 | &= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) + \log(P(X=x \mid \Theta=\theta)) \\ 216 | &= -D(Q||P) + \log(P(X=x \mid \Theta=\theta)) \label{eq:ELBO2} 217 | \end{align} 218 | Equation~\eqref{eq:ELBO2} gives us two insights. First, it quantifies the gap between the lower bound 219 | and the actual data likelihood. This gap is equal to the KL divergence between the variational distribution 220 | and the model posterior over latent variables. Second, since the KL divergence is non-negative and zero exactly when the two distributions coincide, the bound only becomes 221 | tight when $ P=Q $. But this is exactly what is happening in the E-step! The E-step sets $ P=Q $ and 222 | then computes expectations under that distribution (see Equation~\eqref{eq:ELBO1}). Thus, the E-step increases 223 | the lower bound on the marginal log-likelihood.
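Before turning to the M-step, here is a tiny numerical sketch of the bound (our addition; the joint over three latent values at one fixed observation $x$ is made up):
<<elboSketch, eval=FALSE>>=
pxy  <- c(0.3, 0.1, 0.2)   # p(x, y | theta) for y = 1, 2, 3 at a fixed x
px   <- sum(pxy)           # marginal likelihood p(x | theta) = 0.6
post <- pxy / px           # posterior p(y | x, theta)
elbo <- function(q) sum(q * log(pxy / q))  # the lower bound from above
log(px)                    # the true log-likelihood
elbo(post)                 # equal: the bound is tight when Q is the posterior
elbo(c(1/3, 1/3, 1/3))     # strictly smaller; the gap is the KL divergence
@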
224 | 225 | Looking back at Equation~\eqref{eq:ELBO1}, we also see that the M-step increases the lower bound because 226 | it maximises $ \E\left[\log(P(X=x, Y=y\mid \Theta = \theta))\right] $. We conclude that both steps 227 | are increasing the lower bound on the log-likelihood. It follows that EM increases the data likelihood 228 | in every iteration (or leaves it unchanged at worst). 229 | 230 | We will finish with a quick aside on variational inference. EM is a special case of variational inference. 231 | Variational inference is any inference procedure which uses an auxiliary distribution $ Q $ to compute 232 | a lower bound on the likelihood. In the general setting, the auxiliary distribution can be different from the 233 | model posterior. This means that the bound need not be tight. However, in models in which the exact posterior 234 | is hard (read: impossible) to compute, using a non-tight lower bound instead can be incredibly useful! 235 | 236 | The reason this inference procedure is called \textit{variational} is that it is based on the 237 | \href{https://en.wikipedia.org/wiki/Calculus_of_variations}{calculus of variations}. This works mostly 238 | like normal calculus except that standard operations like differentiation are done with respect to functions 239 | instead of variables. 240 | 241 | %Naively, we could take the expectation with respect to any distribution 242 | %over latent values. Obviously, we would like to find the best one, i.e. the one that is closest to the 243 | %actual posterior. We can formalize this by introducing an auxiliary distribution\footnote{We follow 244 | %standard notation here by denoting the auxiliary distribution $ Q $ instead of $ P $. Also, the 245 | %parameter variable is chosen so as to distinguish it from the parameter variable of our model.} 246 | %$ Q(z\mid\Phi=\phi) $ under 247 | %which we compute the expected sufficient statistics. We want to find the auxiliary distribution that 248 | %is closest to actual posterior $ P_{Z\midX=x,\Theta=\theta} $. We measure closeness in an information-theoretic 249 | %sense using KL-divergence. Formally, our goal is to find 250 | %\begin{equation} 251 | %Q^{*}_{Z\mid\Phi=\phi} = \underset{Q_{Z\mid\Phi=\phi}}{\mbox{arg min}}~D\left( Q_{Z\mid\Phi=\phi} || P_{Z \mid X=x,\Theta=\theta} \right) \ . 252 | %\end{equation} 253 | 254 | 255 | 256 | \section*{Further Material} 257 | 258 | At the ILLC, there is a whole course about information theory, \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/}{currently taught by Christian Schaffner}. David MacKay also offers \href{http://www.inference.phy.cam.ac.uk/itprnn/book.pdf}{a free book on the subject}. Finally, 259 | Coursera also offers \href{https://www.coursera.org/course/informationtheory}{an online course on information theory}. 260 | 261 | The information-theoretic formulation of EM was pioneered in this \href{http://www.cs.toronto.edu/~fritz/absps/emk.pdf}{paper}. A very recent and intelligible 262 | \href{https://arxiv.org/abs/1601.00670}{tutorial on variational inference} can be found on the arXiv.
263 | 264 | \end{document} 265 | 266 | %%% Local Variables: 267 | %%% mode: latex 268 | %%% TeX-master: "chapter7" 269 | %%% End: 270 | -------------------------------------------------------------------------------- /chapter7/chapter7_forInclude.tex: -------------------------------------------------------------------------------- 1 | \chapter{Basics of Information Theory} 2 | 3 | When we talk about \textit{information}, we often use the term in a qualitative sense. We say things like 4 | \textit{This is valuable information} or 5 | \textit{We have a lack of information}. We can also make statements about some information being more helpful than other information. For a long time, however, 6 | people were unable to quantify information. The person who succeeded in this endeavour was \href{https://en.wikipedia.org/wiki/Claude_Shannon}{Claude E. Shannon}, 7 | who with his famous 1948 article \textit{A Mathematical Theory of Communication} single-handedly created a new discipline: Information Theory! He also revolutionised 8 | digital communication and can be seen as one of the main contributors to our modern communication systems such as the telephone and the internet. 9 | 10 | The beauty of information theory is that it is based on probability theory, and many results from probability theory seamlessly carry over to information theory. 11 | In this chapter, we are going to discuss the bare basics of information theory. These basics are often enough to understand many information-theoretic arguments 12 | that researchers make in fields like computer science, psychology and linguistics. 13 | 14 | \section{Surprisal and Entropy} 15 | Shannon's idea of information is as simple as it is compelling. The amount of \emph{surprisal} of an event $E$ is based on the inverse probability $1/P(E)$. Intuitively, rare events (where $P(E)$ is small) are more surprising than those occurring with high probability (where $P(E)$ is high). If we are observing a realisation of a random variable, this realisation is surprising if it is unlikely to occur according to the distribution of that random variable. However, if the probability for the realisation is very low, then on average it does not occur very often, meaning that if we sample from the RV repeatedly, we are not surprised very often. We are not surprised when the probability mass of the distribution is concentrated on only a small subset of its support. 16 | 17 | On the other hand, we quite often are surprised if we cannot predict what the outcome of our next draw from the RV might be. We are surprised when the distribution over values of the RV is (close to) uniform. Thus, we are going to be most surprised on average if we are observing realisations of a uniformly distributed RV. 18 | 19 | Shannon's idea was that observing RVs that cause a lot of surprises is informative because we cannot predict the outcomes and with each new outcome we have effectively learned something (namely that the $ i^{th} $ outcome took on the value that it did). Observing RVs with very concentrated distributions is not very informative under this conception because by just choosing the most probable outcome we can correctly predict most actually observed outcomes. Obviously, if I manage to predict an outcome beforehand, its occurrence is not teaching me anything. 20 | 21 | The goal of Shannon was to find a function that captures this intuitive idea. He eventually found it and showed that it is the only function to have properties that encompass the intuition.
This function is called the \textbf{entropy} of a RV and it is simply the expected \textbf{surprisal} value, expressed in bits. 22 | 23 | \begin{Definition}[Surprisal] 24 | The surprisal (value) of an outcome $ x \in \supp(X) $ of some RV $ X 25 | $ is defined as $ -\log_{2}(P(X=x)) = \log_2(\frac{1}{P(X=x)})$. 26 | \end{Definition} 27 | 28 | Notice that we are using the logarithm of base 2 here. This is because surprisal and entropy are standardly measured in bits. Intuitively, the surprisal measures how many bits one needs to encode an observed outcome given that one knows the distribution underlying that outcome. Check \href{http://www.umsl.edu/~fraundorfp/egsurpriNOLOGS.html}{this website} to get a feeling for surprisal values measured in bits. 29 | 30 | \begin{Definition}[Entropy] 31 | The entropy $H(P_X)$ of a RV $ X $ with distribution $P_X$ is defined as 32 | $$H(P_X) := \E[-\log_{2}(P(X=x))] = - \!\! \sum_{x \in \supp(X)} P(X=x) \log_2(P(X=x)) \, .$$ 33 | For the ease of notation, we often write $H(X)$ instead of $H(P_X)$. 34 | \end{Definition} 35 | 36 | The notational convenience of writing $H(X)$ instead of $H(P_X)$ can be confusing, because entropy is really assigning a (non-negative) real number to a distribution, i.e.\ $H(X)$ is {\bf not a function} of the random variable $X$ and it is {\bf not a random variable} either! Formally, for any random variable $X$ with distribution $P_X$ over the set $\mathcal{X}=\supp(X)$ (which might be categorical, i.e.\ $X$ could for instance take on values ``blue'', ``red'' and ``green''), we consider the surprisal function (in bits) $f(x) := -\log_2(P(X=x))$ mapping elements $x \in \mathcal{X}$ to real numbers $f(x) \in \mathbb{R}$. In that case, the surprisal $f(X)$ is a random variable over the reals and its expected value is well defined and called entropy $H(X) = H(P_X) := \E_X[f(X)]$. 37 | 38 | As an example, we consider the categorical random variable $X$ with distribution $P(X=\varheart)=P(X=\clubsuit)=1/4, P(X=\spadesuit)=1/2$. In that case, $\supp(X) = \{\varheart, \clubsuit, \spadesuit \}$ and surprisal values in bits are $f(\varheart)=f(\clubsuit)=\log_2(4)=2, f(\spadesuit)=\log_2(2)=1$. The entropy is the expected surprisal value, i.e.\ the individual surprisal values weighted with their corresponding probabilities of occurring: $H(X) = \E_X[f(X)] = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 + \frac{1}{2} \cdot 1 = 3/2$. 39 | 40 | The entropy ``does not care'' about the actual outcomes or labels of a random variable, but only about the distribution! In fact, not even the order of the actual probabilities matters, as we are taking an expected value and the additive terms commute. You can verify that the calculation of $H(X)=3/2$ in the example above applies to all random variables $X$ with distribution $(1/2, 1/4, 1/4)$, no matter what the actual outcomes are. 41 | 42 | \begin{Exercise} 43 | Compute the entropy of $Y \sim Binomial(n=2,p=1/2)$. 44 | \end{Exercise} 45 | 46 | The simplest and simultaneously most important example of entropy is given in Figure~\ref{fig:binaryEntropy} which shows the entropy of the Bernoulli distribution as a function of the parameter $ \theta \in [0,1]$. The entropy function of the Bernoulli is often called the \textbf{binary entropy} $h(\theta) := -\theta \cdot \log_2(\theta) - (1-\theta) \log_2(1-\theta)$. It measures the information of a binary decision, like a coin flip or an answer to a yes/no-question.
The entropy of the Bernoulli attains its maximum of 1 bit when the distribution is uniform, i.e.\ when both choices are equally
probable. The entropy is 0 if and only if the coin is fully biased towards heads or tails. As explained above, the entropy of the distributions $(\theta, 1-\theta)$ and $(1-\theta,\theta)$ is the same; therefore $h(\theta)=h(1-\theta)$ and the graph is symmetric around $1/2$.

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{figure}[t!]

{\centering \includegraphics[width=\maxwidth]{figure/binaryEntropy-1}

}

\caption[Binary entropy function]{Binary entropy function}\label{fig:binaryEntropy}
\end{figure}


\end{knitrout}

\medskip
From the plot it is also easy to see that entropy is never negative. It holds in general that entropy is non-negative,
because entropy is defined as the expectation of surprisal and surprisal is the negative logarithm of probabilities.
Because $ \log(x) \leq 0 $ for $ x \in (0,1] $, it is clear that $ -\log(x) \geq 0 $ for $ x $ in the same
interval. Notice that from here on we drop the subscript and by convention let $ \log = \log_{2} $.

A standard interpretation of the entropy is that it quantifies uncertainty. As we have pointed out before, a uniform distribution means that you are most uncertain, and indeed the uniform distribution maximizes the entropy. Moreover, the more choices you have to pick from uniformly, the more uncertain you are going to be. The entropy function also captures this intuition. Notice that if a discrete distribution is uniform, all probabilities are $ \frac{1}{|\supp(X)|} $. Clearly, as we increase $ |\supp(X)| $, we decrease the probabilities. By decreasing the probabilities, we increase their negative logarithms, and hence the average surprisal. Let us make this intuition more formal.

\begin{Theorem}
A discrete RV $ X $ with uniform distribution and support of size $ n $ has entropy
$ H(X) = \log(n) $.
\end{Theorem}

\paragraph{Proof:}
\begin{align}
H(X) &= \underset{x \in \supp(X)}{\sum}-\log(P(X=x))P(X=x) \\
&= \underset{x \in \supp(X)}{\sum} -\log(\frac{1}{|\supp(X)|})P(X=x) \\
&= \underset{x \in \supp(X)}{\sum}\log(n)P(X=x) = \log(n) \, .
\hspace{1cm} \square
\end{align}

\begin{Exercise}
You are trying to learn chess and you start by studying where chess grandmasters move their king when it
is positioned in one of the middle fields of the board. The king can move to any of the adjoining 8 fields. Since
you do not know a thing about chess yet, you assume that each move is equally probable. In this situation,
what is the entropy of moving the king?
\end{Exercise}

One of the first important results in information theory is \href{https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem}{Shannon's source-coding theorem}, which states that the entropy $H(X)$ of a random variable $X$ measures how many bits one will need on average to encode an outcome that is generated by the distribution $ P_{X} $.
This result applies to the real-world problem of data compression. Assume that $N$ data points are drawn iid from the distribution $P_X$. In that case, the source-coding theorem tells us that on average, we will need $N \cdot H(X)$ bits to store the (optimally compressed) data. For example, let $P_X$ be the $Bernoulli(\theta)$ distribution over bits.
In the case $\theta=1/2$, we have $N$ perfectly random bits which cannot be compressed, and hence we need $N \cdot H(X) = N \cdot h(\theta) = N \cdot h(1/2) = N$ bits of storage. For the general case $\theta \neq 1/2$ when the individual bits are biased, the graph of the binary entropy $h(\theta)$ in Figure~\ref{fig:binaryEntropy} tells us exactly what the compression ratio will be. We will not cover the proof of the source-coding theorem here, but refer to the literature instead.


\section{Conditional Entropy}
At the outset of this chapter we promised you that you could easily transfer results from probability
theory to information theory. We will not be able to show any kind of linearity for entropy because it contains
log-terms and the logarithm is not linear. We can, however, find alternative expressions for joint entropy (where
the joint entropy is simply the entropy of a joint RV). Before we do so, let us also define the notion of
conditional entropy. We have seen in Section~\ref{sec:jointconditionaldistributions} that $P_{X|Y=y}$ is a valid probability distribution for any $y \in \supp(Y)$ such that $P(Y=y)>0$. Hence, we can also define its conditional entropy.

\begin{Definition}[Conditional Entropy]
For two jointly distributed RVs $ X,Y $ and $y \in \supp(Y)$ such that $P(Y=y)>0$, the conditional entropy of $ X $ given that $ Y=y $ is defined as
\begin{align*}
H(X | Y=y) &:= \E_X[-\log_{2}(P(X=x | Y=y))] \\
&= - \!\! \sum_{x \in \supp(X)} P(X=x | Y=y) \log_2(P(X=x | Y=y))\, .
\end{align*}
The conditional entropy of $X$ given $Y$ is defined as
$$ H(X | Y) := \E_Y[ H(X | Y=y) ] = \sum_{y \in \supp(Y)} P(Y=y) H(X | Y=y) \, .$$
\end{Definition}

Intuitively, $H(X | Y)$ is the (average) uncertainty of $X$ after learning $Y$. Learning $Y$ (and in fact any information) cannot increase your uncertainty about $X$. Formally, one can prove the following
\begin{Lemma}[see e.g.\ Proposition~4 of \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/2016/notes/CramerFehr.pdf}{this script}] \label{lemma:noincrease}
For any two random variables $X,Y$ with joint distribution $P_{XY}$, it holds that $H(X | Y) \leq H(X)$.
\end{Lemma}
Note, however, that this non-increase of uncertainty only holds on average, as illustrated by the following example:

\paragraph{Example}
Consider the binary random variables $X$ and $Y$, with joint distribution
\begin{align*}
&P(X=0,Y=0) = \frac{1}{2}, \quad P(X=0,Y=1) = \frac{1}{4}\\
&P(X=1,Y=0) = 0, \quad P(X=1,Y=1) = \frac{1}{4}.
\end{align*}
By marginalization, we find that $P(X=0) = \frac{3}{4}$ and $P(X=1) = \frac{1}{4}$, while $P(Y=0) = P(Y=1) = \frac{1}{2}$. This allows us to make the following computations:
\begin{align*}
H(X,Y) &= \frac{1}{2}\log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2}\\
H(X) &= h\left(\frac{1}{4}\right) = h\left(\frac{3}{4}\right) \approx 0.81\\
H(Y) &= h\left(\frac{1}{2}\right) = 1\\
H(X|Y) &= P(Y=0) \cdot H(X | Y=0) + P(Y=1) \cdot H(X | Y=1)\\
&= \frac{1}{2} \cdot 0 + \frac12 \cdot 1 = \frac12 \\
H(Y|X) &= P(X=0) \cdot H(Y | X=0) + P(X=1) \cdot H(Y | X=1)\\
&= \frac{3}{4} \cdot h\left(\frac{1}{3} \right) + \frac{1}{4} \cdot 0 \approx 0.69
\end{align*}
% We also could have computed $H(X|Y)$ and $H(Y|X)$ directly through the definition of conditional entropy.
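All of these values are easy to check numerically. The following small R sketch (our own helper function, not part of the chapter's code chunks) recomputes them from the joint distribution:
\begin{verbatim}
# entropy (in bits) of a discrete distribution given as a vector of probabilities
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

joint <- matrix(c(1/2, 1/4,
                  0,   1/4), nrow = 2, byrow = TRUE)  # rows: X = 0,1; columns: Y = 0,1

H(joint)                                         # H(X,Y) = 1.5
pX <- rowSums(joint); H(pX)                      # H(X)   ~ 0.81
pY <- colSums(joint); H(pY)                      # H(Y)   = 1
sum(pY * apply(sweep(joint, 2, pY, "/"), 2, H))  # H(X|Y) = 0.5
sum(pX * apply(joint / pX, 1, H))                # H(Y|X) ~ 0.69
\end{verbatim}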
Note that for this specific distribution, learning the outcome $Y=1$ increases the uncertainty about $X$: $H(X|Y=1) > H(X)$. On average, however, we always have $H(X|Y) \leq H(X)$. It is important to remember that Lemma~\ref{lemma:noincrease} only holds on average, not for specific values of $Y$. Note also that in this example, $H(X|Y) \neq H(Y|X)$.

It is not a coincidence that the joint entropy $H(X,Y)$ in the example above is equal to $H(X|Y)+H(Y)$ and $H(Y|X)+H(X)$. One can prove this chain rule in general:

\begin{align*}
H(X,Y) &= \underset{\substack{x \in \supp(X)\\y \in \supp(Y)}}{\sum} -\log(P(X=x,Y=y)) \times P(X=x, Y=y) \\
\begin{split}
&= \underset{\substack{x \in \supp(X)\\ y \in \supp(Y)}}{\sum} -\log(P(X=x \mid Y=y)) \times P(X=x,Y=y) \\
&\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times \sum_{x \in \supp(X)} P(X=x,Y=y)
\end{split} \\
\begin{split}
&=\sum_{y \in \supp(Y)} P(Y=y) \times \sum_{x \in \supp(X)} -\log(P(X=x \mid Y=y)) \times P(X=x \mid Y=y) \\ &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times P(Y=y)
\end{split} \\
&= H(X | Y) + H(Y) \; .
\end{align*}

\begin{Exercise}
Prove that $ H(X,Y | Z) = H(X | Z) + H(Y | Z) $ if $ X \bot Y \mid Z $.
\end{Exercise}
As a corollary, we get that $H(X,Y)=H(X)+H(Y)$ for independent random variables $X$ and $Y$. More generally, the entropy of $n$ independent random variables is $H(X_1^n) = \sum_{i=1}^n H(X_i)$.


\section{An Information-Theoretic View on EM}
Now that we have seen some information-theoretic concepts, you may be happy to hear that there is an information-theoretic interpretation
of EM. This interpretation helps us to get a better intuition for the algorithm. To formulate that interpretation we need
one more concept, however.

\begin{Definition}[Relative Entropy]
The relative entropy of RVs \\ $ X,Y $ with distributions $P_X, P_Y$ and $\supp(X) \subseteq \supp(Y) $ is defined as
$$ D(P_X||P_Y) := \sum_{x \in \supp(X)} P(X=x) \log \frac{P(X=x)}{P(Y=x)} \ . $$
If $ P(Y=x) = 0 $ for any $ x \in \supp(X) $ we define $ D(P_X||P_Y) = \infty $. As with entropy, we often abbreviate $D(P_X||P_Y)$ with $D(X||Y)$.
\end{Definition}

The relative entropy is commonly known as \textbf{Kullback-Leibler (KL)} divergence. It compares the two distributions outcome by outcome, weighting each log-ratio of probabilities by $ P_{X} $. Intuitively,
it gives a measure of how ``far away'' $ P_{X} $ is from $ P_{Y} $. To
understand ``far away'', recall that entropy is a measure of
uncertainty.
% The
% relative entropy measure the uncertainty that you have about $ P_{X} $ if you know $ P_{Y} $\chris{hard to see why at this point}.
The divergence is small if both distributions place most
of their mass on the same outcomes. Since $ \log(1) = 0 $, the relative entropy is 0 if $ P_{X} = P_{Y} $.

It is worthwhile to point out the difference between relative and conditional entropy. Conditional entropy is the average entropy of $ X $ given that you
know what value $ Y $ takes on. In the case of relative entropy you do not know the value of $ Y $, only its distribution.

\begin{Exercise}
Show that $ D(X,Y||Y) = H(X | Y) $. Furthermore show that $ D(X,Y||Y) = H(X) $ if $ X\bot Y $.
\end{Exercise}


Let us start by remembering why we need EM.
We have a model that defines a joint distribution
over observed ($ x $) and latent data ($ y $). Such a model generally looks as follows:
\begin{equation}
P(X=x, Y=y \mid \Theta = \theta) = P(X=x \mid Y=y, \Theta=\theta) P(Y=y \mid \Theta = \theta)
\end{equation}
where we have chosen a factorization that provides a separate term for a distribution over only the
latent data.

Recall that the goal of the EM algorithm is to iteratively increase the likelihood through consecutive
updates of parameter estimates. These updates are achieved through maximum-likelihood estimation based
on expected sufficient statistics. We are now going to show a) that EM computes a lower bound on the
marginal log-likelihood of the data in each iteration and b) that this lower bound becomes tight when the
expected sufficient statistics are taken with respect to the model posterior. The latter implies that
EM performs the optimal update in each iteration.

Let us start by expanding the data log-likelihood and then lower-bounding it.
\begin{align}
&\log(P(X=x \mid \Theta=\theta)) = \log(\sum_y P(X=x, Y=y \mid \Theta = \theta)) \\
&= \log\left(\sum_{y} Q(Y=y \mid \Phi=\phi)\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\
&\geq \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right)
\label{eq:ELBO1}
\end{align}
Here, we have used \href{https://en.wikipedia.org/wiki/Jensen\%27s_inequality}{Jensen's Inequality} to
derive the lower bound. Observe that the log is indeed a concave function.

We have also introduced
an auxiliary distribution $ Q $ over the latent variables with parameters $ \phi $.
For reasons that we will explain shortly,
this distribution is often called the \textbf{variational distribution} and its parameters the
\textbf{variational parameters}. The letter $ Q $ is slightly non-standard for denoting distributions, but
we are following conventions from the field of \textbf{variational inference} here.

In the next step, we factorise the model distribution in order to recover a KL divergence term between
the variational distribution and the model posterior over latent variables.
\begin{align}
&\sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(X=x, Y=y \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\
&= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)P(X=x \mid \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) \\
&= \sum_{y} Q(Y=y \mid \Phi=\phi) \log\left(\frac{P(Y=y \mid X=x, \Theta = \theta)}{Q(Y=y \mid \Phi=\phi)}\right) + \log(P(X=x \mid \Theta=\theta)) \\
&= -D(Q||P) + \log(P(X=x \mid \Theta=\theta)) \label{eq:ELBO2}
\end{align}
Equation~\eqref{eq:ELBO2} gives us two insights. First, it quantifies the gap between the lower bound
and the actual data log-likelihood. This gap is equal to the KL divergence between the variational distribution
and the model posterior over latent variables. Second, since the KL divergence is non-negative and equals 0 exactly when the two distributions coincide, the bound only becomes
tight when $ Q=P $. But this is exactly what happens in the E-step! The E-step sets $ Q=P $ and
then computes expectations under that distribution (see Equation~\eqref{eq:ELBO1}). Thus, the E-step tightens
the lower bound on the marginal log-likelihood.
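Before we turn to the M-step, it is instructive to watch this guarantee at work. Below is a minimal R sketch of a full EM loop (both steps) for a toy two-component Gaussian mixture with unit variances; the model and all numbers are our own illustration, not part of the chapter's derivation. The printed log-likelihood never decreases:
\begin{verbatim}
set.seed(42)
x <- c(rnorm(200, mean = -2), rnorm(200, mean = 3))   # synthetic data

w <- 0.5; m1 <- -1; m2 <- 1                           # initial parameter guesses
loglik <- function() sum(log(w * dnorm(x, m1) + (1 - w) * dnorm(x, m2)))

for (it in 1:15) {
  # E-step: set Q to the model posterior over the latent component indicator
  r <- w * dnorm(x, m1) / (w * dnorm(x, m1) + (1 - w) * dnorm(x, m2))
  # M-step: maximise the expected complete-data log-likelihood under Q
  w  <- mean(r)
  m1 <- sum(r * x) / sum(r)
  m2 <- sum((1 - r) * x) / sum(1 - r)
  cat(sprintf("iteration %2d: log-likelihood %.4f\n", it, loglik()))
}
\end{verbatim}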
Looking back at Equation~\eqref{eq:ELBO1}, we also see that the M-step increases the lower bound because
it maximises the expected complete-data log-likelihood $ \E\left[\log(P(X=x, Y=y\mid \Theta = \theta))\right] $ with respect to $ \theta $. We conclude that both steps
increase the lower bound on the log-likelihood, and therefore that EM increases the data likelihood
in every iteration (or leaves it unchanged at worst).

We will finish with a brief aside on variational inference. EM is a special case of variational inference.
Variational inference is any inference procedure which uses an auxiliary distribution $ Q $ to compute
a lower bound on the likelihood. In the general setting, the auxiliary distribution can be different from the
model posterior. This means that the bound never gets tight. However, in models in which the exact posterior
is hard (read: impossible) to compute, using a non-tight lower bound instead can be incredibly useful!

The reason this inference procedure is called \textit{variational} is that it is based on the
\href{https://en.wikipedia.org/wiki/Calculus_of_variations}{calculus of variations}. This works mostly
like normal calculus, except that standard operations like differentiation are done with respect to functions
instead of variables.

%Naively, we could take the expectation with respect to any distribution
%over latent values. Obviously, we would like to find the best one, i.e. the one that is closest to the
%actual posterior. We can formalize this by introducing an auxiliary distribution\footnote{We follow
%standard notation here by denoting the auxiliary distribution $ Q $ instead of $ P $. Also, the
%parameter variable is chosen so as to distinguish it from the parameter variable of our model.}
%$ Q(z\mid\Phi=\phi) $ under
%which we compute the expected sufficient statistics. We want to find the auxiliary distribution that
%is closest to actual posterior $ P_{Z\midX=x,\Theta=\theta} $. We measure closeness in an information-theoretic
%sense using KL-divergence. Formally, our goal is to find
%\begin{equation}
%Q^{*}_{Z\mid\Phi=\phi} = \underset{Q_{Z\mid\Phi=\phi}}{\mbox{arg min}}~D\left( Q_{Z\mid\Phi=\phi} || P_{Z \mid X=x,\Theta=\theta} \right) \ .
%\end{equation}



\section*{Further Material}

At the ILLC, there is a whole course about information theory, \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/}{currently taught by Christian Schaffner}. David MacKay also offers \href{http://www.inference.phy.cam.ac.uk/itprnn/book.pdf}{a free book on the subject}. Finally,
Coursera also offers \href{https://www.coursera.org/course/informationtheory}{an online course on information theory}.

The information-theoretic formulation of EM was pioneered in this \href{http://www.cs.toronto.edu/~fritz/absps/emk.pdf}{paper}. A very recent and intelligible
\href{https://arxiv.org/abs/1601.00670}{tutorial on variational inference} can be found on the arXiv.
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter7"
%%% End:
-------------------------------------------------------------------------------- /chapter1/chapter1_forInclude.tex: --------------------------------------------------------------------------------
\chapter{Basic Probability And Combinatorics}

\section*{Notational conventions}
In this script we make use of certain notational conventions. We \textbf{bold-face} newly introduced
technical terms on first mention. Those are the terms whose definitions you are expected to know by heart
in this and following courses. \textit{Italics} serve the purpose of highlighting passages in the
script but also to discriminate linguistic examples from the rest of the text. Occasionally, we will
point to online references outside of this script. The corresponding links are coloured in
\href{http://en.wikibooks.org/wiki/LaTeX/Hyperlinks}{blue} and you are encouraged to click them.

We denote sets with uppercase letters and overload notation by using $ |\cdot| $ as both a function
that yields the cardinality of a set and the length of a sequence. Besides using standard notation
for set union and intersection, we denote the complement of a set $ S $ with respect to another set $ X $ by
$ X\backslash S $.

\section{Introduction}
\subsection{Why study probability theory?}
The fact that you have picked up this script and started reading it demonstrates that you already have
some interest in learning about probability theory. This probably means that you also have some conception
of what probability theory is and what to do with it. Nevertheless, we will take the opportunity to
quickly give you some additional motivation for studying probability theory.

This script is all about formalizing the notion of probability. In particular, we are interested in
giving a formal interpretation to statements like ``A is more probable than B''. Let us take a simple
example to demonstrate why this is useful: Suppose it is Monday and you have a date scheduled for
Friday. Obviously you want to impress your date. Unluckily, however, you have a tendency to be broke
come the weekend. The decision you have to make now is whether to take your date to a fancy restaurant
(the impressive but expensive option) or to just go for drinks (the cheaper option). On what basis can
you make this decision? Well, you can ask yourself whether it is more likely that you are broke on
Friday night or not. If you think that you being broke is the more probable outcome, you go for drinks; otherwise
you opt for the fancy restaurant.

The above is an example where we have used the intuitive notion of probability to assist us in decision
making. The first part, the computation of the probabilities of events (e.g.\ you being broke or not),
is something that we are going to develop in some detail in this script. The second part, the development
of a so-called \textit{decision rule} (e.g.\ to plan for the circumstances that are most probable to
occur in the future), is something that will be covered in later courses.

Here is a second example of what one can do with probability theory. Assume you want to invest in the
stock market. You will be putting in some money now and then you want to cash in on your gains (or losses)
in ten years' time, say.
Notice that this time around simply asking whether it is more probable that your
stock has risen or fallen in price is not enough. Even if your stock is worth more in ten years than it
was when you bought it, the absolute increase may be so minuscule that you could have found much better
investment options that would have yielded more gains. Even worse, if your gain is a smaller percentage
of your original capital than the overall inflation that occurred during the ten years of your investment,
you will actually have incurred a loss in terms of pure market power! So instead of asking whether
or not your stock will be worth more than what it was when you first bought it, you should rather
ask how much of an absolute gain you can expect from your investment. This second application of probability
theory, the computation of expectations over real values, is something we are going to cover in this
script as well.

Alright, we hope that this has gotten you excited for the rest of the script. Let's get going!

\section{Sample spaces and events}
The whole of probability theory is based on assigning probability values to elements of a
\textbf{sample space}. The members of the sample space are referred to as \textbf{outcomes} or \textbf{samples}.

\begin{Definition}[Sample Space] A sample space is any \href{http://en.wikipedia.org/wiki/Borel_set}{Borel set}
$ \Omega $. We denote the members of a sample space by $ \omega \in \Omega $.
\end{Definition}

Standard examples of sample spaces are the flipping of a coin and the rolling of a die. Formally,
the sample space of a die roll is $ \Omega = \{1,2,3,4,5,6\} $. The sample space of a coin toss
would consist of heads and tails. However, it is often more convenient to represent outcomes numerically.
In the context of this course, we will achieve this by imposing any total order on the sample space and then identifying the outcomes with the positions they occupy in the corresponding ordered list. In this spirit we let
the sample space of a coin toss be $ \Omega = \{1,2\} $ where $ 1 $ represents heads and $ 2 $ represents
tails, say (the other way around would be just as fine).

More generally, we denote a sample space with $ n $ members as $ \Omega = \{1,\ldots,n\} $. A useful
metaphor that we will often use is to think of generating an outcome from a sample space as a blind draw from an urn with $ n $ balls
that are numbered and possibly coloured but otherwise indistinguishable. The rolling of a die, for example,
corresponds to drawing a ball from an urn with balls numbered $ 1 $ to $ 6 $. A somewhat more involved
example is that of writing an English sentence of six words, for example the sentence:
\textit{To be or not to be}. The process of writing this sentence can be conceptualized as drawing
six balls from an urn that contains balls corresponding to words
in the English language\footnote{This is obviously a very unrealistic conception of how English
sentences are written as it totally ignores the fact that the words in a sentence are dependent on each
other and have to be placed in a particular order.}. Note that this will be a rather large urn as
\href{http://www.languagemonitor.com/number-of-words/number-of-words-in-the-english-language-1008879}
{the vocabulary of the English language has already exceeded 1 million words}.
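If you like to experiment, the urn metaphor maps directly onto R's built-in sample function; the following one-liners (our own illustration) perform the draws just described:
\begin{verbatim}
sample(1:6, size = 1)                   # one die roll: a blind draw from an urn with balls 1..6
sample(1:2, size = 1)                   # a coin toss, with 1 for heads and 2 for tails
sample(1:10, size = 6, replace = TRUE)  # six draws with replacement from a (tiny) vocabulary urn
\end{verbatim}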
In our sample spaces as defined above, it is easy to distinguish individual outcomes. However, oftentimes
we do not care about the outcomes themselves but about properties that some of them share. In the
die example we might only be interested in whether the outcome is even or odd. Transferring this scenario to the urn metaphor, we would colour the balls with odd numbers green and the balls
with even numbers red. Again, any other colours are just as fine. All that matters is that
we can discriminate a member of $ E = \{2,4,6\} $ from a member of $ O = \{1,3,5\} $. We do \textit{not}
need to discriminate between the outcomes that are members of the same set! In this particular setting
$ E $ and $ O $ are the \textbf{events} that we are interested in.

\begin{Definition}[Event]
An event $ A $ is any subset $ A \subseteq \Omega $.
\end{Definition}

Events are what usually interest us in probability theory. Just as with outcomes, we can
also define the notion of an event space.

\newpage
\begin{Definition}[Event space]
An event space associated with a sample space $ \Omega $ is a set $ \mathcal{A} $ such that
\begin{enumerate}
\item $ \mathcal{A} $ is non-empty
\item If $ A \in \mathcal{A} $ then $ A \subseteq \Omega $
\item If $ A \in \mathcal{A} $ then $ \Omega \setminus A \in \mathcal{A} $
\item If $ A,B \in \mathcal{A} $ then $ A \cup B \in \mathcal{A} $
\end{enumerate}
\end{Definition}

Notice that since $ \emptyset \subseteq S $ for any set $ S $ we always have $ \Omega \in \mathcal{A} $
by item 3.

\begin{Exercise}
You can also arrive at the conclusion that $ \Omega \in \mathcal{A} $ always holds in a
different (and arguably more cumbersome) way. How so?
%Solution: By item 1, $ \mathcal{A} $ is non-empty. Thus we can assume $ A \in \mathcal{A} $. But then also
%$ A^{C_{\Omega}} \in \mathcal{A} $ by item 3. Item 4 then implies that $ \Omega \in \mathcal{A} $.
\end{Exercise}

The fact that event spaces are closed under the set complement operation is very convenient. Say I
organized a dinner party and invited $ 10 $ people. The day after, you ask me if more than $ 8 $ people
actually showed up. I just answer that I was very disappointed that my friends Mary and Paul did
not come. Although I did not directly address your question, you know that the answer is negative. After
all, I informed you that the complement event of the event you asked about had occurred.

\begin{Exercise}
In the above party example, what is the sample space? What is the smallest possible event space that is necessary to
model the situation just described?
% Solution: $ \Omega = \{x_{1} \ldots x_{10} | x_{i} \in \{0,1\}\} $
% $ \mathcal{A} = \{\Omega, \emptyset, \{\omega \in \Omega | \sum x_{i} > 8\},
% \{\omega \in \Omega | \sum x_{i} \leq 8\}\} $
\end{Exercise}

In general, we will not worry too much about constructing an event space every time we encounter a new
problem. The \textbf{power set} of the sample space conveniently happens to fulfil all the requirements
we have for event spaces, so we will just always use it. Thus, all we will ever need to worry about
is the construction of sample spaces since we now know how to construct event spaces from them in a
simple manner.
In case you are a bit rusty, here is a reminder of what a power
set is.

\begin{Definition}[Power Set]
The power set $ \mathcal{P}(S) $ of any set $ S $ is defined as $ \mathcal{P}(S) := \{ s \mid s \subseteq S \} $, i.e.\ the set of all subsets of $ S $.
\end{Definition}

In general, this leaves us with the pair $ (\Omega, \mathcal{P}(\Omega)) $. For outcomes in a sample space,
let us stress again an important difference, namely that $ \omega \in \Omega $ but
$ \{\omega\} \in \mathcal{A}$.

\section{Some basic combinatorics}
Combinatorics is the mathematics of counting. Counting is of course a very basic problem that may
be solved by just looking at each element of a set. However, this na\"ive procedure is often
unreasonably time consuming. Moreover, it does not allow us to make general statements about sets of any
size, i.e.\ sets of size $ n $.

In order to assess the size of our sample spaces, we would like to make such general statements. The reason
is that when we are dealing with probability we often start from \textbf{uniform probabilities}
on the sample space, where by uniform probability we simply mean the value $ \frac{1}{|\Omega|} $. This
is the probability we will assign to each and every $ \omega \in \Omega $. We now say that all the
elements in our sample space are equally probable.
Note that at this point we are using probabilities solely for the purpose of motivating combinatorics, which
is kind of a hack because we haven't even told you yet what a probability is. However, we hope that you
find the idea of uniform probabilities somewhat intuitive.

Let us start from scratch: What is the cardinality (size) of the sample space of a die roll? It
is $ 6 $ because $ |\{1,2,3,4,5,6\}| = 6 $. Now what if we roll two dice? The sample space for each
individual die is already known. Let us call it $ \Omega_{1} $. The sample space for the rolling of two dice
is then just the Cartesian product of two such sample spaces, i.e.\
$ \Omega_{2} = \Omega_{1} \times \Omega_{1} = \{(x,y)|x \in \Omega_{1}, y \in \Omega_{1}\} $. Since
the cardinality of the Cartesian product of two sets $ S $ and $ S' $ is $ |S| \times |S'| $, we conclude
that $ |\Omega_{2}| = |\Omega_{1} \times \Omega_{1}| = |\Omega_{1}| \times |\Omega_{1}|
= |\Omega_{1}|^{2} = 36 $.

Unsurprisingly, this method of performing a draw from the same sample space (urn) multiple times generalizes to any number of
times $ n > 2 $. Nicely enough, it also generalizes to sets of different sizes (again by the Cartesian product
argument from above). However, we have to impose one important restriction on the use of this technique: it
may only be applied when the sample spaces are independent, i.e.\ when the outcome of one space does
not affect the outcome of the other. Oftentimes, we will simply assume that this is the case, though.

The technique of inferring the size of a complex sample space from the sizes of the sample spaces
it is constructed from is known as the \textbf{basic principle of counting}.

\begin{Definition}[Basic principle of counting]
The basic principle of counting states that if two draws from sample spaces of size
$ M $ and $ N $ respectively are performed independently of each other then the sample space
composed from them has size $ M \times N $.
\end{Definition}

\begin{Exercise}
Let us assume that a football game is played for strictly 90 minutes. Both teams start with 11 players.
A red card to a player results
in that player being sent off the pitch. According to the rules of football, the game is stopped prematurely when either
team has only 6 or fewer players remaining on the pitch. We are now interested in how many possible
situations (we assume that situations occur in one-minute intervals) there are in which the game still progresses,
one or more red cards have been issued and exactly four goals have been scored. Give the corresponding sample space and its size.
%Solution: We define three sample spaces: $ \Omega_{M} = {1 \cdots 89} $ for minutes played,
%$ \Omega_{R} = \{(x_{1},x_{2})|x_{1},x_{2} \in \{0,1,2,3,4\}, x_{1} + x_{2} > 0\} $ for red cards shown and
%$ \Omega_{G} = \{(x_{1},x_{2})|x_{1},x_{2} \in \{0,1,2,3,4\}, x_{1} + x_{2} = 4\} $ for total goals
%scored. Clearly, $ |\Omega_{M}| = 89 $, $ |\Omega_{R}| = 20 $ and $ |\Omega_{G}| = 5 $. Our total
%sample space is the Cartesian product of those three and its size is $ 89\times 20 \times 5 = 8900 $.
\end{Exercise}

Note that up to now we have implicitly assumed that we would put every drawn ball back into the urn. This
is also referred to as \textbf{sampling with replacement}. Let us now look at problems for \textbf{sampling
without replacement}, i.e.\ problems where we are shrinking our sample space at each draw. One class of such
problems is known as \textbf{permutation} problems.

\begin{Definition}[Permutation]
A permutation on a set $ S $ is a bijection $ \sigma : S \rightarrow S : s \mapsto \sigma(s) $.
\end{Definition}

Oftentimes people also use the word permutation to refer to the image of a set under a permutation. What we
need permutations for in practice is the reordering of ordered sets (which we will call lists). For example
the permutations of the list $ L = (1,2,3) $ are:
\begin{itemize}
\item $ \sigma_{1} = \{1 \mapsto 1, 2 \mapsto 2, 3 \mapsto 3 \} \hfill \sigma_{1}(L) = (1,2,3) $
\item $ \sigma_{2} = \{1 \mapsto 1, 2 \mapsto 3, 3 \mapsto 2 \} \hfill \sigma_{2}(L) = (1,3,2) $
\item $ \sigma_{3} = \{1 \mapsto 2, 2 \mapsto 1, 3 \mapsto 3 \} \hfill \sigma_{3}(L) = (2,1,3) $
\item $ \sigma_{4} = \{1 \mapsto 2, 2 \mapsto 3, 3 \mapsto 1 \} \hfill \sigma_{4}(L) = (2,3,1) $
\item $ \sigma_{5} = \{1 \mapsto 3, 2 \mapsto 1, 3 \mapsto 2 \} \hfill \sigma_{5}(L) = (3,1,2) $
\item $ \sigma_{6} = \{1 \mapsto 3, 2 \mapsto 2, 3 \mapsto 1 \} \hfill \sigma_{6}(L) = (3,2,1) $
\end{itemize}

The way to think about a permutation as a draw from an urn is to look at each of the positions in the list in
turn and insert an element from $ S $. Since a permutation is a bijection, we can only use each
$ s \in S $ exactly once. This is precisely what it means to sample without replacement. Once a ball
is drawn, it is removed from the urn. Let us make this effect concrete in the above example. For position one
we have three elements to choose from. Hence we are dealing with a sample space of size $ 3 $. Position two
still leaves us $ 2 $ choices, giving us a sample space of size $ 2 $. Finally, the element in the last position
is totally determined as we are dealing with a sample space of size $ 1 $.
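This draw-by-draw argument is easy to verify by brute force. The short R sketch below (a recursive helper of our own, not from the script) enumerates all orderings of a small list and confirms the count we are about to derive:
\begin{verbatim}
perms <- function(v) {                 # all orderings of the vector v
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in perms(v[-i]))            # fix v[i] in front, permute the rest
      out <- c(out, list(c(v[i], p)))
  out
}
length(perms(1:3))                     # 6 orderings, i.e. 3 * 2 * 1
factorial(3)                           # the same number via R's factorial
\end{verbatim}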
% The danger here is that we might be giving the impression that you can sample from a sample space $\Omega$ without replacement which does not make much sense in the probability world.

Applying the basic principle of counting we now know that there are $ 3 \times 2 \times 1 $ permutations of the list
$ (1,2,3) $. Incidentally, this proves our above example to be correct. More generally, if we have to reorder
a list with $ n $ distinct elements (or draw without replacement from an urn with $ n $ numbered balls), there
are $ n \times (n-1) \times \ldots \times 2 \times 1 $ permutations. Since this is pretty painful to write down,
we introduce a more succinct notation, provided by the \textbf{factorial} function.

\begin{Definition}[Factorial]
The factorial $ n! $ of a non-negative natural number $ n \in \mathbb{N} $ is defined recursively as
\begin{itemize}
\item $ 0! = 1 $
\item $ k! = k\times (k-1)! $ for $ 0 < k \leq n $
\end{itemize}
\end{Definition}

From the above discussion we can now conclude that the number of permutations on a set or list of size $ n $
is $ n! $.

We can also define the notion of a k-permutation on a set $ S $ of size $ n $ such that $ k < n $.
This means we are still drawing without replacement but we do not fully empty the urn. The reasoning for how
many of those k-permutations there are remains exactly the same. There are $ n \times (n-1) \times \ldots \times (n-k+2)
\times (n-k+1) $ such permutations (make sure you understand why!). In order to ease notation we can again
sneak in the factorial through multiplying this number with $ 1 $ in disguise. Concretely, we write
\begin{align*}
&n \times (n-1) \times \ldots \times (n-k+2) \times (n-k+1) \times 1 \\
=& n \times (n-1) \times \ldots \times (n-k+2) \times (n-k+1) \times \frac{(n-k)!}{(n-k)!} \\
=& \frac{n!}{(n-k)!}
\end{align*}
for the number of k-permutations on a set of size $ n $.

We will not see k-permutations all that often in this script but they constitute a helpful stepping stone to another
concept that will be of crucial importance. Let us draw $ k $ balls from an urn with $ n $ balls where $ k \leq n $ and disregard
the order in which we draw them. A classical example of such a setting would be the lottery, where you are only interested in the
balls drawn but not in the order in which they were drawn. We already know that for a set of $ k $ balls there are $ \frac{n!}{(n-k)!} $
orders in which we can draw them, as this is a $ k $-permutation on our urn. Now, though, we need to get rid of the different
orderings. This is to say that we want to count each set of $ k $ balls that we can draw only once and not once per permutation of it.
Luckily, we know how many permutations of a set of size $ k $ there are, namely $ k! $. Thus we divide out this number of permutations,
yielding $ \frac{n!}{(n-k)!\times k!} $ as the number of possible ways to draw $ k $ \textit{different} balls from an urn with $ n $
balls.
At this point we should take a break and pat our own backs. After all, we have just derived one of the most important combinatorial
formulas, which is known as the \textbf{binomial coefficient}.
\begin{Definition}[Binomial coefficient]
The binomial coefficient $ \binom{n}{k} $ is defined as
$$ \binom{n}{k} := \dfrac{n!}{(n-k)!\times k!} $$
for $ 0 < n, 0 \leq k \leq n $. It counts the number of ways
to sample $ k $ distinct elements from a set with a total of $ n $ elements without regard to the order in which they are drawn.
For this reason, it is pronounced ``n choose k''.
\end{Definition}

\begin{Exercise}
In the German lottery you have to bet on a set of $ 6 $ numbered balls to be drawn out of a total of $ 49 $ balls. Assuming that
each ball is equally likely to be drawn, what is the chance of an individual bet to win the jackpot? The Dutch lottery is
slightly more involved. They also draw an additional coloured ball from $ 6 $ coloured balls. In order to win the jackpot you need to have
the number-colour combination right. What is your chance here?
%Solution: There are $ \binom{49}{6} = 13{,}983{,}816 $ ways of betting on $ 6 $ balls. Thus the win probability is $ \frac{1}{13{,}983{,}816} $.
%For the Dutch lottery it is even $ \frac{1}{13{,}983{,}816 \times 6} = \frac{1}{83{,}902{,}896} $.
\end{Exercise}

The binomial coefficient will become crucially important later on. A common application that you will see in this and other courses
is counting the number of bit strings with certain properties. A bit is a variable that can take on values in $ \{0,1\} $. By the
basic principle of counting there are $ 2^{n} $ bit strings of length $ n $. How many bit strings of length $ 5 $ are there that contain
exactly $ 3 $ ones? Well, there are $ 2^{5} = 32 $ bit strings of that length in total and $ \binom{5}{3} = 10 $ of them contain exactly
three ones. Unsurprisingly, this is the same as the number of 5-bit strings with exactly $ 2 $ zeros.
The moral lesson here is that $ \binom{n}{k} = \binom{n}{n-k} $, as can easily be seen from the definition. Some other trivia about the
binomial coefficient are that $ \binom{n}{0} = \binom{n}{n} = 1 $. Again, this follows directly from the definition. Somewhat trickier
is the fact that $ \binom{n}{1} = \binom{n}{n-1} = n $. Can you derive this?

We can straightforwardly generalize the idea of the binomial coefficient to choosing more than just one set of objects. This means that instead of just
looking at red versus non-red balls, say, we now distinguish between all the colours in our urn. For our strings this means that we move away from
bit strings to strings with larger alphabets, e.g.\ strings written in the English alphabet (which has 26 letters). Let's say we have $ r $ red,
$ b $ blue, $ g $ green and $ y $ yellow balls in our urn such that $ n = r+b+g+y $ is the total number of balls in the urn. How many different
colour sequences can we draw? Well, we first arrange the $ r $ red balls in $ r $ out of $ n $ positions. This can be done in
$ \binom{n}{r} $ ways. We then place the $ b $ blue balls in $ \binom{n-r}{b} $ ways. Next, we place the $ g $ green balls in $ \binom{n-r-b}{g} $
ways. Finally, we place the remaining yellow balls deterministically in the remaining positions since $ \binom{n-r-b-g}{y} = \binom{y}{y} = 1 $.
We compute the total number of arrangements as

\begin{gather}
\binom{n}{r} \binom{n-r}{b} \binom{n-r-b}{g} \binom{n-r-b-g}{y} = \\
\dfrac{n!}{r!\times (n-r)!} \times \dfrac{(n-r)!}{b!\times (n-r-b)!} \times \dfrac{(n-r-b)!}{g! \times (n-r-b-g)!} \times 1 = \\
\dfrac{n!}{r!b!g!y!}
\end{gather}

Observe that the last equality follows because many of the factorials cancel and because we know that $ n-r-b-g = y $. We have now worked with
only four colours, but the general case follows directly by induction on the number of colours (with the binomial coefficient as base case).
Thus, we can define the \textbf{multinomial coefficient}.

\begin{Definition}[Multinomial coefficient]
The multinomial coefficient for choosing $ k $ sets of objects of sizes $ m_{1}, \ldots, m_{k} $ from a total of $ n = \sum_{i=1}^{k} m_{i} > 0 $
objects is $$ \dfrac{n!}{\prod_{i=1}^{k} m_{i}!} $$
\end{Definition}

\section*{Further material}
For a slow and thorough introduction to combinatorics, see \href{http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111840436X.html}{Faticoni (2013):
Combinatorics}. At the ILLC, there is \href{http://homepages.cwi.nl/~rdewolf/combinatorics14.html}{a biannual course on combinatorics},
taught by Ronald de Wolf. Online, Princeton also offers \href{https://www.coursera.org/course/ac}{a course on combinatorics}.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter1"
%%% End:
-------------------------------------------------------------------------------- /multivariateGaussian/multivariateGaussian_forInclude.tex: --------------------------------------------------------------------------------
\chapter{The Gaussian Distribution}

If there is any one distribution that has traversed mathematics and found a home in cultural memory, it surely is the \textbf{Gaussian} or \textbf{normal distribution}
(both names are common and we will use them interchangeably here).
Not only is it super-useful in many data modelling applications, it also has a host of convenient mathematical properties, some of which we are going
to explore in this chapter.

Before going into any detail, let us first motivate this distribution. What we want is a distribution on a real vector space ($ \mathbb{R}^{n} $). We will
start out with the simplest case and first look at the Gaussian distribution on the real line. Our desiderata for the Gaussian\footnote{Notice that Gauss' original
motivation was different from ours. While we are giving a largely geometric account of the normal distribution, Gauss was concerned with finding a distribution
on $ n $ independent points whose maximum likelihood estimate (see Section~\ref{eq:parameterEstimation}) is their arithmetic mean.} are as follows:
\begin{itemize}
\item The distribution should be centred around one specific point which we will call the mean
\item The more distant a point is from the mean, the less probable it should be
\item The distance metric should be adjustable so as to assign distant points more or less probability as needed
\item Equally distant points should have the same probability, independent of their direction
\end{itemize}

\section{The Univariate Gaussian}

The Gaussian distribution is one of the most important and most widely used distributions in all of statistics.
The reason is that many natural observations
tend to be normally distributed. Many other distributions are also based on it or can be approximated by a Gaussian. Finally, there are several mathematical
properties of the Gaussian that make calculating with it rather easy. In this section we will look at the \textbf{univariate} Gaussian distribution, that is,
the Gaussian distribution in one dimension. In Section~\ref{sec:mvGauss} we will also see how to model highly structured data in $ \mathbb{R}^{n} $
with \textbf{multivariate} Gaussian distributions.

\begin{figure}
\center
\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}
\includegraphics[width=\maxwidth]{figures/uniGauss-1}

\end{knitrout}
\caption{Standard normal density (left) and with variance $ \sigma^{2} = 2 $ (right).}
\label{fig:uniGauss}
\end{figure}

\subsection{Deriving the Density}

What we want is a distribution that models spatial data, i.e.\ data that lives in some vector space. There should be a centre of mass around which the data concentrates,
and deviation from that centre of mass should be ``penalized'', meaning that the further away from the centre a data point is, the less probable it should be.
Since we are interested in modelling spatial data in real vector spaces, we will choose the \href{https://en.wikipedia.org/wiki/Euclidean_distance}{Euclidean
distance} as a distance measure. In the case of one dimension, the Euclidean distance is simply the absolute difference, and we will work with its square: for $ x,y \in \mathbb{R} $, the squared Euclidean
distance is $ (y - x)^{2} $. Notice that this is symmetric, as any good distance metric should be.

As it looks right now, all deviations are going to be penalized to the same extent: the penalty depends only on the difference between the two points, on one fixed scale.
What if we want to be a bit stricter and penalize points that are far away from the centre even more, or conversely, if we wanted to be lenient and diminish the penalty
for deviation from the centre? In such a case we would have to scale the Euclidean distance. In fact, there is a generalization of the Euclidean distance that
allows for scaling. It is called the \href{https://en.wikipedia.org/wiki/Mahalanobis_distance}{Mahalanobis distance}. In the one-dimensional case, it introduces
a scale factor by which the difference between two points is scaled. The (squared) Mahalanobis distance between $ x,y \in \mathbb{R} $ is
\begin{equation*}
\left(\frac{x - y}{\sigma}\right)^{2}
\end{equation*}
where $ \sigma > 0 $ is an adjustable scale factor. If $ \sigma < 1 $ it will exaggerate the difference between $ x $ and $ y $ and hence lead to a greater
penalty for distant points. Conversely, if $ \sigma > 1 $ it will lessen the difference between $ x $ and $ y $ and therefore lead to a smaller penalty for
distant points. The square of $ \sigma $ is called the \textbf{variance} and is used to parametrize the Gaussian distribution, while $ \sigma $ itself is
known as the \textbf{standard deviation}.

Now that we have found an appropriate (and adjustable!) distance metric, we have to turn it into a probability density. The standard way of turning any quantity
into a probability density is by simply exponentiating it. This way, it is guaranteed to be positive.
In the present case, we want the probability
to decrease as the distance between the two points increases. Thus we are actually going to exponentiate the negative of the Mahalanobis distance. Finally, we
might want to differentiate that distance at some point. Whenever we do so, we are going to have to deal with the squaring function. In order to make our lives
easier when differentiating, we also prefix the Mahalanobis distance with $ \nicefrac{1}{2} $ before exponentiating it. The result is
\begin{equation}
\exp\left(-\frac{1}{2} \left(\frac{x - y}{\sigma}\right)^{2} \right) \ .
\end{equation}

Notice that so far we have one adjustable parameter in that expression, namely the scale factor $ \sigma $ from the Mahalanobis distance. Initially we said
that points which follow a Gaussian distribution should be arranged around a centre, which is more commonly known as the \textbf{mean}.
Let us call this mean $ \mu $. In order to vary the location of the centre,
we turn $ \mu $ into a parameter (we simply replace $ y $ with $ \mu $). To recap, $ \mu $ determines the location
of the centre of the Gaussian density and $ \sigma $ scales it. The parameters are therefore called \textbf{location parameter} and \textbf{scale parameter}, respectively.
We now have an expression that is proportional to the Gaussian density. Whenever a RV $ X $ is distributed according to a normal distribution with location parameter
$ \mu $ and scale parameter $ \sigma $, we write $ X \sim \N{\mu}{\sigma^{2}} $. The corresponding density is
\begin{equation}
p(x) \propto \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2} \right) \ .
\end{equation}
In order to get a proper density, we still need to normalize. This requires a non-trivial integration that falls outside the scope of this subsection\footnote{If
you are interested in seeing several different proofs, check \href{https://en.wikipedia.org/wiki/Gaussian_integral}{here}. Laplace's proof is probably the easiest to follow.}. We
will just state the normalizer here. The full univariate normal density with parameters $ \mu $ (mean) and $ \sigma^{2} $ (variance) is
\begin{equation}
p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2} \right) \ .
\end{equation}

Notice that we actually never need this general density. Why? We can transform any Gaussian distribution into a \textbf{standard normal distribution}. This is
the normal distribution with 0 mean and unit variance, $ \N{0}{1} $. It is so important that its density even has its own notation:
\begin{equation}
\varphi(x) = p(x) \mbox{ where } X \sim \N{0}{1} \ .
\end{equation}
(The corresponding cumulative distribution function is commonly denoted $ \Phi $.) Any Gaussian variable can be normalized to a standard normal variable. This is often done in many applications.

\begin{Exercise}
Show that for $ X \sim \N{\mu}{\sigma^{2}} $ we have $ \frac{X - \mu}{\sigma} \sim \N{0}{1} $. The processes of subtracting the mean and dividing
by the standard deviation are called centering and normalization, respectively.
\end{Exercise}

\section{The Multivariate Gaussian$ ^{*} $}\label{sec:mvGauss}

Our goal in this section is to define a Gaussian distribution on $ \mathbb{R}^{n} $. This will require quite a bit of linear algebra. Readers who have not taken
a linear algebra course are advised to skip this section.
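As a preview of where this section is heading, the following R sketch (with made-up numbers; the mvtnorm package used elsewhere in this chapter provides rmvnorm for the same purpose) generates correlated Gaussian vectors by pushing independent standard normal coordinates through an affine transformation:
\begin{verbatim}
n <- 5000
mu <- c(1, -1)                            # desired mean vector
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2)    # desired (symmetric, PSD) covariance matrix

L <- t(chol(Sigma))                       # any L with L %*% t(L) == Sigma will do
Z <- matrix(rnorm(2 * n), nrow = 2)       # columns are i.i.d. standard normal vectors
X <- mu + L %*% Z                         # affine transformation

rowMeans(X)                               # close to mu
cov(t(X))                                 # close to Sigma
\end{verbatim}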
Let us start out by considering a random vector whose $ n $ dimensions are independent. That means that the probability of the vector can be factorised:
\begin{equation}
p(\vec{x}) = \prod_{i=1}^{n} p(x_{i})
\end{equation}
If each of the dimensions is distributed according to the same Gaussian $ \mathcal{N}(\mu, \sigma^{2}) $, we can easily generate random vectors of this form
by making $ n $ independent draws from that Gaussian.

Unfortunately, this severely limits our ability to model data. Not only can we never model correlations between dimensions, we also require that all dimensions
have the same variance. The data that we can model needs to be extremely homogeneous.

We could lessen this problem by drawing each random dimension from a different Gaussian. This way, we would be able to assign different means and variances to different
dimensions. However, we could still not capture covariances. What we need is a single Gaussian over $ \mathbb{R}^{n} $. This will allow us to model (potentially) dependent
dimensions. Having a mean vector with different mean values per dimension is trivial. In fact, we will further assume that the means of the dimensions are independent of
each other. Thus our mean vector will simply be
$ \vec{\mu} = \begin{bmatrix}
\mu_{1} & \ldots & \mu_{n}
\end{bmatrix} $. We do, however, allow the dimensions to be correlated. To express such correlations we need to compactly store the variances and
covariances of the dimensions. To do this, we introduce \textbf{covariance matrices}.

\subsection{Covariance Matrices}

\begin{Definition}[Covariance matrix]
An $ n \times n $ matrix $ \Sigma $ is called a covariance matrix of an $ n $-dimensional RV $ X $ if for $ 0 < i,j \leq n $
$$ \Sigma_{j,i} = cov(X_{j}, X_{i}) \ . $$
\end{Definition}

The covariance matrix has a couple of important properties which we will use when computing with it.
\begin{enumerate}
\item \textbf{Symmetry:} follows from the definition and the symmetry of the covariance.
\item \textbf{Positive semi-definiteness:} see below.
\end{enumerate}
Notice that some authors will actually define covariance matrices to be symmetric, positive semi-definite matrices. This is fine in so far as any matrix with
these properties is a valid covariance matrix. When we construct models of data, we may actually simply stipulate the (co)variances and thus build a covariance matrix.

\textbf{Proof of positive semi-definiteness} Recall that an $ n \times n $ matrix $ M $ is positive semi-definite (PSD)
if for all $ z \in \mathbb{R}^{n}\setminus \{0\} $ it holds
that $ z^{\top}Mz \geq 0 $. Observe that we can write a covariance matrix $ \Sigma $ as the expectation of an outer product.
\begin{equation}
\Sigma = \E\left[(\vec{X} - \E[\vec{X}])(\vec{X} - \E[\vec{X}])^{\top}\right]
\end{equation}
For all $ z \in
\mathbb{R}^{n}\setminus \{0\} $ we have
\begin{align}
z^{\top}\Sigma z &= z^{\top} \E\left[(\vec{X} - \E[\vec{X}])(\vec{X} - \E[\vec{X}])^{\top}\right] z \\
&= \E\left[ z^{\top} (\vec{X} - \E[\vec{X}])(\vec{X} - \E[\vec{X}])^{\top} z \right] \\
&= \E\left[ \left((\vec{X} - \E[\vec{X}])^{\top} z\right)^{\top} \left((\vec{X} - \E[\vec{X}])^{\top} z\right) \right] \\
&= \E\left[ \left((\vec{X} - \E[\vec{X}])^{\top} z\right)^{2} \right] \geq 0
\end{align}
where the last line holds because $ (\vec{X} - \E[\vec{X}])^{\top} z $ is a scalar. The result essentially follows from the linearity of expectation.

The importance of being positive semi-definite may not be immediately apparent. It lies in the fact that many results are easily proven for positive semi-definite
matrices. Any result that holds for positive semi-definite matrices also holds for covariance matrices. We will occasionally use this property in our proofs below.

Another important result is based solely on the symmetry of the matrix. By the spectral theorem we know that any symmetric matrix $ M $ can be factorized as
\begin{equation}\label{eq:eigenvalueDecomp}
M = U \Lambda U^{-1}
\end{equation}
where $ \Lambda $ is a diagonal matrix and $ U $ is orthonormal. Let us try to interpret this decomposition. The orthonormal matrix $ U^{-1} $ is a linear map from
$ \mathbb{R}^{n} $ to $ \mathbb{R}^{n} $. It effectively rotates the input. The matrix $ \Lambda $ then scales each coordinate of the input and finally the matrix
$ U $ rotates the scaled input back. From the spectral theorem we know that the entries of $ \Lambda $ are the eigenvalues of $ M $. Therefore, the columns of $ U $
are the corresponding eigenvectors normalized to unit length. The decomposition thus gives us an efficient way of finding the eigenvalues of $ M $. We are now going
to show that these eigenvalues are always non-negative for PSD matrices.

\begin{Lemma}[Eigenvalues of PSD matrices are non-negative]
Assume this was not the case. Let $ z $ be an eigenvector of a positive
semi-definite matrix $ A $ with negative eigenvalue $ \lambda $. Then we get $ z^{\top}Az = z^{\top}\lambda z = \lambda z^{\top} z < 0 $, which contradicts the
premise that $ A $ is positive semi-definite. $ \square $
\end{Lemma}

We conclude that positive semi-definite matrices (and thus covariance matrices) only have non-negative eigenvalues.
This in turn implies that PSD matrices always have square roots. These roots can easily be derived as
\begin{equation}\label{eq:PSDRoots}
M = U \Lambda^{\nicefrac{1}{2}}\Lambda^{\nicefrac{1}{2}} U^{-1} = \left(U \Lambda^{\nicefrac{1}{2}}U^{-1}\right) \left(U \Lambda^{\nicefrac{1}{2}} U^{-1}\right), \quad \mbox{hence} \quad M^{\nicefrac{1}{2}} = U \Lambda^{\nicefrac{1}{2}} U^{-1} \ .
\end{equation}

Covariance matrices are not always used in practice. It is sometimes more convenient to use their inverse instead. That inverse, $ \Sigma^{-1} $, is called a precision
matrix. The names are telling: the entries in the covariance matrix measure to what extent two dimensions grow or shrink in relation to each other. The higher
that value, the more deviation from the mean we will observe.
The entries in the precision matrix tell us how precise (i.e.
Covariance matrices are not always used in practice. It is sometimes more convenient to use their inverse instead. That inverse, $ \Sigma^{-1} $, is called a precision
matrix. The names are telling: the entries in the covariance matrix measure to what extent two dimensions grow or shrink in relation to each other. The higher
that value, the more deviation from the mean we will observe.
The entries in the precision matrix tell us how precise (i.e.\ how close to the mean) the distribution is. Higher
precision means that we are going to observe less deviation from the mean vector.

\subsection{Deriving the Density}

Now that we have learned about the covariance matrix, we are all set to define the multivariate Gaussian. Let us take a step back and remind ourselves of how easy
it was to generate random vectors with independent means and variances. For the multivariate Gaussian, we will have to replace the mean with a mean vector (whose
components are again independent\footnote{Notice that our presentation is taking place in a frequentist setting. In Bayesian probability theory, the claim that
the dimensions of the mean vector are independent may very well be false.}) and a covariance matrix, changing the notation from $ \N{\mu}{\sigma^{2}} $
to $ \N{\vec{\mu}}{\Sigma} $. As before, the parameter values are exactly equal to the mean and (co)variance of the distribution.

While we have not yet properly defined the multivariate Gaussian, we can already explore some of its properties. By simple linearity of expectation, we have
for any vector $ \vec{y} \in \mathbb{R}^{n} $ and any random vector $ X \sim \N{\vec{\mu}}{\Sigma} $ that $ \E[X + \vec{y}] = \E[X] + \vec{y} $ and therefore that
$ X + \vec{y} \sim \N{\vec{\mu} + \vec{y}}{\Sigma} $. Similarly, by properties of the (co)variance and the expectation,
we know that for any matrix $ A \in \mathbb{R}^{n \times n} $ it holds that
$ var(AX) = A \, var(X) A^{\top} = A\Sigma A^{\top} $ and therefore that $ AX \sim \N{A\vec{\mu}}{A\Sigma A^{\top}} $.

Taken together, the fact that $ AX + \vec{y} \sim \N{A\vec{\mu} + \vec{y}}{A\Sigma A^{\top}} $ is called the \textbf{affine property} of the Gaussian distribution.
Any affine transformation of a Gaussian RV will again yield a Gaussian RV. We will exploit this fact to define the multivariate Gaussian distribution. Recall
how easy our lives would be if all dimensions of the multivariate Gaussian were independent; easier still if their variances were also identical.
Let us start from this scenario with unit variance. The \textbf{standard multivariate Gaussian} is then simply $ \N{0}{I} $ where $ I $ is the identity matrix. Only
the diagonal of this matrix is populated and all diagonal values are the same, meaning
that the dimensions are uncorrelated (and, for a Gaussian, therefore independent) and have identical unit variance. We can now construct infinitely many other multivariate Gaussians with the same covariance
properties by shifting the mean. Given a RV $ X \sim \N{0}{I} $ we achieve this by defining $ Y = X + \vec{\mu} \sim \N{\vec{\mu}}{I} $ (this follows from the affine
property), where $ \vec{\mu} $ is our desired mean.

Now that we can derive multivariate Gaussians with any mean we like, let us turn to the covariance matrix. We can change the identical variance by simply multiplying
a standard normal RV with a scalar of our choice. Formally, if $ X \sim \N{0}{I} $ and $ \sigma \in \mathbb{R} $ then
$ Y = \sigma X \sim \N{0} {\sigma I \sigma} = \N{0} {\sigma^{2}I} $. This shouldn't come as too much of a surprise since this is how we would adjust
the variance of a univariate Gaussian. However, we still cannot model covariance at this point.

Instead of multiplying $ X \sim \N{0}{I} $ with a scalar, let us use a matrix instead.
In the interest of relieving all suspense, let us call this matrix $ \Sigma^{\nicefrac{1}{2}} $
(you see where we are going with this, don't you?). By the affine property, we have
$ Y = \Sigma^{\nicefrac{1}{2}} X \sim \N{0}{\Sigma^{\nicefrac{1}{2}} I \left(\Sigma^{\nicefrac{1}{2}}\right)^{\top}} = \N{0}{\Sigma} $.
%Notice that although we have called the matrix that we multiply $ X $ with $ \Sigma^{\nicefrac{1}{2}} $, we do usually not need to compute this square root explicitly. Any
%square matrix will do the job.

What we can conclude from the above is that we can derive any multivariate Gaussian distribution from the standard multivariate normal simply by applying an
appropriate affine transformation. Thus, all we need to do is to derive the density for the standard multivariate Gaussian. This is super-simple! The mean is $ 0 $
in all dimensions and the variances are identically and independently 1. For a random vector $ \vec{X} \sim \N{0}{I} $ and any $ \vec{x} \in \mathbb{R}^{n} $ this means
\begin{align}\label{eq:mvstandardNormal}
p(\vec{x}) &= \prod_{i=1}^{n} p(x_{i}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \times 1} \exp \left(-\frac{1}{2}\left( \frac{x_{i} - 0}{1} \right)^{2} \right) \\
&= \frac{1}{\sqrt{(2\pi)^{n}}} \exp \left(-\frac{1}{2}\left( \sum_{i=1}^{n}x_{i}^{2} \right) \right) \ . \nonumber
\end{align}

We know the density of the standard multivariate normal distribution and we know how to derive any other multivariate Gaussian from that distribution. Before
we derive the general density for multivariate Gaussians, let us finally define multivariate Gaussian RVs.
\begin{Definition}[Multivariate Normal Distribution]
An $ n $-dimensional random vector $ \vec{X} \in \mathbb{R}^{n} $ has a multivariate normal distribution with an $ n $-dimensional mean parameter
$ \vec{\mu} $ and an $ n \times n $ covariance matrix $ \Sigma $ if it has the same distribution as $ \vec{\mu} + LZ $ where $ LL^{\top} = \Sigma $ and
the dimensions of $ Z $ are i.i.d. according to a univariate standard normal distribution, i.e. $ Z_{i} \sim \N{0}{1} $ for $ 0 < i \leq n $.
\end{Definition}
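This definition translates directly into a sampling recipe. Here is a sketch of ours (it reuses \texttt{Sigma} and \texttt{Sigma.half} from the square-root sketch above; the mean vector is an arbitrary choice, and any $ L $ with $ LL^{\top} = \Sigma $ would do):

\begin{verbatim}
# Sketch: sampling from N(mu, Sigma) as Y = mu + L Z per the definition.
mu <- c(1, -2)
L  <- Sigma.half                          # one valid choice of L
Z  <- matrix(rnorm(2 * 10000), nrow = 2)  # columns are iid N(0, I) vectors
Y  <- mu + L %*% Z                        # 10000 samples, one per column
rowMeans(Y)                               # close to mu
cov(t(Y))                                 # close to Sigma
\end{verbatim}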
With this definition at hand, let us derive the general multivariate density. The problem is that under a general covariance matrix the dimensions are not independent anymore.
Thus, we cannot readily apply the factorization from Equation~\eqref{eq:mvstandardNormal}. The question now is whether we can substitute the covariance matrix with
another matrix under which the dimensions are indeed independent. The spectral theorem answers this question positively. Recall that all symmetric matrices can be decomposed
according to Equation~\eqref{eq:eigenvalueDecomp}. The matrix $ \Lambda $ has the eigenvalues on its diagonal. Since it is similar to the original matrix,
they both have the same eigenvalues. It is clear from Equation~\eqref{eq:PSDRoots} that we can use $ \Sigma^{\nicefrac{1}{2}} = U \Lambda^{\nicefrac{1}{2}} U^{-1} $ when applying the affine transformation of the standard normal distribution.
For $ X \sim \N{0}{I}, \vec{\mu} \in \mathbb{R}^{n}, A = U\Lambda^{\nicefrac{1}{2}}U^{-1} \in \mathbb{R}^{n\times n} $ and $ Y = AX + \vec{\mu} $ we exploit
the fact that $ Y \sim \N{\vec{\mu}}{AIA^{\top}} = \N{\vec{\mu}}{\Sigma} $. Since the map $ \vec{x} \mapsto A\vec{x} + \vec{\mu} $ is invertible, we can obtain the
density of $ Y $ from the density of $ X $ through the change-of-variables formula for densities,
$ p_{Y}(\vec{y}) = p_{X}\left(A^{-1}(\vec{y} - \vec{\mu})\right) \left|\det A^{-1}\right| $.
\begin{align}
p(\vec{y})
&= p_{X}\left(A^{-1}(\vec{y} - \vec{\mu})\right) \left|\det A^{-1}\right| \\
&= \frac{1}{\sqrt{\left( 2\pi \right)^{n}} \, \left|\det A\right|}
\exp \left(-\frac{1}{2} \left( \vec{y} - \vec{\mu}\right)^{\top} A^{-1} A^{-1} \left(\vec{y} - \vec{\mu} \right) \right)
\label{eq:quadraticForm} \\
&= \frac{1}{\sqrt{\left( 2\pi \right)^{n}} \, \left|\det A\right|}
\exp \left(-\frac{1}{2} \left( \vec{y} - \vec{\mu}\right)^{\top} \Sigma^{-1} \left(\vec{y} - \vec{\mu} \right) \right) \\
&= \frac{1}{\sqrt{\left( 2\pi \right)^{n} |\Sigma|}}
\exp \left(-\frac{1}{2} \left( \vec{y} - \vec{\mu}\right)^{\top} \Sigma^{-1} \left(\vec{y} - \vec{\mu} \right) \right) \label{eq:mvGDensityDet}
\end{align}
Before we interpret this density, whose standard form is given in \eqref{eq:mvGDensityDet}, let us clarify the derivation. In
Equation~\eqref{eq:quadraticForm} we have plugged in the standard normal density from Equation~\eqref{eq:mvstandardNormal} and used
the fact that $ A $ is symmetric, so that $ \left(A^{-1}\right)^{\top} = A^{-1} $ and $ \left|\det A^{-1}\right| = 1/\left|\det A\right| $.
We then replaced $ A^{-1}A^{-1} = (AA)^{-1} $ with $ \Sigma^{-1} $. In the final line we have explicitly calculated the determinant in the normalizer.
\begin{align}
\det A &= \det\left( U \Lambda^{\nicefrac{1}{2}} U^{-1} \right) = \det(U) \det\left(\Lambda^{\nicefrac{1}{2}}\right) \det\left(U^{-1}\right) \label{eq:diagonality} \\
&= \det\left(\Lambda^{\nicefrac{1}{2}}\right) = \prod_{k=1}^{n} \Lambda_{kk}^{\nicefrac{1}{2}} = \sqrt{|\Lambda|} \label{eq:orthogonality}
\end{align}
In the above, line \eqref{eq:diagonality} follows from the multiplicativity of the determinant and line \eqref{eq:orthogonality} follows from the fact that
$ \det(U)\det(U^{-1}) = 1 $ and that $ \Lambda $ is diagonal. The normalizer thus contains a product of (square roots of) eigenvalues of a diagonal matrix, which is equal to the (square root of the)
determinant of that matrix. Since similar matrices have the same eigenvalues, this is the same as the determinant of $ \Sigma $. This completes our derivation of the multivariate Gaussian density.
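To make the formula concrete, here is a sketch that evaluates \eqref{eq:mvGDensityDet} directly and compares it against \texttt{dmvnorm} from the \texttt{mvtnorm} package; the point \texttt{y} is arbitrary, and \texttt{mu} and \texttt{Sigma} are the toy values from the sketches above.

\begin{verbatim}
# Sketch: evaluating the multivariate Gaussian density by hand.
library(mvtnorm)
y <- c(0.3, -0.5)
dens <- exp(-0.5 * t(y - mu) %*% solve(Sigma) %*% (y - mu)) /
  sqrt((2 * pi)^2 * det(Sigma))
all.equal(as.numeric(dens), dmvnorm(y, mean = mu, sigma = Sigma))  # TRUE
\end{verbatim}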
From the derivation, you probably already have some intuition of what's going on. Let us make this intuition more precise. A normal distribution with zero mean
and covariance matrix $ \sigma I $ for $ \sigma \in \mathbb{R} $ defines a ball in which most of the probability mass lies. Being a ball, this structure is perfectly
round, showing the same amount of spread in all directions \textbf{FIGURE A}. Since the mean is zero, the ball is centred at the origin. If we change the mean, we are
shifting the ball's centre away from the origin \textbf{FIGURE B}. We can also stretch or squash the ball along the coordinate axes by letting the elements on the diagonal of the
covariance matrix vary independently \textbf{FIGURE C}. Things become really interesting, though, when we use a full covariance matrix. Then we can define an
ellipsoid which contains most of the mass \textbf{FIGURE D}.

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}
\includegraphics[width=\maxwidth]{figure/multiGauss-1}

\end{knitrout}

How is all of this accomplished? By decomposing the covariance matrix, we have already seen that the covariance mostly depends on the eigenvalues of the covariance matrix.
In fact, since scaling is done by $ U\Lambda^{\nicefrac{1}{2}} U^{-1} $, it is the square roots of the eigenvalues that define the spread. They are the dimension-wise standard deviations.
The matrix $ \Lambda $ performs the same mapping as $ A $, only in eigenspace. As we have seen, this mapping is much simpler in eigenspace because $ \Lambda $ is
diagonal. The process of computing the multivariate Gaussian density can thus be broken down into three steps: map into eigenspace, apply the transformation given by
$ \Lambda $ and map back into the original space.
--------------------------------------------------------------------------------
/chapter7/chapter7.tex:
--------------------------------------------------------------------------------
\documentclass[a4paper,11pt,leqno]{report}\usepackage[]{graphicx}\usepackage[]{color}
%% maxwidth is the original width if it is less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
  \ifdim\Gin@nat@width>\linewidth
    \linewidth
  \else
    \Gin@nat@width
  \fi
}
\makeatother

\definecolor{fgcolor}{rgb}{0.345, 0.345, 0.345}
\newcommand{\hlnum}[1]{\textcolor[rgb]{0.686,0.059,0.569}{#1}}%
\newcommand{\hlstr}[1]{\textcolor[rgb]{0.192,0.494,0.8}{#1}}%
\newcommand{\hlcom}[1]{\textcolor[rgb]{0.678,0.584,0.686}{\textit{#1}}}%
\newcommand{\hlopt}[1]{\textcolor[rgb]{0,0,0}{#1}}%
\newcommand{\hlstd}[1]{\textcolor[rgb]{0.345,0.345,0.345}{#1}}%
\newcommand{\hlkwa}[1]{\textcolor[rgb]{0.161,0.373,0.58}{\textbf{#1}}}%
\newcommand{\hlkwb}[1]{\textcolor[rgb]{0.69,0.353,0.396}{#1}}%
\newcommand{\hlkwc}[1]{\textcolor[rgb]{0.333,0.667,0.333}{#1}}%
\newcommand{\hlkwd}[1]{\textcolor[rgb]{0.737,0.353,0.396}{\textbf{#1}}}%

\usepackage{framed}
\makeatletter
\newenvironment{kframe}{%
 \def\at@end@of@kframe{}%
 \ifinner\ifhmode%
  \def\at@end@of@kframe{\end{minipage}}%
  \begin{minipage}{\columnwidth}%
 \fi\fi%
 \def\FrameCommand##1{\hskip\@totalleftmargin \hskip-\fboxsep
 \colorbox{shadecolor}{##1}\hskip-\fboxsep
     % There is no \\@totalrightmargin, so:
     \hskip-\linewidth \hskip-\@totalleftmargin \hskip\columnwidth}%
 \MakeFramed {\advance\hsize-\width
   \@totalleftmargin\z@ \linewidth\hsize
   \@setminipage}}%
 {\par\unskip\endMakeFramed%
 \at@end@of@kframe}
\makeatother

\definecolor{shadecolor}{rgb}{.97, .97, .97}
\definecolor{messagecolor}{rgb}{0, 0, 0}
\definecolor{warningcolor}{rgb}{1, 0, 1}
\definecolor{errorcolor}{rgb}{1, 0, 0}
\newenvironment{knitrout}{}{} % an empty environment to be redefined in TeX

\usepackage{alltt}

\usepackage{amsmath, amssymb, mdframed, caption, subcaption, graphicx, enumitem}
\usepackage{nicefrac}

\usepackage{hyperref}
\hypersetup{colorlinks=true, urlcolor=blue, breaklinks=true}

\newmdtheoremenv{Definition}{Definition}[chapter]
\newmdtheoremenv{Exercise}[Definition]{Exercise}
\newmdtheoremenv{Theorem}[Definition]{Theorem}
\newmdtheoremenv{Lemma}[Definition]{Lemma}

\newcommand{\supp}{\operatorname{supp}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\eps}{\varepsilon}

\DeclareSymbolFont{extraup}{U}{zavm}{m}{n}
\DeclareMathSymbol{\varheart}{\mathalpha}{extraup}{86}
\DeclareMathSymbol{\vardiamond}{\mathalpha}{extraup}{87}


\newcommand{\philip}[1]{ \textcolor{red}{\textbf{Philip:} #1}}
\newcommand{\chris}[1]{ \textcolor{blue}{\textbf{Chris:} #1}}

\title{Basic Probability}
\date{}
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\begin{document}

\setcounter{chapter}{6}

\chapter{Basics of Information Theory}

When we talk about \textit{information}, we often use the term in a qualitative sense. We say things like
\textit{This is valuable information} or
\textit{We have a lack of information}. We can also make statements about some information being more helpful than other information. For a long time, however,
people were unable to quantify information. The person who succeeded in this endeavour was \href{https://en.wikipedia.org/wiki/Claude_Shannon}{Claude E. Shannon}
who with his famous 1948 article \textit{A Mathematical Theory of Communication} single-handedly created a new discipline: Information Theory! He also revolutionised
digital communication and can be seen as one of the main contributors to our modern communication systems like the telephone, the internet etc.

The beauty of information theory is that it is based on probability theory, and many results from probability theory seamlessly carry over to information theory.
In this chapter, we are going to discuss the bare basics of information theory. These basics are often enough to understand many information-theoretic arguments
that researchers make in fields like computer science, psychology and linguistics.

\section{Surprisal and Entropy}
Shannon's idea of information is as simple as it is compelling. The amount of \emph{surprisal} of an event $E$ grows with the inverse probability $1/P(E)$. Intuitively, rare events (where $P(E)$ is small) are more surprising than those occurring with high probability (where $P(E)$ is high). If we are observing a realisation of a random variable, this realisation is surprising if it is unlikely to occur according to the distribution of that random variable. However, if the probability for the realisation is very low, then on average it does not occur very often, meaning that if we sample from the RV repeatedly, we are not surprised very often. We are not surprised when the probability mass of the distribution is concentrated on only a small subset of its support.

On the other hand, we are surprised quite often if we cannot predict what the outcome of our next draw from the RV might be. We are surprised when the distribution over values of the RV is (close to) uniform. Thus, we are going to be most surprised on average if we are observing realisations of a uniformly distributed RV.
Shannon's idea was that observing RVs that cause a lot of surprises is informative because we cannot predict the outcomes and with each new outcome we have effectively learned something (namely that the $ i^{th} $ outcome took on the value that it did). Observing RVs with very concentrated distributions is not very informative under this conception because by just choosing the most probable outcome we can correctly predict most actually observed outcomes. Obviously, if I manage to predict an outcome beforehand, its occurrence is not teaching me anything.

The goal of Shannon was to find a function that captures this intuitive idea. He eventually found it and showed that it is (up to the choice of units) the only function with properties that capture this intuition. This function is called the \textbf{entropy} of a RV and it is simply the expected \textbf{surprisal} value, expressed in bits.

\begin{Definition}[Surprisal]
The surprisal (value) of an outcome $ x \in \supp(X) $ of some RV $ X
$ is defined as $ -\log_{2}(P(X=x)) = \log_2(\frac{1}{P(X=x)})$.
\end{Definition}

Notice that we are using the logarithm of base 2 here. This is because surprisal and entropy are standardly measured in bits. Intuitively, the surprisal measures how many bits one needs to encode an observed outcome given that one knows the distribution underlying that outcome. Check \href{http://www.umsl.edu/~fraundorfp/egsurpriNOLOGS.html}{this website} to get a feeling for surprisal values measured in bits.

\begin{Definition}[Entropy]
The entropy $H(P_X)$ of a RV $ X $ with distribution $P_X$ is defined as
$$H(P_X) := \E[-\log_{2}(P(X=x))] = - \!\! \sum_{x \in \supp(X)} P(X=x) \log_2(P(X=x)) \, .$$
For the ease of notation, we often write $H(X)$ instead of $H(P_X)$.
\end{Definition}

The notational convenience of writing $H(X)$ instead of $H(P_X)$ can be confusing, because entropy is really assigning a (non-negative) real number to a distribution, i.e.\ $H(X)$ is {\bf not a function} of the random variable $X$ and it is {\bf not a random variable} either! Formally, for any random variable $X$ with distribution $P_X$ over the set $\mathcal{X}=\supp(X)$ (which might be categorical, i.e.\ $X$ could for instance take on values ``blue'', ``red'' and ``green''), we consider the surprisal function (in bits) $f(x) := -\log_2(P(X=x))$ mapping elements $x \in \mathcal{X}$ to real numbers $f(x) \in \mathbb{R}$. In that case, the surprisal $f(X)$ is a random variable over the reals and its expected value is well defined and called entropy $H(X) = H(P_X) := \E_X[f(X)]$.

As an example, we consider the categorical random variable $X$ with distribution $P(X=\varheart)=P(X=\clubsuit)=1/4, P(X=\spadesuit)=1/2$. In that case, $\supp(X) = \{\varheart, \clubsuit, \spadesuit \}$ and surprisal values in bits are $f(\varheart)=f(\clubsuit)=\log_2(4)=2, f(\spadesuit)=\log_2(2)=1$. The entropy is the expected surprisal value, i.e.\ the individual surprisal values weighted with their corresponding probabilities of occurring: $H(X) = \E_X[f(X)] = \frac{1}{4} \cdot 2 + \frac{1}{4} \cdot 2 + \frac{1}{2} \cdot 1 = 3/2$.

The entropy ``does not care'' about the actual outcomes or labels of a random variable, but only about the distribution! In fact, not even the order of the actual probabilities matters, as we are taking an expected value and the additive terms commute. You can verify that the calculation of $H(X)=3/2$ in the example above applies to all random variables $X$ with distribution $(1/2, 1/4, 1/4)$, no matter what the actual outcomes are.
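One way to do that verification is with a one-line R function (a sketch of ours, not one of the chapter's generated code chunks; the helper name \texttt{H} is our choice):

\begin{verbatim}
# Sketch: entropy in bits of a discrete distribution given as a vector.
H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))  # treats 0 log 0 as 0
H(c(1/4, 1/4, 1/2))   # 1.5, the card-suit example above
H(c(1/2, 1/4, 1/4))   # same value: order and labels do not matter
\end{verbatim}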
\begin{Exercise}
Compute the entropy of $Y \sim Binomial(n=2,p=1/2)$.
\end{Exercise}

The simplest and simultaneously most important example of entropy is given in Figure~\ref{fig:binaryEntropy} which shows the entropy of the Bernoulli distribution as a function of the parameter $ \theta \in [0,1]$. The entropy function of the Bernoulli is often called the \textbf{binary entropy} $h(\theta) := -\theta \cdot \log_2(\theta) - (1-\theta) \log_2(1-\theta)$. It measures the information of a binary decision, like a coin flip or an answer to a yes/no-question.
The entropy of the Bernoulli attains its maximum of 1 bit when the distribution is uniform, i.e.\ when both choices are equally
probable. The entropy is 0 if and only if the coin is fully biased towards heads or tails. As explained above, the entropy of the distributions $(\theta, 1-\theta)$ and $(1-\theta,\theta)$ is the same and therefore $h(\theta)=h(1-\theta)$ and the graph is symmetric around $1/2$.

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{figure}[t!]

{\centering \includegraphics[width=\maxwidth]{figure/binaryEntropy-1}

}

\caption[Binary entropy function]{Binary entropy function}\label{fig:binaryEntropy}
\end{figure}


\end{knitrout}

\medskip
From the plot it is also easy to see that entropy is never negative. It holds in general that entropy is non-negative,
because entropy is defined as the expectation of surprisal and surprisal is the negative logarithm of probabilities.
Because $ \log(x) \leq 0 $ for $ x \in (0,1] $, it is clear that $ -\log(x) \geq 0 $ for $ x $ in the same
interval. Notice that from here on we drop the subscript and by convention let $ \log = \log_{2} $.

A standard interpretation of the entropy is that it quantifies uncertainty. As we have pointed out before, a uniform distribution means that you are most uncertain and indeed the uniform distribution maximizes the entropy. However, the more choices you have to pick from uniformly, the more uncertain you are going to be. The entropy function also captures this intuition. Notice that if a discrete distribution is uniform, all probabilities are $ \frac{1}{|\supp(X)|} $. Clearly, as we increase $ |\supp(X)| $, we decrease the probabilities. By decreasing the probabilities, we increase their negative logarithms, and hence the average surprisal. Let us make this intuition more formal.

\begin{Theorem}
A discrete RV $ X $ with uniform distribution and support of size $ n $ has entropy
$ H(X) = \log(n) $.
\end{Theorem}

\paragraph{Proof:}
\begin{align}
H(X) &= \underset{x \in \supp(X)}{\sum}-\log(P(X=x))P(X=x) \\
&= \underset{x \in \supp(X)}{\sum} -\log\left(\frac{1}{|\supp(X)|}\right)P(X=x) \\
&= \underset{x \in \supp(X)}{\sum}\log(n)P(X=x) = \log(n) \, .
\hspace{1cm} \square
\end{align}
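Reusing the \texttt{H} sketch from above, this is easy to check numerically for, say, a fair die:

\begin{verbatim}
# Sketch: a uniform distribution over n outcomes has entropy log2(n).
n <- 6
all.equal(H(rep(1/n, n)), log2(n))   # TRUE: about 2.58 bits for a fair die
\end{verbatim}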
\begin{Exercise}
You are trying to learn chess and you start by studying where chess grandmasters move their king when it
is positioned on one of the central squares of the board. The king can move to any of the 8 adjoining squares. Since
you do not know a thing about chess yet, you assume that each move is equally probable. In this situation,
what is the entropy of moving the king?
\end{Exercise}

One of the first important results in information theory is \href{https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem}{Shannon's source-coding theorem} which states that the entropy $H(X)$ of a random variable $X$ measures how many bits one will need on average to encode an outcome that is generated by the distribution $ P_{X} $.
This result applies to the real-world problem of data compression. Assume that $N$ data points are drawn iid from the distribution $P_X$. In that case, the source-coding theorem tells us that on average, we will need $N \cdot H(X)$ bits to store the (optimally compressed) data. For example, let $P_X$ be the $Bernoulli(\theta)$ distribution over bits. In the case $\theta=1/2$, we have $N$ perfectly random bits which cannot be compressed, and hence we need $N \cdot H(X) = N \cdot h(\theta) = N \cdot h(1/2) = N$ bits of storage. For the general case $\theta \neq 1/2$ when the individual bits are biased, the graph of the binary entropy $h(\theta)$ in Figure~\ref{fig:binaryEntropy} tells us exactly what the compression ratio will be. We will not cover the proof of the source-coding theorem here, but refer to the literature instead.
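As a quick numerical illustration (our sketch, building on \texttt{H} from above; the bias and the number of bits are arbitrary choices):

\begin{verbatim}
# Sketch: expected storage for N biased bits per the source-coding theorem.
h <- function(theta) H(c(theta, 1 - theta))   # binary entropy
N <- 1e6
N * h(0.5)   # 1e6: fair bits are incompressible
N * h(0.1)   # ~469000: biased bits compress to less than half
\end{verbatim}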
\section{Conditional Entropy}
At the outset of this chapter we promised you that you could easily transfer results from probability
theory to information theory. We will not be able to show any kind of linearity for entropy because it contains
log-terms and the logarithm is not linear. We can, however, find alternative expressions for joint entropy (where
the joint entropy is simply the entropy of a joint RV). Before we do so, let us also define the notion of
conditional entropy. We have seen in Section~\ref{sec:jointconditionaldistributions} that $P_{X|Y=y}$ is a valid probability distribution for any $y \in \supp(Y)$ such that $P(Y=y)>0$. Hence, we can also define its entropy, the conditional entropy.

\begin{Definition}[Conditional Entropy]
For two jointly distributed RVs $ X,Y $ and $y \in \supp(Y)$ such that $P(Y=y)>0$, the conditional entropy of $ X $ given that $ Y=y $ is defined as
\begin{align*}
H(X | Y=y) &:= \E_{X|Y=y}[-\log_{2}(P(X=x | Y=y))] \\
&= - \!\! \sum_{x \in \supp(X)} P(X=x | Y=y) \log_2(P(X=x | Y=y))\, .
\end{align*}
The conditional entropy of $X$ given $Y$ is defined as
$$ H(X | Y) := \E_Y[ H(X | Y=y) ] = \sum_{y \in \supp(Y)} P(Y=y) H(X | Y=y) \, .$$
\end{Definition}

Intuitively, $H(X | Y)$ is the (average) uncertainty of $X$ after learning $Y$. Indeed, learning $Y$ (and in fact any information) cannot increase your uncertainty about $X$. Formally, one can prove the following
\begin{Lemma}[see e.g.\ Proposition~4 of \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/2016/notes/CramerFehr.pdf}{this script}] \label{lemma:noincrease}
For any two random variables $X,Y$ with joint distribution $P_{XY}$, it holds that $H(X | Y) \leq H(X)$.
\end{Lemma}
Note, however, that this non-increase of uncertainty only holds on average, as illustrated by the following example:

\paragraph{Example}
Consider the binary random variables $X$ and $Y$, with joint distribution
\begin{align*}
&P(X=0,Y=0) = \frac{1}{2}, \quad P(X=0,Y=1) = \frac{1}{4}\\
&P(X=1,Y=0) = 0, \quad P(X=1,Y=1) = \frac{1}{4}.
\end{align*}
By marginalization, we find that $P(X=0) = \frac{3}{4}$ and $P(X=1) = \frac{1}{4}$, while $P(Y=0) = P(Y=1) = \frac{1}{2}$. This allows us to make the following computations:
\begin{align*}
H(X,Y) &= \frac{1}{2}\log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2}\\
H(X) &= h\left(\frac{1}{4}\right) = h\left(\frac{3}{4}\right) \approx 0.81\\
H(Y) &= h\left(\frac{1}{2}\right) = 1\\
H(X|Y) &= P(Y=0) \cdot H(X | Y=0) + P(Y=1) \cdot H(X | Y=1)\\
&= \frac{1}{2} \cdot 0 + \frac12 \cdot 1 = \frac12 \\
H(Y|X) &= P(X=0) \cdot H(Y | X=0) + P(X=1) \cdot H(Y | X=1)\\
&= \frac{3}{4} \cdot h\left(\frac{1}{3} \right) + \frac{1}{4} \cdot 0 \approx 0.69
\end{align*}
% We also could have computed $H(X|Y)$ and $H(Y|X)$ directly through the definition of conditional entropy.
Note that for this specific distribution, learning the outcome $Y=1$ increases the uncertainty about $X$, $H(X|Y=1) > H(X)$, but on average, we always have $H(X|Y) \leq H(X)$. It is important to remember that Lemma~\ref{lemma:noincrease} only holds on average, not for specific values of $Y$. Note also that in this example, $H(X|Y) \neq H(Y|X)$.

It is not a coincidence that the joint entropy $H(X,Y)$ in the example above is equal to $H(X|Y)+H(Y)$ and $H(Y|X)+H(X)$. One can prove this chain rule in general:

\begin{align*}
H(X,Y) &= \underset{\substack{x \in \supp(X)\\y \in \supp(Y)}}{\sum} -\log(P(X=x,Y=y)) \times P(X=x, Y=y) \\
\begin{split}
&= \underset{\substack{x \in \supp(X)\\ y \in \supp(Y)}}{\sum} -\log(P(X=x \mid Y=y)) \times P(X=x,Y=y) \\
&\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times \sum_{x \in \supp(X)} P(X=x,Y=y)
\end{split} \\
\begin{split}
&=\sum_{y \in \supp(Y)} P(Y=y) \times \sum_{x \in \supp(X)} -\log(P(X=x \mid Y=y)) \times P(X=x \mid Y=y) \\ &\qquad - \underset{y \in \supp(Y)}{\sum}\log(P(Y=y)) \times P(Y=y)
\end{split} \\
&= H(X | Y) + H(Y) \; .
\end{align*}

\begin{Exercise}
Prove that $ H(X,Y | Z) = H(X | Z) + H(Y | Z) $ if $ X \bot Y \mid Z $.
\end{Exercise}
As a corollary, we get that $H(X,Y)=H(X)+H(Y)$ for independent random variables $X$ and $Y$. More generally, the entropy of $n$ independent random variables is $H(X_1^n) = \sum_{i=1}^n H(X_i)$.
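The example and the chain rule are also easy to check numerically. Here is a sketch (ours, reusing \texttt{H} from above) that works directly on the joint probability table:

\begin{verbatim}
# Sketch: joint and conditional entropy from a joint table.
# Rows index x in {0,1}, columns index y in {0,1}.
P <- matrix(c(1/2, 0,      # column y = 0
              1/4, 1/4),   # column y = 1
            nrow = 2)
Hxy  <- H(as.vector(P))    # joint entropy H(X,Y) = 1.5
Py   <- colSums(P)         # marginal distribution of Y
HxGy <- sum(Py * apply(P, 2, function(col) H(col / sum(col))))  # H(X|Y) = 0.5
all.equal(Hxy, HxGy + H(Py))   # chain rule: H(X,Y) = H(X|Y) + H(Y)
\end{verbatim}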
\section{An Information-Theoretic View on EM}
Now that we have seen some information-theoretic concepts, you may be happy to hear that there is an information-theoretic interpretation
of EM. This interpretation helps us to get a better intuition for the algorithm. To formulate that interpretation we need
one more concept, however.

\begin{Definition}[Relative Entropy]
The relative entropy of RVs \\ $ X,Y $ with distributions $P_X, P_Y$ and $\supp(X) \subseteq \supp(Y) $ is defined as
$$ D(P_X||P_Y) := \sum_{x \in \supp(X)} P(X=x) \log \frac{P(X=x)}{P(Y=x)} \ . $$
If $ P(Y=x) = 0 $ for any $ x \in \supp(X) $ we define $ D(P_X||P_Y) = \infty $. As with entropy, we often abbreviate $D(P_X||P_Y)$ with $D(X||Y)$.
\end{Definition}

The relative entropy is commonly known as \textbf{Kullback-Leibler (KL)} divergence. Intuitively,
it gives a measure of how ``far away'' $ P_{X} $ is from $ P_{Y} $: it is the average number of extra bits we need when encoding outcomes of $ X $
with a code that is optimal for $ P_{Y} $ instead of one that is optimal for $ P_{X} $.
This quantity is small if both distributions place most
of their mass on the same outcomes. Since $ \log(1) = 0 $ the relative entropy is 0 if $ P_{X} = P_{Y} $.

It is worthwhile to point out the difference between relative and conditional entropy. Conditional entropy is the average entropy of $ X $ given that you
know what value $ Y $ takes on. In the case of relative entropy you do not know the value of $ Y $, only its distribution.

\begin{Exercise}
Show that $ D(X,Y||Y) = H(X | Y) $. Furthermore show that $ D(X,Y||Y) = H(X) $ if $ X\bot Y $.
\end{Exercise}


Let us start by remembering why we need EM. We have a model that defines a joint distribution
over observed ($ x $) and latent data ($ z $). Such a model generally looks as follows:
\begin{equation}
P(X=x, Z=z \mid \Theta = \theta) = P(X=x \mid Z=z, \Theta=\theta) P(Z=z \mid \Theta = \theta)
\end{equation}
where we have chosen a factorization that provides a separate term for a distribution over only the
latent data.

Recall that the goal of the EM algorithm is to iteratively increase the likelihood through consecutive
updates of parameter estimates. These updates are achieved through maximum-likelihood estimation based
on expected sufficient statistics. We are now going to show a) that EM computes a lower bound on the
marginal log-likelihood of the data in each iteration and b) that this lower bound becomes tight when the
expected sufficient statistics are taken with respect to the model posterior. The latter implies that
EM performs the optimal update in each iteration.

Let us start by expanding the data log-likelihood and then lower-bounding it.
\begin{align}
&\log(P(X=x \mid \Theta=\theta)) = \log\left(\sum_z P(X=x, Z=z \mid \Theta = \theta)\right) \\
&= \log\left(\sum_{z} Q(Z=z \mid \Phi=\phi)\frac{P(X=x, Z=z \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) \\
&\geq \sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(X=x, Z=z \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right)
\label{eq:ELBO1}
\end{align}
Here, we have used \href{https://en.wikipedia.org/wiki/Jensen\%27s_inequality}{Jensen's Inequality} to
derive the lower bound. Observe that the log is indeed a concave function.

We have also introduced
an auxiliary distribution $ Q $ over the latent variables with parameters $ \phi $.
For reasons that we will explain shortly,
this distribution is often called the \textbf{variational distribution} and its parameters the
\textbf{variational parameters}. The letter $ Q $ is slightly non-standard to denote distributions but
we are following conventions from the field of \textbf{variational inference} here.
In the next step, we factorise the model distribution in order to recover a KL divergence term between
the variational distribution and the model posterior over latent variables.
\begin{align}
&\sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(X=x, Z=z \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) \\
&= \sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(Z=z \mid X=x, \Theta = \theta)P(X=x \mid \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) \\
&= \sum_{z} Q(Z=z \mid \Phi=\phi) \log\left(\frac{P(Z=z \mid X=x, \Theta = \theta)}{Q(Z=z \mid \Phi=\phi)}\right) + \log(P(X=x \mid \Theta=\theta)) \\
&= -D(Q||P) + \log(P(X=x \mid \Theta=\theta)) \label{eq:ELBO2}
\end{align}
Equation~\eqref{eq:ELBO2} gives us two insights. First, it quantifies the gap between the lower bound
and the actual log-likelihood of the data. This gap is equal to the KL divergence between the variational distribution
and the model posterior over latent variables. Second, since the KL divergence is never negative, the bound only becomes
tight when $ Q=P $. But this is exactly what is happening in the E-step! The E-step sets $ Q=P $ and
then computes expectations under that distribution (see Equation~\eqref{eq:ELBO1}). Thus, the E-step increases
the lower bound on the marginal log-likelihood.

Looking back at Equation~\eqref{eq:ELBO1}, we also see that the M-step increases the lower bound because
it maximises $ \E_{Q}\left[\log P(X=x, Z=z\mid \Theta = \theta)\right] $. Both steps therefore
increase the lower bound on the log-likelihood, and we conclude that EM increases the data likelihood
in every iteration (or leaves it unchanged at worst).

We will finish with a quick remark on variational inference. EM is a special case of variational inference.
Variational inference is any inference procedure which uses an auxiliary distribution $ Q $ to compute
a lower bound on the likelihood. In the general setting, the auxiliary distribution can be different from the
model posterior. This means that the bound may never become tight. However, in models in which the exact posterior
is hard (read: impossible) to compute, using a non-tight lower bound instead can be incredibly useful!

The reason this inference procedure is called \textit{variational} is because it is based on the
\href{https://en.wikipedia.org/wiki/Calculus_of_variations}{calculus of variations}. This works mostly
like normal calculus except that standard operations like differentiation are done with respect to functions
instead of variables.
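To see the bound and its gap in action, here is a toy sketch of ours for a single observation under a made-up two-component mixture. The gap equals $ D(Q||P) $ and vanishes exactly when $ Q $ is the model posterior; we use natural logarithms here, which does not affect where the bound is tight.

\begin{verbatim}
# Sketch: the lower bound and its gap for one observation x, latent z in {1,2}.
prior <- c(0.3, 0.7)          # P(Z = z | theta), made-up numbers
lik   <- c(0.05, 0.20)        # P(X = x | Z = z, theta) for the observed x
joint <- prior * lik          # P(X = x, Z = z | theta)
loglik <- log(sum(joint))     # marginal log-likelihood
post   <- joint / sum(joint)  # model posterior P(Z = z | X = x, theta)
elbo <- function(Q) sum(Q * log(joint / Q))
loglik - elbo(c(0.5, 0.5))    # positive gap: D(Q || posterior)
loglik - elbo(post)           # zero: the bound is tight at Q = posterior
\end{verbatim}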
%Naively, we could take the expectation with respect to any distribution
%over latent values. Obviously, we would like to find the best one, i.e. the one that is closest to the
%actual posterior. We can formalize this by introducing an auxiliary distribution\footnote{We follow
%standard notation here by denoting the auxiliary distribution $ Q $ instead of $ P $. Also, the
%parameter variable is chosen so as to distinguish it from the parameter variable of our model.}
%$ Q(z\mid\Phi=\phi) $ under
%which we compute the expected sufficient statistics. We want to find the auxiliary distribution that
%is closest to actual posterior $ P_{Z\midX=x,\Theta=\theta} $. We measure closeness in an information-theoretic
%sense using KL-divergence. Formally, our goal is to find
%\begin{equation}
%Q^{*}_{Z\mid\Phi=\phi} = \underset{Q_{Z\mid\Phi=\phi}}{\mbox{arg min}}~D\left( Q_{Z\mid\Phi=\phi} || P_{Z \mid X=x,\Theta=\theta} \right) \ .
%\end{equation}



\section*{Further Material}

At the ILLC, there is a whole course about information theory, \href{http://homepages.cwi.nl/~schaffne/courses/inftheory/}{currently taught by Christian Schaffner}. David MacKay also offers \href{http://www.inference.phy.cam.ac.uk/itprnn/book.pdf}{a free book on the subject}. Finally,
Coursera also offers \href{https://www.coursera.org/course/informationtheory}{an online course on information theory}.

The information-theoretic formulation of EM was pioneered in this \href{http://www.cs.toronto.edu/~fritz/absps/emk.pdf}{paper}. A very recent and intelligible
\href{https://arxiv.org/abs/1601.00670}{tutorial on variational inference} can be found on the arXiv.

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter7"
%%% End:
--------------------------------------------------------------------------------
/chapter2/chapter2_forInclude.tex:
--------------------------------------------------------------------------------
\chapter{Axiomatic Probability Theory}

\section{Axioms of Probability}
In the previous chapter, we have introduced sample spaces and event spaces. We would like to be able
to express that certain events are more (or less) likely than others.
Therefore, we are going to measure the probability of events in a mathematically precise sense.

\begin{Definition}[Finite Measure]\label{axioms}
A \emph{finite measure} is a function $ \mu: \mathcal{S} \rightarrow \mathbb{R} : S \mapsto \mu (S) $
that maps elements
from a countable set of sets $ \mathcal{S} $ (formally a \href{http://en.wikipedia.org/wiki/Sigma-algebra}
{$ \sigma $-algebra}) to real numbers. Such a measure has the following properties:
\begin{enumerate}
\item $ \mu(S) \in \mathbb{R} $ for $ S \in \mathcal{S} \, ,$
\item $ \mu\left( \underset{i = 1}{\overset{\infty}{\bigcup}} S_{i} \right)
= \underset{i = 1}{\overset{\infty}{\sum}} \mu \left( S_{i} \right) $ for disjoint sets $S_1, S_2, \ldots$ \, . \label{countableAdditivty}
\end{enumerate}
\end{Definition}

Notice that we are restricting ourselves to finite measures here, i.e. the value of the measure can never
be infinite. This restriction makes sense as probabilities are finite as well. Property \ref{countableAdditivty} is
known as \emph{countable additivity}.

Let
$ S = \underset{i=1}{\overset{n}{\bigcup}} S_{i} $ for some positive natural number $ n $ and disjoint
sets $ S_{i} $, and set $ S_{j} =
\emptyset $ for $ j > n $. By
countable additivity, we then get
\begin{equation}
\mu(S) = \mu(\underset{i=1}{\overset{\infty}{\bigcup}} S_{i}) = \mu \left( \underset{i=1}{\overset{n}{\bigcup}} S_{i} \cup
\underset{j=n+1}{\overset{\infty}{\bigcup}} \emptyset \right)
= \underset{i=1}{\overset{n}{\sum}} \mu ( S_{i} )
+ \underset{j=n+1}{\overset{\infty}{\sum}} \mu ({\emptyset})
\end{equation}

Since $ \mu(S) $ is finite, the infinite sum $ \underset{j=n+1}{\overset{\infty}{\sum}} \mu(\emptyset) $ must be finite as well. This is only possible if
$ \mu(\emptyset) = 0 $, and consequently $ \mu(S) = \underset{i=1}{\overset{n}{\sum}} \mu (S_{i}) $. We conclude that the empty set has measure $ 0 $ for all measures.
Furthermore, we also see from the above
derivation that countable additivity implies finite additivity, i.e.
$ \mu(S) = \underset{i=1}{\overset{n}{\sum}} \mu(S_{i}) $ for finite positive $ n $ (again, this only
holds if the $ S_{i} $ are disjoint).

Examples of measures are not hard to find. In fact, we have already seen a measure,
namely the function $ |\cdot| $ that counts the elements of a set (check yourself that it really is a
measure). Another measure is the Dirac-measure that is related to the characteristic
function of a set. While the characteristic function tells you whether any object belongs to a given set,
the Dirac-measure tells you whether any set contains a given object. Let us call the object in question
$ a $. Then its Dirac measure is defined by $ \delta_{a}(S) = 1 $ iff $ a \in S $ and $ \delta_{a}(S) = 0 $ otherwise (check yourself that the Dirac-measure indeed is a measure).

Apart from these examples, there is one measure, however, that is going to be the star of the rest of this
script, namely the \textbf{probability measure}.

\begin{Definition}[Probability measure]\label{def:probmeasure}
A probability measure \\ $ \mathbb{P}: \mathcal{A} \rightarrow \mathbb{R}, A \mapsto \mathbb{P}(A) $
on an event space $ \mathcal{A} $ associated with a sample space $ \Omega $ has the
following properties:
\begin{enumerate}
\item $ \mathbb{P}(A) \geq 0 $ for all $ A \in \mathcal{A} \,$,
\item $ \mathbb{P}\left( \underset{i = 1}{\overset{\infty}{\bigcup}} A_{i} \right)
= \underset{i = 1}{\overset{\infty}{\sum}} \mathbb{P} \left( A_{i} \right) \,$ for disjoint events $A_1,A_2,\ldots$ \, , \label{union}
\item $ \mathbb{P}(\Omega) = 1 \,$. \label{unity}
\end{enumerate}
\end{Definition}

Notice that we only added Property~\ref{unity} to the general definition of a measure. Hence, a
\textbf{probability} (the value that the probability measure assigns to an event) will always lie in the real interval
$[0,1]$. The above three axioms for a probability measure are often referred to as \emph{axioms of probability}
or \emph{Kolmogorov axioms} after their inventor \href{https://en.wikipedia.org/wiki/Andrey_Kolmogorov}{Andrey
Kolmogorov}.

We have already discussed uniform probabilities in the previous chapter. We can now formally explain
what we meant by that. The uniform probability measure $ \mathbb{P} $ has the property that
$ \mathbb{P}(\{\omega\}) = \frac{1}{|\Omega|} $ for all $ \omega \in \Omega $. At this point, the
distinction between sample and event spaces becomes important. We cannot measure the elements of a
sample space, only the elements of an event space! Recall our convention that we will always assume
that $ \mathcal{A} = \mathcal{P}(\Omega) $ which obviously contains a singleton for each element in
$ \Omega $. Using this assumption, the uniform probability measure is indeed well-defined. Whenever we talk about
\textit{uniform probability}, we either mean the uniform probability measure or, more often, the real
value $ \frac{1}{|\Omega|} $ that this measure assigns to every singleton event.

In order to create a tight relationship between a sample space, an event space and a probability measure,
we introduce the concept of a \textbf{probability space}. Probability spaces are also known as
\textbf{(probabilistic) experiments}.
\begin{Definition}[Probability space] \label{def:ProbabilitySpace}
A probability space is a triple $ (\Omega, \mathcal{A}, \mathbb{P}) $, consisting of a sample space $ \Omega $,
an event space $ \mathcal{A} $ and a probability measure $ \mathbb{P} $.
\end{Definition}

If we roll a die, for example, we have the sample space $ \Omega = \{1,2,3,4,5,6\} $ and, by
convention, the event space $ \mathcal{A} = \mathcal{P}(\Omega) $. If we add the uniform probability measure,
we have constructed a \emph{probabilistic experiment}. We can use it to answer a couple of questions. For example, we
might wonder about the probability of obtaining an even number. By Property~\ref{union} of our definition, this
probability is given by
\begin{align}
\mathbb{P}(\{2,4,6\}) &= \mathbb{P}(\{2\} \cup \{4\} \cup \{6\}) \\
&= \mathbb{P}(\{2\}) + \mathbb{P}(\{4\})
+ \mathbb{P}(\{6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}
\end{align}

Notice that this calculation is rather cumbersome. After all, we might just have evaluated
$ \mathbb{P}(\{2,4,6\}) $ directly. This is possible because by convention we have $ \mathcal{A} = \mathcal{P}(\Omega) $ which certainly contains $ \{2,4,6\} $.
Since the probability measure is defined on $ \mathcal{A} $, it must map $ \{2,4,6\} $ to some real number.
However, the above calculation points to an interesting fact. In order
to fully specify a probability measure, it suffices to specify the measure on the singleton sets of the
event space. By countable additivity, this assignment already specifies the measure on the entire event space, as we can
construct any event as a countable union of singletons.

It is important to point out that we just chose the uniform probability measure as the one that seems ``natural'' for
a die roll. However, nobody is forcing us to do so. In fact, Definition~\ref{def:ProbabilitySpace} allows us to impose arbitrary probability measures.

\begin{Exercise}
Let us consider a rigged die. Take $ (\Omega, \mathcal{A}, \mathbb{P}) $ with $ \Omega $ and $ \mathcal{A} = \mathcal{P}(\Omega) $
as in the uniform die-roll example before, but use the
probability measure specified by \\ $ \mathbb{P} = \{(\{1\},0), (\{2\}, \frac{1}{12}), (\{3\}, \frac{1}{6}), (\{4\}, \frac{1}{6}), (\{5\}, \frac{1}{3}),
(\{6\},\frac{1}{4}) \} $.
\begin{enumerate}
\item Verify that $ \mathbb{P} $ is indeed a probability measure.
\item Compute the probability of obtaining a number strictly smaller than $ 5 $ in this experiment.
\end{enumerate}
\end{Exercise}
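Since specifying a measure on singletons determines it everywhere, a finite probability space is easy to play with on a computer. Here is a small R sketch of ours mirroring the fair-die example above (the function name \texttt{P} is our choice):

\begin{verbatim}
# Sketch: the fair-die probability space; the measure is specified on
# singletons and extended to arbitrary events by additivity.
Omega <- 1:6
p.singleton <- rep(1/6, 6)                    # uniform measure on outcomes
P <- function(A) sum(p.singleton[Omega %in% A])
P(c(2, 4, 6))                                 # probability of an even roll: 0.5
\end{verbatim}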
\begin{figure}
\center
\begin{subfigure}{.4\textwidth}
\begin{venndiagram2sets}[labelA=$ E_{1} $, labelB= $ E_{2} $, labelAB= $ E_{3} $, shade=red!40]
\fillACapB
\end{venndiagram2sets}
\caption{}
\label{Venn2}
\end{subfigure}
~
\begin{subfigure}{.4\textwidth}
\begin{venndiagram3sets}[labelA=$ E_{1} $, labelB=$ E_{2} $, labelC=$ E_{3} $, labelOnlyAB=$ - $,
labelOnlyBC=$ - $, labelOnlyAC=$ - $, labelABC=$ + $, shade=red!40]
\fillACapB
\fillACapC
\fillBCapC
\end{venndiagram3sets}
\caption{}
\label{Venn3}
\end{subfigure}
\caption{\ref{Venn2}: Two overlapping events $ E_{1} $ and $ E_{2} $. Their intersection
(the coloured region) gets counted twice if we add up their probabilities. \\
\ref{Venn3}: Venn diagram with 3 events. First we deduct
$ E_{1} \cap E_{2}, E_{1} \cap E_{3}, E_{2} \cap E_{3} $ in order to prevent double counting and then
we add in $ E_{1} \cap E_{2} \cap E_{3} $. Deductions and additions are indicated by minus and plus signs.}
\end{figure}

We have seen how to compute probabilities of events if they can be formed as unions of \textit{disjoint}
events. The natural question to ask is what to do if we want to compute the probability of the \emph{union
of non-disjoint events}. In order to reason about this problem, we first take a step back and think about the
outcomes of our probability space. We know that each event with non-zero probability contains at least one outcome (since
$ \mathbb{P}(\emptyset) = 0 $, we can safely ignore the empty event). Let us assume that we take the union of events
$ E_{1} $ and $ E_{2} $ with $ E_{1} \cap E_{2} = E_{3} \not = \emptyset $. This means that the outcomes
in $ E_{3} $ are contained in both $ E_{1} $ and $ E_{2} $. This situation is illustrated in Figure~\ref{Venn2}. If we were to simply add up the probabilities of $ E_{1} $ and $ E_{2} $, we
would effectively count the contribution of the outcomes in $ E_{3} $ twice. We would hence
get an overestimate of the actual value of $ \mathbb{P}(E_{1} \cup E_{2}) $.
In order to avoid this we will need to subtract the probability of $ E_{3} $ one time. This leads us to the following formulation:
\begin{equation}
\mathbb{P}(E_{1} \cup E_{2}) = \mathbb{P}(E_{1}) + \mathbb{P}(E_{2}) - \mathbb{P}(E_{1} \cap E_{2})
\end{equation}

Notice that this is fully general in that it is true even if $ E_{1} $ and $ E_{2} $ are disjoint. In that
case, their intersection would be empty. We can generalize this principle to the (countable) union of
an arbitrary number of events. This will give us a principled way of calculating the probability of any
union of events. This calculation technique is known as the \textbf{Inclusion-Exclusion principle}.

\newpage
\begin{Theorem}[Inclusion-Exclusion principle]
The probability of any union of finitely many events $ E_{1}, \ldots, E_{n} $ can be computed as
\begin{equation}\label{eq:incexc}
\mathbb{P} \left( \underset{i=1}{\overset{n}{\bigcup}} E_{i} \right)
= \underset{i=1}{\overset{n}{\sum}} (-1)^{i+1} \left( \underset{j_{1} < \ldots < j_{i}}{\sum} \mathbb{P}\left( E_{j_{1}} \cap \ldots \cap E_{j_{i}} \right) \right)
\end{equation}
\end{Theorem}
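A quick sanity check of the two-event case on the fair-die space (a sketch reusing \texttt{P} and \texttt{Omega} from above; the two events are our own choices):

\begin{verbatim}
# Sketch: inclusion-exclusion for two events on the die space.
E1 <- c(1, 2, 3)   # "at most three"
E2 <- c(2, 4, 6)   # "even"
all.equal(P(union(E1, E2)),
          P(E1) + P(E2) - P(intersect(E1, E2)))   # TRUE
\end{verbatim}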
\section{Conditional Probabilities}

\begin{Definition}[Conditional probability]
The conditional probability of an event $ E_{i} $ given an event $ E_{j} $ with $ \mathbb{P}(E_{j}) > 0 $
is defined as $$ \mathbb{P}(E_{i}|E_{j}) := \dfrac{\mathbb{P}(E_{i} \cap E_{j})}{\mathbb{P}(E_{j})} $$
\label{condProb}
\end{Definition}

Before we get into the math of conditional probabilities, let us try to understand the meaning of this concept.
When we are computing the conditional probability of an event $ E_{i} $, we re-scale with the
probability of the conditioning event $ E_{j} $. If $ E_{j} \not = \Omega $, $ \mathbb{P}(E_{j}) $
might be smaller than 1. Thus, this rescaling assumes \textit{that $ E_{j} $ has already occurred}. In other
words, we are excluding all outcomes that are not in $ E_{j} $ from further consideration (even though they
may be in $ E_{i} $). The interpretation of conditional probabilities is that they are the probabilities
of events assuming that another event has already occurred.

Another interpretation is that when working with a conditional probability measure, we are in fact working
in a new probability space, where $ \Omega_{new} = E_{j} $, i.e.\ our new sample space is the conditioning event.
Notice that this also means that our probability measure will change and become the
measure from Definition~\ref{condProb}.

Here comes the cool part: although we have introduced a new concept, all the properties of probability
measures that we know by now will seamlessly carry over to conditional probabilities, if we can prove
that the conditional probability measure is a probability measure according to our axioms.

\begin{Exercise}
Use the axioms from Definition~\ref{def:probmeasure} to prove that $ \mathbb{P}(\cdot|E_{j}) $ is a probability measure.
\end{Exercise}

We will make use of conditional probabilities quite a lot in this course. We will later see a way in which
they help us to decompose joint probability distributions. For now, we are going to focus on the fact that
they are also related to the idea of independence of events.

\begin{Definition}[Independence]
Two events $ E_{1}, E_{2} $ are said to be independent if
$$ \mathbb{P}(E_{1} \cap E_{2}) = \mathbb{P}(E_{1}) \times \mathbb{P}(E_{2}) $$
Independence of two events is denoted as $ E_{1} \bot E_{2} $.
\end{Definition}

This definition relates to conditional probabilities in the following way: assume that $ E_{1} \bot E_{2} $
and that $ \mathbb{P}(E_{2}) > 0 $.
Then we get
\begin{equation}
\mathbb{P}(E_{1}|E_{2}) = \dfrac{\mathbb{P}(E_{1} \cap E_{2})}{\mathbb{P}(E_{2})}
= \dfrac{\mathbb{P}(E_{1}) \times
\mathbb{P}(E_{2})}{\mathbb{P}(E_{2})} = \mathbb{P}(E_{1}) \, .
\end{equation}
Hence, independence of two events $ E_{1} \bot E_{2}$ is equivalent to $\mathbb{P}(E_{1}|E_{2}) = \mathbb{P}(E_{1}) $ (provided that $ \mathbb{P}(E_{2}) > 0 $).

\begin{Exercise}
Prove that $E_1 \bot E_2$ is also equivalent to $\mathbb{P}(E_{2}|E_{1}) = \mathbb{P}(E_{2}) $.
\end{Exercise}

Independence will prove to be a useful concept in later chapters. More precisely, we will often
just \textit{assume} that two events (or random variables -- see the next chapter) are independent. Although
such an independence assumption might not always hold in practice, it will allow us to formulate much simpler probabilistic models.
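On the fair-die space from before, independence is just as easy to check numerically (a sketch reusing \texttt{P}; the events are again our own choices):

\begin{verbatim}
# Sketch: "even" and "at most four" are independent on the fair-die space.
E1 <- c(2, 4, 6)       # even
E2 <- c(1, 2, 3, 4)    # at most four
all.equal(P(intersect(E1, E2)),
          P(E1) * P(E2))   # TRUE: 1/3 == 1/2 * 2/3
\end{verbatim}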
\section{A Remark on the Interpretation \\ of Probabilities$^{*}$}

This concludes our introduction of axiomatic probability theory. We know that a probability is
a real number in $ [0,1] $. For all that we are going to do in this course (and in most follow-up courses)
this is fully sufficient. However, some of you may wonder what a ``natural'' interpretation of probabilities
would be. There are two dominant views on that. One postulates that if we were to take A LOT (read: almost
infinitely many) samples from a sample space, the probability of an event is its frequency amongst these
samples divided by the total number of samples taken. For those of you who know limits, this principle can be
formalized as $ \mathbb{P}(E) = \lim_{n \rightarrow \infty} \dfrac{\#E}{n} $. This view
is known as the \emph{frequentist view}.

The second view postulates that probabilities are an expression of degrees of belief. Basically,
if you assign $ \mathbb{P}(E) $ to an event $ E $, then $ \mathbb{P}(E) $ is the strength of your personal belief that
$ E $ will occur. This latter view is known as the \emph{Bayesian view}.

Which conception of probability you choose is a philosophical matter and does not really impact the math.
That is why we will not care about this issue in this course. However, it is useful to at least be aware
of these two views (if only to appear knowledgeable in a conversation you may have with your philosopher
friends).


\section{The Binomial Theorem}
The binomial theorem from Equation~\ref{binomTheorem} is actually not that hard to prove. We will do so by
induction. As a base case we choose $ m = 0 $. Then the equality is easy to see.
\begin{equation}
(p + q)^{0} = 1 = \binom{0}{0}p^{0}q^{0}
\end{equation}

Next, we assume that the theorem holds for $ m = n $. What we want to show is that it also holds for
$ m = n + 1 $. We achieve this by algebraic manipulation.

\begin{align}
(p + q)^{n+1} &= (p + q)^{n} \times (p + q) \\
&= (p+q)^{n}p + (p+q)^{n}q \\
&= p\underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n-i} + q\underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n-i} \label{inductiveHyp} \\
&= \underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i+1}q^{n-i} + \underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n+1-i} \\
&= \underset{j=1}{\overset{n+1}{\sum}} \binom{n}{j-1} p^{j}q^{n+1-j} + \underset{i=0}{\overset{n}{\sum}} \binom{n}{i} p^{i}q^{n+1-i} \label{variableSwitch} \\
&= \binom{n}{n} p^{n+1}q^{(n+1)-(n+1)} + \underset{k=1}{\overset{n}{\sum}} \binom{n}{k-1} p^{k}q^{n+1-k} \nonumber \\
&\qquad + \binom{n}{0} p^{0}q^{n+1} + \underset{k=1}{\overset{n}{\sum}} \binom{n}{k} p^{k}q^{n+1-k} \label{pullOut} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \left(\binom{n}{k} + \binom{n}{k-1}\right) p^{k}q^{n+1-k} \label{collapseSums} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \left(\dfrac{n!}{k!(n-k)!} + \dfrac{n!}{(k-1)!(n-k+1)!}\right) p^{k}q^{n+1-k} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \left(\dfrac{n!(n+1-k)}{k!(n+1-k)!} + \dfrac{n!k}{k!(n-k+1)!}\right) p^{k}q^{n+1-k} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \dfrac{n!(n+1)}{k!(n+1-k)!} p^{k}q^{n+1-k} \\
&= q^{n+1} + p^{n+1} + \underset{k=1}{\overset{n}{\sum}} \binom{n+1}{k} p^{k}q^{n+1-k} \\
&= \underset{k=0}{\overset{n+1}{\sum}} \binom{n+1}{k} p^{k}q^{n+1-k}
\end{align}

Let us clarify some parts of the proof. We use the induction hypothesis to expand the terms in Line~\ref{inductiveHyp}.
In Line~\ref{variableSwitch}, we switch the variable $ i $ in the first summand to $ j = i+1 $. The
reason why we do this is because we want to achieve congruence with the exponents of the second summand. In the following line we
uniformly name the variables $ k $. Since $ k $ has to run over a common range, we chop off the ends of both sums that stick out. In the first
sum of line \ref{variableSwitch} that is the summand that corresponds to $ j=n+1 $ and in the second sum it is the summand that corresponds
to $ i = 0 $. We pull out both of them in line \ref{pullOut} and then collapse the sums in line \ref{collapseSums}. The following lines
are basically just an exercise in manipulating fractions. The jump from the second-to-last to the last line is allowed because
$$ q^{n+1} = \binom{n+1}{0}p^{0}q^{n+1-0} $$ and $$ p^{n+1} = \binom{n+1}{n+1}p^{n+1}q^{(n+1)-(n+1)} $$
which are exactly the quantities that we need to add to make our sum reach from $ 0 $ to $ n+1 $. This completes the proof.
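For the sceptical reader, the identity is also easy to confirm numerically (a sketch; the values of \texttt{p}, \texttt{q} and \texttt{m} are arbitrary choices of ours):

\begin{verbatim}
# Sketch: numerical check of the binomial theorem.
p <- 0.3; q <- 1.7; m <- 5
lhs <- (p + q)^m
rhs <- sum(choose(m, 0:m) * p^(0:m) * q^(m - (0:m)))
all.equal(lhs, rhs)   # TRUE
\end{verbatim}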
\section*{Further Reading}
A very quick and dirty introduction to measure theory is provided by Maya Gupta and can be found
\href{https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2006-0008.pdf}{here}. If you are
looking for something more extensive that also motivates event spaces and the like you may want to
take a look at \href{http://www.stat.ncsu.edu/people/fuentes/courses/st778/lectures/ross}{this script}
by Ross Leadbetter and Stamatis Cambanis (which has also been
published as a book).




%%% Local Variables:
%%% mode: latex
%%% TeX-master: "chapter2"
%%% End:
--------------------------------------------------------------------------------