├── gradient.png ├── math4ml.pdf ├── convex-set.png ├── nonconvex-set.png ├── convex-function.png ├── measure-probability.pdf ├── orthogonal-projection.png ├── measure-probability.bib ├── math4ml.bib ├── common.tex ├── math4ml.tex ├── cs189-convexity.tex ├── cs189-calculus-optimization.tex ├── cs189-probability.tex ├── measure-probability.tex └── cs189-linalg.tex /gradient.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/gradient.png -------------------------------------------------------------------------------- /math4ml.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/math4ml.pdf -------------------------------------------------------------------------------- /convex-set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/convex-set.png -------------------------------------------------------------------------------- /nonconvex-set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/nonconvex-set.png -------------------------------------------------------------------------------- /convex-function.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/convex-function.png -------------------------------------------------------------------------------- /measure-probability.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/measure-probability.pdf -------------------------------------------------------------------------------- /orthogonal-projection.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/gwthomas/math4ml/HEAD/orthogonal-projection.png -------------------------------------------------------------------------------- /measure-probability.bib: -------------------------------------------------------------------------------- 1 | @book{folland, 2 | author = "Gerald B. Folland", 3 | title = "Real Analysis: Modern Techniques and Their Applications (Second Edition)", 4 | year = "1999", 5 | publisher = "John Wiley \& Sons", 6 | address = "New York" 7 | } 8 | 9 | @book{rigorousprob, 10 | author = "Jeffrey S. Rosenthal", 11 | title = "A First Look at Rigorous Probability Theory (Second Edition)", 12 | year = "2006", 13 | publisher = "World Scientific Publishing", 14 | address = "Singapore" 15 | } 16 | 17 | @book{pitman, 18 | author = "Jim Pitman", 19 | title = "Probability", 20 | year = "1993", 21 | publisher = "Springer-Verlag", 22 | address = "New York" 23 | } 24 | -------------------------------------------------------------------------------- /math4ml.bib: -------------------------------------------------------------------------------- 1 | @book{ladr, 2 | author = "Sheldon Axler", 3 | title = "Linear Algebra Done Right (Third Edition)", 4 | year = "2015", 5 | publisher = "Springer International Publishing" 6 | } 7 | 8 | @book{cvxopt, 9 | author = "Stephen Boyd and Lieven Vandenberghe", 10 | title = "Convex Optimization", 11 | year = "2009", 12 | publisher = "Cambridge University Press", 13 | address = "New York" 14 | } 15 | 16 | @book{numopt, 17 | author = "Jorge Nocedal and Stephen J. Wright", 18 | title = "Numerical Optimization", 19 | year = "2006", 20 | publisher = "Springer Science+Business Media", 21 | address = "New York" 22 | } 23 | 24 | @book{pitman, 25 | author = "Jim Pitman", 26 | title = "Probability", 27 | year = "1993", 28 | publisher = "Springer-Verlag", 29 | address = "New York" 30 | } 31 | 32 | @book{rigorousprob, 33 | author = "Jeffrey S. 
Rosenthal", 34 | title = "A First Look at Rigorous Probability Theory (Second Edition)", 35 | year = "2006", 36 | publisher = "World Scientific Publishing", 37 | address = "Singapore" 38 | } 39 | 40 | @book{rice, 41 | author = "John A. Rice", 42 | title = "Mathematical Statistics and Data Analysis", 43 | year = "2007", 44 | publisher = "Thomson Brooks/Cole", 45 | address = "Belmont, California" 46 | } 47 | 48 | @book{afam, 49 | author = "Ward Cheney", 50 | title = "Analysis for Applied Mathematics", 51 | year = "2001", 52 | publisher = "Springer Science+Business Media", 53 | address = "New York" 54 | } 55 | -------------------------------------------------------------------------------- /common.tex: -------------------------------------------------------------------------------- 1 | % useful packages 2 | \usepackage{amsfonts,amsmath,amssymb,amsthm,bm,commath,enumerate,graphicx,hyperref,nicefrac,physics,subcaption} 3 | 4 | % formatting 5 | \setlength{\parskip}{0.5em} 6 | \setlength{\parindent}{0em} 7 | \usepackage[margin=1.25in]{geometry} 8 | \hypersetup{ 9 | colorlinks=true, 10 | linktoc=all, 11 | linkcolor=black, 12 | urlcolor=blue 13 | } 14 | 15 | % shorthand 16 | \DeclareMathOperator*{\argmax}{arg\,max} 17 | \DeclareMathOperator*{\argmin}{arg\,min} 18 | \DeclareMathOperator*{\dom}{dom} 19 | \DeclareMathOperator*{\range}{range} 20 | \DeclareMathOperator*{\diag}{diag} 21 | \DeclareMathOperator*{\Null}{null} 22 | \newcommand{\C}{\mathbb{C}} 23 | \newcommand{\F}{\mathbb{F}} 24 | \newcommand{\N}{\mathbb{N}} 25 | \newcommand{\Q}{\mathbb{Q}} 26 | \newcommand{\R}{\mathbb{R}} 27 | \newcommand{\Z}{\mathbb{Z}} 28 | \newcommand{\calA}{\mathcal{A}} 29 | \newcommand{\calB}{\mathcal{B}} 30 | \newcommand{\calC}{\mathcal{C}} 31 | \newcommand{\calD}{\mathcal{D}} 32 | \newcommand{\calF}{\mathcal{F}} 33 | \newcommand{\calH}{\mathcal{H}} 34 | \newcommand{\calI}{\mathcal{I}} 35 | \newcommand{\calL}{\mathcal{L}} 36 | \newcommand{\calM}{\mathcal{M}} 37 | 
\newcommand{\calN}{\mathcal{N}} 38 | \newcommand{\calP}{\mathcal{P}} 39 | \newcommand{\calR}{\mathcal{R}} 40 | \newcommand{\calX}{\mathcal{X}} 41 | \renewcommand{\vec}[1]{\mathbf{#1}} 42 | \newcommand{\mat}[1]{\mathbf{#1}} 43 | \newcommand{\matlit}[1]{\begin{bmatrix}#1\end{bmatrix}} 44 | \newcommand{\tran}{^{\!\top\!}} 45 | \newcommand{\inv}{^{-1}} 46 | \newcommand{\halfpow}{^{\frac{1}{2}}} 47 | \newcommand{\neghalfpow}{^{-\frac{1}{2}}} 48 | \renewcommand{\angle}[1]{\langle #1 \rangle} 49 | \newcommand{\bigangle}[1]{\left\langle #1 \right\rangle} 50 | \newcommand{\inner}[2]{\angle{#1, #2}} 51 | \newcommand{\biginner}[2]{\bigangle{#1, #2}} 52 | \renewcommand{\P}{\mathbb{P}} 53 | \newcommand{\pr}[1]{\P(#1)} 54 | \newcommand{\prbig}[1]{\P\big(#1\big)} 55 | \newcommand{\prbigg}[1]{\P\bigg(#1\bigg)} 56 | \newcommand{\prlr}[1]{\P\left(#1\right)} 57 | \newcommand{\comp}{^\text{c}} 58 | \newcommand{\given}{|} 59 | \renewcommand{\ev}[1]{\mathbb{E}[#1]} 60 | \newcommand{\evwrt}[2]{\mathbb{E}_{#1}[#2]} 61 | \renewcommand{\var}[1]{\operatorname{Var}(#1)} 62 | \newcommand{\cov}[2]{\operatorname{Cov}(#1, #2)} 63 | \newcommand{\bigev}[1]{\mathbb{E}\left[#1\right]} 64 | \newcommand{\bigvar}[1]{\operatorname{Var}\left(#1\right)} 65 | \newcommand{\bigcov}[2]{\operatorname{Cov}\left(#1, #2\right)} 66 | \newcommand{\iid}{\overset{\text{iid}}{\sim}} 67 | \newcommand{\bX}{\mathbf{X}} 68 | \newcommand{\term}[1]{\textbf{#1}} 69 | \newcommand{\tab}{\hspace{0.5cm}} 70 | \renewcommand{\a}{\vec{a}} 71 | \renewcommand{\b}{\vec{b}} 72 | \newcommand{\e}{\vec{e}} 73 | \newcommand{\g}{\vec{g}} 74 | \newcommand{\h}{\vec{h}} 75 | \renewcommand{\o}{\vec{o}} 76 | \newcommand{\q}{\vec{q}} 77 | \newcommand{\s}{\vec{s}} 78 | \newcommand{\x}{\vec{x}} 79 | \newcommand{\y}{\vec{y}} 80 | \newcommand{\w}{\vec{w}} 81 | \newcommand{\z}{\vec{z}} 82 | \newcommand{\A}{\mat{A}} 83 | \newcommand{\I}{\mat{I}} 84 | \newcommand{\xye}{\tilde{\x}} 85 | \newcommand{\dotcup}{\mathbin{\dot{\cup}}} 86 | 
\newcommand{\bigdotcup}{\mathop{\dot{\bigcup}}} 87 | 88 | \newtheorem{theorem}{Theorem} 89 | \newtheorem*{theorem*}{Theorem} 90 | \newtheorem{definition}{Definition} 91 | \newtheorem*{definition*}{Definition} 92 | \newtheorem{proposition}{Proposition} 93 | \newtheorem*{proposition*}{Proposition} 94 | \newtheorem{lemma}{Lemma} 95 | \newtheorem*{lemma*}{Lemma} 96 | \newtheorem{corollary}{Corollary} 97 | \newtheorem*{corollary*}{Corollary} 98 | \theoremstyle{remark} 99 | \newtheorem*{note}{Note} 100 | \newtheorem*{example}{Example} 101 | -------------------------------------------------------------------------------- /math4ml.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | \title{Mathematics for Machine Learning} 3 | \author{Garrett Thomas\\ 4 | Department of Electrical Engineering and Computer Sciences\\ 5 | University of California, Berkeley} 6 | 7 | \input{common.tex} 8 | 9 | \begin{document} 10 | \maketitle 11 | 12 | \section{About} 13 | Machine learning uses tools from a variety of mathematical fields. 14 | This document is an attempt to provide a summary of the mathematical background needed for an introductory class in machine learning, which at UC Berkeley is known as CS 189/289A. 15 | 16 | Our assumption is that the reader is already familiar with the basic concepts of multivariable calculus and linear algebra (at the level of UCB Math 53/54). 17 | We emphasize that this document is \textbf{not} a replacement for the prerequisite classes. 18 | Most subjects presented here are covered rather minimally; we intend to give an overview and point the interested reader to more comprehensive treatments for further details. 19 | 20 | Note that this document concerns math background for machine learning, not machine learning itself. 21 | We will not discuss specific machine learning models or algorithms except possibly in passing to highlight the relevance of a mathematical concept. 
22 | 23 | Earlier versions of this document did not include proofs. 24 | We have begun adding in proofs where they are reasonably short and aid in understanding. 25 | These proofs are not necessary background for CS 189 but can be used to deepen the reader's understanding. 26 | 27 | You are free to distribute this document as you wish. 28 | The latest version can be found at \url{http://gwthomas.github.io/docs/math4ml.pdf}. 29 | Please report any mistakes to \url{gwthomas@berkeley.edu}. 30 | 31 | \newpage 32 | \tableofcontents 33 | 34 | \newpage 35 | \section{Notation} 36 | \begin{tabular}{|l|l|} 37 | \hline 38 | Notation & Meaning \\ 39 | \hline 40 | $\R$ & set of real numbers \\ 41 | $\R^n$ & set (vector space) of $n$-tuples of real numbers, endowed with the usual inner product \\ 42 | $\R^{m \times n}$ & set (vector space) of $m$-by-$n$ matrices \\ 43 | $\delta_{ij}$ & Kronecker delta, i.e. $\delta_{ij} = 1$ if $i = j$, $0$ otherwise \\ 44 | $\nabla f(\vec{x})$ & gradient of the function $f$ at $\x$ \\ 45 | $\nabla^2 f(\vec{x})$ & Hessian of the function $f$ at $\x$ \\ 46 | $\A\tran$ & transpose of the matrix $\A$ \\ 47 | $\Omega$ & sample space \\ 48 | $\pr{A}$ & probability of event $A$ \\ 49 | $p(X)$ & distribution of random variable $X$ \\ 50 | $p(x)$ & probability density/mass function evaluated at $x$ \\ 51 | $A\comp$ & complement of event $A$ \\ 52 | $A \dotcup B$ & union of $A$ and $B$, with the extra requirement that $A \cap B = \varnothing$ \\ 53 | $\ev{X}$ & expected value of random variable $X$ \\ 54 | $\var{X}$ & variance of random variable $X$ \\ 55 | $\cov{X}{Y}$ & covariance of random variables $X$ and $Y$ \\ 56 | \hline 57 | \end{tabular} 58 | 59 | \vspace{0.5cm} 60 | Other notes: 61 | \begin{itemize} 62 | \item Vectors and matrices are in bold (e.g. $\x, \A$). 63 | This is true for vectors in $\R^n$ as well as for vectors in general vector spaces. 
64 | We generally use Greek letters for scalars and capital Roman letters for matrices and random variables. 65 | 66 | \item To stay focused at an appropriate level of abstraction, we restrict ourselves to real values. 67 | In many places in this document, it is entirely possible to generalize to the complex case, but we will simply state the version that applies to the reals. 68 | 69 | \item We assume that vectors are column vectors, i.e. that a vector in $\R^n$ can be interpreted as an $n$-by-$1$ matrix. 70 | As such, taking the transpose of a vector is well-defined (and produces a row vector, which is a $1$-by-$n$ matrix). 71 | \end{itemize} 72 | 73 | \newpage 74 | \section{Linear Algebra} 75 | \input{cs189-linalg.tex} 76 | 77 | \newpage 78 | \section{Calculus and Optimization} 79 | \input{cs189-calculus-optimization.tex} 80 | 81 | \newpage 82 | \section{Probability} 83 | \input{cs189-probability.tex} 84 | 85 | \newpage 86 | \section*{Acknowledgements} 87 | The author would like to thank Michael Franco for suggested clarifications, and Chinmoy Saayujya for catching a typo. 88 | 89 | \bibliography{math4ml} 90 | \addcontentsline{toc}{section}{References} 91 | \bibliographystyle{ieeetr} 92 | \nocite{*} 93 | \end{document} 94 | -------------------------------------------------------------------------------- /cs189-convexity.tex: -------------------------------------------------------------------------------- 1 | \term{Convexity} is a term that pertains to both sets and functions. 2 | For functions, there are different degrees of convexity, and how convex a function is tells us a lot about its minima: do they exist, are they unique, how quickly can we find them using optimization algorithms, etc. 3 | In this section, we present basic results regarding convexity, strict convexity, and strong convexity. 
4 | 5 | \subsubsection{Convex sets} 6 | \begin{figure} 7 | \centering 8 | \begin{subfigure}[b]{0.45\linewidth} 9 | \includegraphics[width=\linewidth]{convex-set} 10 | \caption{A convex set} 11 | \end{subfigure} 12 | \begin{subfigure}[b]{0.45\linewidth} 13 | \includegraphics[width=\linewidth]{nonconvex-set} 14 | \caption{A non-convex set} 15 | \end{subfigure} 16 | \caption{What convex sets look like} 17 | \label{fig:convexset} 18 | \end{figure} 19 | 20 | A set $\calX \subseteq \R^d$ is \term{convex} if 21 | \[t\x + (1-t)\y \in \calX\] 22 | for all $\vec{x}, \vec{y} \in \calX$ and all $t \in [0,1]$. 23 | 24 | Geometrically, this means that all the points on the line segment between any two points in $\calX$ are also in $\calX$. 25 | See Figure \ref{fig:convexset} for a visual. 26 | 27 | Why do we care whether or not a set is convex? 28 | We will see later that the nature of minima can depend greatly on whether or not the feasible set is convex. 29 | Undesirable pathological results can occur when we allow the feasible set to be arbitrary, so for proofs we will need to assume that it is convex. 30 | Fortunately, we often want to minimize over all of $\R^d$, which is easily seen to be a convex set. 31 | 32 | \subsubsection{Basics of convex functions} 33 | In the remainder of this section, assume $f : \R^d \to \R$ unless otherwise noted. We'll start with the definitions and then give some results. 34 | 35 | A function $f$ is \term{convex} if 36 | \[f(t\vec{x} + (1-t)\vec{y}) \leq t f(\vec{x}) + (1-t)f(\vec{y})\] 37 | for all $\vec{x}, \vec{y} \in \dom f$ and all $t \in [0,1]$. 38 | 39 | If the inequality holds strictly (i.e. $<$ rather than $\leq$) for all $t \in (0,1)$ and $\x \neq \y$, then we say that $f$ is \term{strictly convex}. 40 | 41 | A function $f$ is \term{strongly convex with parameter $m$} (or \term{$m$-strongly convex}) if the function 42 | \[\x \mapsto f(\x) - \frac{m}{2}\|\x\|_2^2\] 43 | is convex. 
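The three definitions above lend themselves to a quick numerical spot-check. Below is a minimal, dependency-free Python sketch (the helper name `looks_convex` and the sampled interval are our own illustrative choices, not from the text): it samples the defining inequality $f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$ at random points. Passing the check is only evidence of convexity, not a proof, but a single violation does certify non-convexity.

```python
import math
import random

# Hypothetical helper (not from the text): randomly sample the defining
# convexity inequality f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y).
# Passing is evidence, not proof; one violation certifies non-convexity.
def looks_convex(f, trials=1000, lo=-10.0, hi=10.0, tol=1e-9):
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.uniform(0.0, 1.0)
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False
    return True

random.seed(0)
print(looks_convex(lambda x: x * x))    # x^2 is convex
print(looks_convex(lambda x: x ** 3))   # x^3 is not convex on all of R
# Strong convexity via the definition: if exp were 0.1-strongly convex,
# exp(x) - (0.1/2) x^2 would have to be convex -- it is not.
print(looks_convex(lambda x: math.exp(x) - 0.05 * x * x))
```

The same sampling idea works for strict and strong convexity by tightening the inequality or subtracting the quadratic term, as in the last line.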
44 | 45 | These conditions are given in increasing order of strength; strong convexity implies strict convexity which implies convexity. 46 | 47 | %\begin{proposition} 48 | %If $f$ is strictly convex, then $f$ is convex. 49 | %\end{proposition} 50 | %\begin{proof} 51 | %Suppose $\x, \y \in \dom f$ and $t \in [0,1]$. We break it down by cases: 52 | %\begin{enumerate} 53 | %\item $\x \neq \y$ and $t \in (0,1)$: the convexity condition is a clear consequence of the strict convexity condition. 54 | %\item $\x = \y$ and $t \in [0,1]$: we have 55 | %\[f(t\x + (1-t)\y) = f(t\x + (1-t)\x) = f(\x) = f(\x) + t f(\x) - t f(\y) = t f(\x) + (1-t)f(\y)\] 56 | %so the condition holds. 57 | %\item $t = 0$: then 58 | %\[f(t\x + (1-t)\y) = f(\y) = t f(\x) + (1-t)f(\y)\] 59 | %\item $t = 1$: similar to the $t = 0$ case. 60 | %\end{enumerate} 61 | %Hence $f$ is convex. 62 | %\end{proof} 63 | % 64 | %\begin{proposition} 65 | %If $f$ is $m$-strongly convex, then $f$ is strictly convex. 66 | %\end{proposition} 67 | %\begin{proof} 68 | %To-do. 69 | %\end{proof} 70 | 71 | \begin{figure} 72 | \centering 73 | \includegraphics[width=\linewidth]{convex-function} 74 | \caption{What convex functions look like} 75 | \label{fig:convexfunction} 76 | \end{figure} 77 | 78 | Geometrically, convexity means that the line segment between two points on the graph of $f$ lies on or above the graph itself. 79 | See Figure \ref{fig:convexfunction} for a visual. 80 | 81 | Strict convexity means that the graph of $f$ lies strictly above the line segment, except at the segment endpoints. 82 | (So actually the function in the figure appears to be strictly convex.) 83 | 84 | \subsubsection{Consequences of convexity} 85 | Why do we care if a function is (strictly/strongly) convex? 86 | 87 | Basically, our various notions of convexity have implications about the nature of minima. 88 | It should not be surprising that the stronger conditions tell us more about the minima. 
89 | 90 | \begin{proposition} 91 | Let $\calX$ be a convex set. 92 | If $f$ is convex, then any local minimum of $f$ in $\calX$ is also a global minimum. 93 | \end{proposition} 94 | \begin{proof} 95 | Suppose $f$ is convex, and let $\x^*$ be a local minimum of $f$ in $\calX$. 96 | Then for some neighborhood $N \subseteq \calX$ about $\x^*$, we have $f(\x) \geq f(\x^*)$ for all $\x \in N$. 97 | Suppose towards a contradiction that there exists $\xye \in \calX$ such that $f(\xye) < f(\x^*)$. 98 | 99 | Consider the line segment $\x(t) = t\x^* + (1-t)\xye, ~ t \in [0,1]$, noting that $\x(t) \in \calX$ by the convexity of $\calX$. 100 | Then by the convexity of $f$, 101 | \[f(\x(t)) \leq tf(\x^*) + (1-t)f(\xye) < tf(\x^*) + (1-t)f(\x^*) = f(\x^*)\] 102 | for all $t \in (0,1)$. 103 | 104 | We can pick $t$ to be sufficiently close to $1$ that $\x(t) \in N$; then $f(\x(t)) \geq f(\x^*)$ by the definition of $N$, but $f(\x(t)) < f(\x^*)$ by the above inequality, a contradiction. 105 | 106 | It follows that $f(\x^*) \leq f(\x)$ for all $\x \in \calX$, so $\x^*$ is a global minimum of $f$ in $\calX$. 107 | \end{proof} 108 | 109 | \begin{proposition} 110 | Let $\calX$ be a convex set. 111 | If $f$ is strictly convex, then there exists at most one local minimum of $f$ in $\calX$. 112 | Consequently, if it exists it is the unique global minimum of $f$ in $\calX$. 113 | \end{proposition} 114 | \begin{proof} 115 | The second sentence follows from the first, so all we must show is that if a local minimum exists in $\calX$ then it is unique. 116 | 117 | Suppose $\x^*$ is a local minimum of $f$ in $\calX$, and suppose towards a contradiction that there exists a local minimum $\xye \in \calX$ such that $\xye \neq \x^*$. 118 | 119 | Since $f$ is strictly convex, it is convex, so $\x^*$ and $\xye$ are both global minima of $f$ in $\calX$ by the previous result. 120 | Hence $f(\x^*) = f(\xye)$. 
121 | Consider the line segment $\x(t) = t\x^* + (1-t)\xye, ~ t \in [0,1]$, which again must lie entirely in $\calX$. 122 | By the strict convexity of $f$, 123 | \[f(\x(t)) < tf(\x^*) + (1-t)f(\xye) = tf(\x^*) + (1-t)f(\x^*) = f(\x^*)\] 124 | for all $t \in (0,1)$. 125 | But this contradicts the fact that $\x^*$ is a global minimum. 126 | Therefore if $\xye$ is a local minimum of $f$ in $\calX$, then $\xye = \x^*$, so $\x^*$ is the unique minimum in $\calX$. 127 | \end{proof} 128 | 129 | It is worthwhile to examine how the feasible set affects the optimization problem. 130 | We will see why the assumption that $\calX$ is convex is needed in the results above. 131 | 132 | Consider the function $f(x) = x^2$, which is a strictly convex function. 133 | The unique global minimum of this function in $\R$ is $x = 0$. 134 | But let's see what happens when we change the feasible set $\calX$. 135 | \begin{enumerate}[(i)] 136 | \item $\calX = \{1\}$: This set is actually convex, so we still have a unique global minimum. 137 | But it is not the same as the unconstrained minimum! 138 | 139 | \item $\calX = \R \setminus \{0\}$: This set is non-convex, and we can see that $f$ has no minima in $\calX$. 140 | For any point $x \in \calX$, one can find another point $y \in \calX$ such that $f(y) < f(x)$. 141 | 142 | \item $\calX = (-\infty,-1] \cup [0,\infty)$: This set is non-convex, and we can see that there is a local minimum ($x = -1$) which is distinct from the global minimum ($x = 0$). 143 | 144 | \item $\calX = (-\infty,-1] \cup [1,\infty)$: This set is non-convex, and we can see that there are two global minima ($x = \pm 1$). 145 | \end{enumerate} 146 | 147 | \subsubsection{Showing that a function is convex} 148 | Hopefully the previous section has convinced the reader that convexity is an important property. 149 | Next we turn to the issue of showing that a function is (strictly/strongly) convex. 
150 | It is of course possible (in principle) to directly show that the condition in the definition holds, but this is usually not the easiest way. 151 | 152 | \begin{proposition} 153 | Norms are convex. 154 | \end{proposition} 155 | \begin{proof} 156 | Let $\|\cdot\|$ be a norm on a vector space $V$. Then for all $\x, \y \in V$ and $t \in [0,1]$, 157 | \[\|t\x + (1-t)\y\| \leq \|t\x\| + \|(1-t)\y\| = |t|\|\x\| + |1-t|\|\y\| = t\|\x\| + (1-t)\|\y\|\] 158 | where we have used respectively the triangle inequality, the homogeneity of norms, and the fact that $t$ and $1-t$ are nonnegative. 159 | Hence $\|\cdot\|$ is convex. 160 | \end{proof} 161 | 162 | \begin{proposition} 163 | Suppose $f$ is differentiable. Then $f$ is convex if and only if 164 | \[f(\y) \geq f(\x) + \angle{\nabla f(\x), \y - \x}\] 165 | for all $\x, \y \in \dom f$. 166 | \end{proposition} 167 | \begin{proof} 168 | To-do. 169 | \end{proof} 170 | 171 | \begin{proposition} 172 | Suppose $f$ is twice differentiable. 173 | Then 174 | \begin{enumerate}[(i)] 175 | \item $f$ is convex if and only if $\nabla^2 f(\x) \succeq 0$ for all $\x \in \dom f$. 176 | \item If $\nabla^2 f(\x) \succ 0$ for all $\x \in \dom f$, then $f$ is strictly convex. 177 | \item $f$ is $m$-strongly convex if and only if $\nabla^2 f(\x) \succeq mI$ for all $\x \in \dom f$. 178 | \end{enumerate} 179 | \end{proposition} 180 | \begin{proof} 181 | Omitted. 182 | \end{proof} 183 | 184 | \begin{proposition} 185 | If $f$ is convex and $\alpha \geq 0$, then $\alpha f$ is convex. 186 | \end{proposition} 187 | \begin{proof} 188 | Suppose $f$ is convex and $\alpha \geq 0$. Then for all $\x, \y \in \dom(\alpha f) = \dom f$, 189 | \begin{align*} 190 | (\alpha f)(t\x + (1-t)\y) &= \alpha f(t\x + (1-t)\y) \\ 191 | &\leq \alpha\left(tf(\x) + (1-t)f(\y)\right) \\ 192 | &= t(\alpha f(\x)) + (1-t)(\alpha f(\y)) \\ 193 | &= t(\alpha f)(\x) + (1-t)(\alpha f)(\y) 194 | \end{align*} 195 | so $\alpha f$ is convex. 
196 | \end{proof} 197 | 198 | \begin{proposition} 199 | If $f$ and $g$ are convex, then $f+g$ is convex. 200 | Furthermore, if $g$ is strictly convex, then $f+g$ is strictly convex, and if $g$ is $m$-strongly convex, then $f+g$ is $m$-strongly convex. 201 | \end{proposition} 202 | \begin{proof} 203 | Suppose $f$ and $g$ are convex. Then for all $\x, \y \in \dom (f+g) = \dom f \cap \dom g$, 204 | \begin{align*} 205 | (f+g)(t\x + (1-t)\y) &= f(t\x + (1-t)\y) + g(t\x + (1-t)\y) \\ 206 | &\leq tf(\x) + (1-t)f(\y) + g(t\x + (1-t)\y) & \text{convexity of $f$} \\ 207 | &\leq tf(\x) + (1-t)f(\y) + tg(\x) + (1-t)g(\y) & \text{convexity of $g$} \\ 208 | &= t(f(\x) + g(\x)) + (1-t)(f(\y) + g(\y)) \\ 209 | &= t(f+g)(\x) + (1-t)(f+g)(\y) 210 | \end{align*} 211 | so $f + g$ is convex. 212 | 213 | If $g$ is strictly convex, the second inequality above holds strictly for $\x \neq \y$ and $t \in (0,1)$, so $f+g$ is strictly convex. 214 | 215 | If $g$ is $m$-strongly convex, then the function $h(\x) \equiv g(\x) - \frac{m}{2}\|\x\|_2^2$ is convex, so $f+h$ is convex. 216 | But 217 | \[(f+h)(\x) \equiv f(\x) + h(\x) \equiv f(\x) + g(\x) - \frac{m}{2}\|\x\|_2^2 \equiv (f+g)(\x) - \frac{m}{2}\|\x\|_2^2\] 218 | so $f+g$ is $m$-strongly convex. 219 | \end{proof} 220 | 221 | \begin{proposition} 222 | If $f_1, \dots, f_n$ are convex and $\alpha_1, \dots, \alpha_n \geq 0$, then 223 | \[\sum_{i=1}^n \alpha_i f_i\] 224 | is convex. 225 | \end{proposition} 226 | \begin{proof} 227 | Follows from the previous two propositions by induction. 228 | \end{proof} 229 | 230 | \begin{proposition} 231 | If $f$ is convex, then $g(\vec{x}) \equiv f(\A\x + \vec{b})$ is convex for any appropriately-sized $\A$ and $\b$. 232 | \end{proposition} 233 | \begin{proof} 234 | Suppose $f$ is convex and $g$ is defined like so. 
Then for all $\x, \y \in \dom g$, 235 | \begin{align*} 236 | g(t\x + (1-t)\y) &= f(\A(t\x + (1-t)\y) + \b) \\ 237 | &= f(t\A\x + (1-t)\A\y + \b) \\ 238 | &= f(t\A\x + (1-t)\A\y + t\b + (1-t)\b) \\ 239 | &= f(t(\A\x + \b) + (1-t)(\A\y + \b)) \\ 240 | &\leq tf(\A\x + \b) + (1-t)f(\A\y + \b) & \text{convexity of $f$} \\ 241 | &= tg(\x) + (1-t)g(\y) 242 | \end{align*} 243 | Thus $g$ is convex. 244 | \end{proof} 245 | 246 | \begin{proposition} 247 | If $f$ and $g$ are convex, then $h(\vec{x}) \equiv \max\{f(\vec{x}), g(\vec{x})\}$ is convex. 248 | \end{proposition} 249 | \begin{proof} 250 | Suppose $f$ and $g$ are convex and $h$ is defined like so. Then for all $\x, \y \in \dom h$, 251 | \begin{align*} 252 | h(t\x + (1-t)\y) &= \max\{f(t\x + (1-t)\y), g(t\x + (1-t)\y)\} \\ 253 | &\leq \max\{tf(\x) + (1-t)f(\y), tg(\x) + (1-t)g(\y)\} \\ 254 | &\leq \max\{tf(\x), tg(\x)\} + \max\{(1-t)f(\y), (1-t)g(\y)\} \\ 255 | &= t\max\{f(\x), g(\x)\} + (1-t)\max\{f(\y), g(\y)\} \\ 256 | &= th(\x) + (1-t)h(\y) 257 | \end{align*} 258 | Note that in the first inequality we have used convexity of $f$ and $g$ plus the fact that $a \leq c, b \leq d$ implies $\max\{a,b\} \leq \max\{c,d\}$. 259 | In the second inequality we have used the fact that $\max\{a+b, c+d\} \leq \max\{a,c\} + \max\{b,d\}$. 260 | 261 | Thus $h$ is convex. 262 | \end{proof} 263 | 264 | \subsubsection{Examples} 265 | A good way to gain intuition about the distinction between convex, strictly convex, and strongly convex functions is to consider examples where the stronger property fails to hold. 266 | 267 | Functions that are convex but not strictly convex: 268 | \begin{enumerate}[(i)] 269 | \item $f(\x) = \w\tran\x + \alpha$ for any $\w \in \R^d, \alpha \in \R$. 270 | Such a function is called an \term{affine function}, and it is both convex and concave. 271 | (In fact, a function is affine if and only if it is both convex and concave.) 
272 | Note that linear functions and constant functions are special cases of affine functions. 273 | \item $f(\x) = \|\x\|_1$ 274 | \end{enumerate} 275 | 276 | Functions that are strictly but not strongly convex: 277 | \begin{enumerate}[(i)] 278 | \item $f(x) = x^4$. 279 | This example is interesting because it is strictly convex but you cannot show this fact via a second-order argument (since $f''(0) = 0$). 280 | \item $f(x) = \exp(x)$. 281 | This example is interesting because it's bounded below but has no local minimum. 282 | \item $f(x) = -\log x$. 283 | This example is interesting because it's strictly convex but not bounded below. 284 | \end{enumerate} 285 | 286 | Functions that are strongly convex: 287 | \begin{enumerate}[(i)] 288 | \item $f(\x) = \|\x\|_2^2$ 289 | \end{enumerate} 290 | -------------------------------------------------------------------------------- /cs189-calculus-optimization.tex: -------------------------------------------------------------------------------- 1 | Much of machine learning is about minimizing a \term{cost function} (also called an \term{objective function} in the optimization community), which is a scalar function of several variables that typically measures how poorly our model fits the data we have. 2 | 3 | \subsection{Extrema} 4 | Optimization is about finding \term{extrema}, which depending on the application could be minima or maxima. 5 | When defining extrema, it is necessary to consider the set of inputs over which we're optimizing. 6 | This set $\calX \subseteq \R^d$ is called the \term{feasible set}. 7 | If $\calX$ is the entire domain of the function being optimized (as it often will be for our purposes), we say that the problem is \term{unconstrained}. 8 | Otherwise the problem is \term{constrained} and may be much harder to solve, depending on the nature of the feasible set. 9 | 10 | Suppose $f : \R^d \to \R$. 11 | A point $\x$ is said to be a \term{local minimum} (resp. 
\term{local maximum}) of $f$ in $\calX$ if $f(\x) \leq f(\y)$ (resp. $f(\x) \geq f(\y)$) for all $\y$ in some neighborhood $N \subseteq \calX$ about $\x$.\footnote{ 12 | A \textbf{neighborhood} about $\x$ is an open set which contains $\x$. 13 | } 14 | Furthermore, if $f(\x) \leq f(\y)$ for all $\y \in \calX$, then $\x$ is a \term{global minimum} of $f$ in $\calX$ (similarly for global maximum). 15 | If the phrase ``in $\calX$'' is unclear from context, assume we are optimizing over the whole domain of the function. 16 | 17 | The qualifier \term{strict} (as in e.g. a strict local minimum) means that the inequality sign in the definition is actually a $>$ or $<$, with equality not allowed. 18 | This indicates that the extremum is unique within some neighborhood. 19 | 20 | Observe that maximizing a function $f$ is equivalent to minimizing $-f$, so optimization problems are typically phrased in terms of minimization without loss of generality. 21 | This convention (which we follow here) eliminates the need to discuss minimization and maximization separately. 22 | 23 | \subsection{Gradients} 24 | The single most important concept from calculus in the context of machine learning is the \term{gradient}. 25 | Gradients generalize derivatives to scalar functions of several variables. 26 | The gradient of $f : \R^d \to \R$, denoted $\nabla f$, is given by 27 | \[\nabla f = \matlit{\pdv{f}{x_1} \\ \vdots \\ \pdv{f}{x_d}} 28 | \tab\text{i.e.}\tab 29 | [\nabla f]_i = \pdv{f}{x_i}\] 30 | Gradients have the following very important property: $\nabla f(\x)$ points in the direction of \term{steepest ascent} from $\x$. 31 | Similarly, $-\nabla f(\x)$ points in the direction of \term{steepest descent} from $\x$. 32 | We will use this fact frequently when iteratively minimizing a function via \term{gradient descent}.
33 | 34 | \subsection{The Jacobian} 35 | The \term{Jacobian} of $f : \R^n \to \R^m$ is a matrix of first-order partial derivatives: 36 | \[\mat{J}_f = \matlit{ 37 | \pdv{f_1}{x_1} & \hdots & \pdv{f_1}{x_n} \\ 38 | \vdots & \ddots & \vdots \\ 39 | \pdv{f_m}{x_1} & \hdots & \pdv{f_m}{x_n}} 40 | \tab\text{i.e.}\tab 41 | [\mat{J}_f]_{ij} = \pdv{f_i}{x_j}\] 42 | Note the special case $m = 1$, where $\nabla f = \mat{J}_f\tran$. 43 | 44 | \subsection{The Hessian} 45 | The \term{Hessian} matrix of $f : \R^d \to \R$ is a matrix of second-order partial derivatives: 46 | \[\nabla^2 f = \matlit{ 47 | \pdv[2]{f}{x_1} & \hdots & \pdv{f}{x_1}{x_d} \\ 48 | \vdots & \ddots & \vdots \\ 49 | \pdv{f}{x_d}{x_1} & \hdots & \pdv[2]{f}{x_d}} 50 | \tab\text{i.e.}\tab 51 | [\nabla^2 f]_{ij} = {\pdv{f}{x_i}{x_j}}\] 52 | Recall that if the partial derivatives are continuous, the order of differentiation can be interchanged (Clairaut's theorem), so the Hessian matrix will be symmetric. 53 | This will typically be the case for differentiable functions that we work with. 54 | 55 | The Hessian is used in some optimization algorithms such as Newton's method. 56 | It is expensive to calculate but can drastically reduce the number of iterations needed to converge to a local minimum by providing information about the curvature of $f$. 57 | 58 | \subsection{Matrix calculus} 59 | Since a lot of optimization reduces to finding points where the gradient vanishes, it is useful to have differentiation rules for matrix and vector expressions. 60 | We give some common rules here. 61 | Probably the two most important for our purposes are 62 | \begin{align*} 63 | \nabla_\x &(\vec{a}\tran\x) = \vec{a} \\ 64 | \nabla_\x &(\x\tran\A\x) = (\A + \A\tran)\x 65 | \end{align*} 66 | Note that this second rule is defined only if $\A$ is square. 67 | Furthermore, if $\A$ is symmetric, we can simplify the result to $2\A\x$. 

\subsubsection{The chain rule}
Most functions that we wish to optimize are not completely arbitrary functions, but rather are composed of simpler functions which we know how to handle.
The chain rule gives us a way to calculate derivatives for a composite function in terms of the derivatives of the simpler functions that make it up.

The chain rule from single-variable calculus should be familiar:
\[(f \circ g)'(x) = f'(g(x))g'(x)\]
where $\circ$ denotes function composition.
There is a natural generalization of this rule to multivariate functions.
\begin{proposition}
Suppose $f : \R^m \to \R^k$ and $g : \R^n \to \R^m$. Then $f \circ g : \R^n \to \R^k$ and
\[\mat{J}_{f \circ g}(\x) = \mat{J}_f(g(\x))\mat{J}_g(\x)\]
\end{proposition}
In the special case $k = 1$ we have the following corollary since $\nabla f = \mat{J}_f\tran$.
\begin{corollary}
Suppose $f : \R^m \to \R$ and $g : \R^n \to \R^m$. Then $f \circ g : \R^n \to \R$ and
\[\nabla (f \circ g)(\x) = \mat{J}_g(\x)\tran \nabla f(g(\x))\]
\end{corollary}

\subsection{Taylor's theorem}
Taylor's theorem has natural generalizations to functions of more than one variable.
We give the version presented in \cite{numopt}.
\begin{theorem}
(Taylor's theorem)
Suppose $f : \R^d \to \R$ is continuously differentiable, and let $\h \in \R^d$.
Then there exists $t \in (0,1)$ such that
\[f(\x + \h) = f(\x) + \nabla f(\x + t\h)\tran\h\]
Furthermore, if $f$ is twice continuously differentiable, then
\[\nabla f(\x + \h) = \nabla f(\x) + \int_0^1 \nabla^2 f(\x + t\h)\h \dd{t}\]
and there exists $t \in (0,1)$ such that
\[f(\x + \h) = f(\x) + \nabla f(\x)\tran\h + \frac{1}{2}\h\tran\nabla^2f(\x+t\h)\h\]
\end{theorem}
This theorem is used in proofs about conditions for local minima of unconstrained optimization problems.
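The chain-rule corollary above can also be verified numerically. In this sketch, $g : \R^2 \to \R^3$ and $f : \R^3 \to \R$ are arbitrary smooth functions chosen for illustration, with their Jacobian and gradient computed by hand:

```python
import numpy as np

def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def Jg(x):
    # hand-computed Jacobian of g (3x2)
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2 * x[1]]])

def f(y):
    return y[0] + y[1] * y[2]

def grad_f(y):
    # hand-computed gradient of f
    return np.array([1.0, y[2], y[1]])

def num_grad(h, x, eps=1e-6):
    # central finite differences
    out = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        out[i] = (h(x + e) - h(x - e)) / (2 * eps)
    return out

x = np.array([0.3, -1.2])
lhs = num_grad(lambda z: f(g(z)), x)    # grad of the composition
rhs = Jg(x).T @ grad_f(g(x))            # chain-rule corollary
print(np.allclose(lhs, rhs))  # True
```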
Some of the most important results are given in the next section.

\subsection{Conditions for local minima}
\begin{proposition}
If $\x^*$ is a local minimum of $f$ and $f$ is continuously differentiable in a neighborhood of $\x^*$, then $\nabla f(\x^*) = \vec{0}$.
\end{proposition}
\begin{proof}
Let $\x^*$ be a local minimum of $f$, and suppose towards a contradiction that $\nabla f(\x^*) \neq \vec{0}$.
Let $\h = -\nabla f(\x^*)$, noting that by the continuity of $\nabla f$ we have
\[\lim_{t \to 0} -\nabla f(\x^* + t\h) = -\nabla f(\x^*) = \h\]
Hence
\[\lim_{t \to 0} \h\tran\nabla f(\x^* + t\h) = \h\tran\nabla f(\x^*) = -\|\h\|_2^2 < 0\]
Thus there exists $T > 0$ such that $\h\tran\nabla f(\x^* + t\h) < 0$ for all $t \in [0,T]$.
Now we apply Taylor's theorem: for any $t \in (0,T]$, there exists $t' \in (0,t)$ such that
\[f(\x^* + t\h) = f(\x^*) + t\h\tran \nabla f(\x^* + t'\h) < f(\x^*)\]
whence it follows that $\x^*$ is not a local minimum, a contradiction.
Hence $\nabla f(\x^*) = \vec{0}$.
\end{proof}
The proof shows us why the vanishing gradient is necessary for an extremum: if $\nabla f(\x)$ is nonzero, there always exists a sufficiently small step $\alpha > 0$ such that $f(\x - \alpha\nabla f(\x)) < f(\x)$.
For this reason, $-\nabla f(\x)$ is called a \term{descent direction}.

Points where the gradient vanishes are called \term{stationary points}.
Note that not all stationary points are extrema.
Consider $f : \R^2 \to \R$ given by $f(x,y) = x^2 - y^2$.
We have $\nabla f(\vec{0}) = \vec{0}$, but the point $\vec{0}$ is the minimum along the line $y = 0$ and the maximum along the line $x = 0$.
Thus it is neither a local minimum nor a local maximum of $f$.
Points such as these, where the gradient vanishes but there is no local extremum, are called \term{saddle points}.
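The saddle-point example $f(x,y) = x^2 - y^2$ can be checked directly: the gradient vanishes at the origin, yet $f$ increases along one axis and decreases along the other (and the Hessian $\operatorname{diag}(2, -2)$ is indefinite):

```python
import numpy as np

f = lambda x, y: x ** 2 - y ** 2
grad = lambda x, y: np.array([2 * x, -2 * y])

# the origin is a stationary point
assert np.allclose(grad(0.0, 0.0), np.zeros(2))
# ... but f increases along the line y = 0
assert f(0.1, 0.0) > f(0.0, 0.0)
# ... and decreases along the line x = 0
assert f(0.0, 0.1) < f(0.0, 0.0)
# the Hessian diag(2, -2) has eigenvalues of both signs
print("saddle point")
```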

We have seen that first-order information (i.e. the gradient) is insufficient to characterize local minima.
But we can say more with second-order information (i.e. the Hessian).
First we prove a necessary second-order condition for local minima.
\begin{proposition}
If $\x^*$ is a local minimum of $f$ and $f$ is twice continuously differentiable in a neighborhood of $\x^*$, then $\nabla^2 f(\x^*)$ is positive semi-definite.
\end{proposition}
\begin{proof}
Let $\x^*$ be a local minimum of $f$, and suppose towards a contradiction that $\nabla^2 f(\x^*)$ is not positive semi-definite.
Let $\h$ be such that $\h\tran\nabla^2 f(\x^*)\h < 0$, noting that by the continuity of $\nabla^2 f$ we have
\[\lim_{t \to 0} \nabla^2 f(\x^* + t\h) = \nabla^2 f(\x^*)\]
Hence
\[\lim_{t \to 0} \h\tran\nabla^2 f(\x^* + t\h)\h = \h\tran\nabla^2 f(\x^*)\h < 0\]
Thus there exists $T > 0$ such that $\h\tran\nabla^2 f(\x^* + t\h)\h < 0$ for all $t \in [0,T]$.
Now we apply Taylor's theorem: for any $t \in (0,T]$, there exists $t' \in (0,t)$ such that
\[f(\x^* + t\h) = f(\x^*) + \underbrace{t\h\tran\nabla f(\x^*)}_0 + \frac{1}{2}t^2\h\tran\nabla^2 f(\x^* + t'\h)\h < f(\x^*)\]
where the middle term vanishes because $\nabla f(\x^*) = \vec{0}$ by the previous result.
It follows that $\x^*$ is not a local minimum, a contradiction.
Hence $\nabla^2 f(\x^*)$ is positive semi-definite.
\end{proof}
Now we give sufficient conditions for local minima.
\begin{proposition}
Suppose $f$ is twice continuously differentiable with $\nabla^2 f$ positive semi-definite in a neighborhood of $\x^*$, and that $\nabla f(\x^*) = \vec{0}$.
Then $\x^*$ is a local minimum of $f$.
Furthermore, if $\nabla^2 f(\x^*)$ is positive definite, then $\x^*$ is a strict local minimum.
\end{proposition}
\begin{proof}
Let $B$ be an open ball of radius $r > 0$ centered at $\x^*$ which is contained in the neighborhood.
Applying Taylor's theorem, we have that for any $\h$ with $\|\h\|_2 < r$, there exists $t \in (0,1)$ such that
\[f(\x^* + \h) = f(\x^*) + \underbrace{\h\tran\nabla f(\x^*)}_0 + \frac{1}{2}\h\tran\nabla^2 f(\x^* + t\h)\h \geq f(\x^*)\]
The last inequality holds because $\nabla^2 f(\x^* + t\h)$ is positive semi-definite (since $\|t\h\|_2 = t\|\h\|_2 < \|\h\|_2 < r$), so $\h\tran\nabla^2 f(\x^* + t\h)\h \geq 0$.
Since $f(\x^*) \leq f(\x^* + \h)$ for all directions $\h$ with $\|\h\|_2 < r$, we conclude that $\x^*$ is a local minimum.

Now further suppose that $\nabla^2 f(\x^*)$ is strictly positive definite.
Since the Hessian is continuous, we can choose another ball $B'$ with radius $r' > 0$ centered at $\x^*$ such that $\nabla^2 f(\x)$ is positive definite for all $\x \in B'$.
Then following the same argument as above (except with a strict inequality now since the Hessian is positive definite) we have $f(\x^* + \h) > f(\x^*)$ for all $\h$ with $0 < \|\h\|_2 < r'$.
Hence $\x^*$ is a strict local minimum.
\end{proof}
Note that, perhaps counterintuitively, the conditions $\nabla f(\x^*) = \vec{0}$ and $\nabla^2 f(\x^*)$ positive semi-definite are not enough to guarantee a local minimum at $\x^*$!
Consider the function $f(x) = x^3$.
We have $f'(0) = 0$ and $f''(0) = 0$ (so the Hessian, which in this case is the $1 \times 1$ matrix $\matlit{0}$, is positive semi-definite).
But $f$ has a saddle point at $x = 0$.
The function $f(x) = -x^4$ is an even worse offender -- it has the same gradient and Hessian at $x = 0$, but $x = 0$ is a strict local maximum for this function!

For these reasons we require that the Hessian remains positive semi-definite as long as we are close to $\x^*$.
Unfortunately, this condition is not practical to check computationally, but in some cases we can verify it analytically (usually by showing that $\nabla^2 f(\x)$ is p.s.d. for all $\x \in \R^d$).
Also, if $\nabla^2 f(\x^*)$ is strictly positive definite, the continuity of $\nabla^2 f$ implies this condition, so we don't have to worry.

\subsection{Convexity}
\input{cs189-convexity.tex}

\subsection{Orthogonal projections}
We now consider a particular kind of optimization problem that is especially well-understood and can often be solved in closed form: given some point $\x$ in an inner product space $V$, find the closest point to $\x$ in a subspace $S$ of $V$.
This process is referred to as \term{projection onto a subspace}.

The following diagram should make it geometrically clear that, at least in Euclidean space, the solution is intimately related to orthogonality and the Pythagorean theorem:
\begin{center}
\includegraphics[width=0.5\linewidth]{orthogonal-projection}
\end{center}
Here $\y$ is an arbitrary element of the subspace $S$, and $\y^*$ is the point in $S$ such that $\x-\y^*$ is perpendicular to $S$.
The hypotenuse of a right triangle (in this case $\|\x-\y\|$) is always longer than either of the legs (in this case $\|\x-\y^*\|$ and $\|\y^*-\y\|$), and when $\y \neq \y^*$ there always exists such a triangle between $\x$, $\y$, and $\y^*$.

Our intuition from Euclidean space suggests that the closest point to $\x$ in $S$ has the perpendicularity property described above, and we now show that this is indeed the case.
\begin{proposition}
Suppose $\x \in V$ and $\y^* \in S$.
Then $\y^*$ is the unique minimizer of $\|\x-\y\|$ over $\y \in S$ if and only if $\x-\y^* \perp S$.
\end{proposition}
\begin{proof}
$(\implies)$
Suppose $\y^*$ is the unique minimizer of $\|\x-\y\|$ over $\y \in S$.
That is, $\|\x-\y^*\| \leq \|\x-\y\|$ for all $\y \in S$, with equality only if $\y = \y^*$.
Fix $\vec{v} \in S$ and observe that
\begin{align*}
g(t) &:= \|\x-\y^*-t\vec{v}\|^2 \\
&= \inner{\x-\y^*-t\vec{v}}{\x-\y^*-t\vec{v}} \\
&= \inner{\x-\y^*}{\x-\y^*} - 2t\inner{\x-\y^*}{\vec{v}} + t^2\inner{\vec{v}}{\vec{v}} \\
&= \|\x-\y^*\|^2 - 2t\inner{\x-\y^*}{\vec{v}} + t^2\|\vec{v}\|^2
\end{align*}
must have a minimum at $t = 0$ as a consequence of this assumption, since $\y^* + t\vec{v} \in S$ for every $t$.
Thus
\[0 = g'(0) = \left.-2\inner{\x-\y^*}{\vec{v}} + 2t\|\vec{v}\|^2\right|_{t=0} = -2\inner{\x-\y^*}{\vec{v}}\]
giving $\x-\y^* \perp \vec{v}$.
Since $\vec{v}$ was arbitrary in $S$, we have $\x-\y^* \perp S$ as claimed.

$(\impliedby)$
Suppose $\x-\y^* \perp S$.
Observe that for any $\y \in S$, $\y^*-\y \in S$ because $\y^* \in S$ and $S$ is closed under subtraction.
Under the hypothesis, $\x-\y^* \perp \y^*-\y$, so by the Pythagorean theorem,
\[\|\x-\y\|^2 = \|\x-\y^*+\y^*-\y\|^2 = \|\x-\y^*\|^2 + \|\y^*-\y\|^2 \geq \|\x - \y^*\|^2\]
and in fact the inequality is strict when $\y \neq \y^*$ since this implies $\|\y^*-\y\|^2 > 0$.
Thus $\y^*$ is the unique minimizer of $\|\x-\y\|$ over $\y \in S$.
\end{proof}
Since a unique minimizer in $S$ can be found for any $\x \in V$, we can define an operator
\[P\x = \argmin_{\y \in S} \|\x-\y\|\]
Observe that $P\y = \y$ for any $\y \in S$, since $\y$ has distance zero from itself and every other point in $S$ has positive distance from $\y$.
Thus $P(P\x) = P\x$ for any $\x$ (i.e., $P^2 = P$) because $P\x \in S$.
The identity $P^2 = P$ is actually one of the defining properties of a \term{projection}, the other being linearity.
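A quick numerical illustration of these two properties in $\R^4$ (the subspace and vector are arbitrary; `np.linalg.lstsq` computes $\argmin_{\vec{w}} \|\x - \A\vec{w}\|$, so $P\x = \A\vec{w}^*$ when $S$ is the column space of $\A$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))   # columns span the subspace S
x = rng.standard_normal(4)

# P x = A w* where w* minimizes ||x - A w|| over w
w, *_ = np.linalg.lstsq(A, x, rcond=None)
Px = A @ w

# the residual x - Px is orthogonal to S (i.e. to every column of A)
assert np.allclose(A.T @ (x - Px), 0)

# idempotence: projecting Px again changes nothing, so P^2 = P
w2, *_ = np.linalg.lstsq(A, Px, rcond=None)
assert np.allclose(A @ w2, Px)
print("ok")
```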

An immediate consequence of the previous result is that $\x - P\x \perp S$ for any $\x \in V$, and conversely that $P$ is the unique operator that satisfies this property for all $\x \in V$.
For this reason, $P$ is known as an \term{orthogonal projection}.

If we choose an orthonormal basis for the target subspace $S$, it is possible to write down a more specific expression for $P$.
\begin{proposition}
If $\e_1, \dots, \e_m$ is an orthonormal basis for $S$, then
\[P\x = \sum_{i=1}^m \inner{\x}{\e_i}\e_i\]
\end{proposition}
\begin{proof}
Let $\e_1, \dots, \e_m$ be an orthonormal basis for $S$, and suppose $\x \in V$.
Then for all $j = 1, \dots, m$,
\begin{align*}
\biginner{\x-\sum_{i=1}^m \inner{\x}{\e_i}\e_i}{\e_j} &= \inner{\x}{\e_j} - \sum_{i=1}^m \inner{\x}{\e_i}\underbrace{\inner{\e_i}{\e_j}}_{\delta_{ij}} \\
&= \inner{\x}{\e_j} - \inner{\x}{\e_j} \\
&= 0
\end{align*}
We have shown that the claimed expression, call it $\tilde{P}\x$, satisfies $\x - \tilde{P}\x \perp \e_j$ for every element $\e_j$ of the orthonormal basis for $S$.
It follows (by linearity of the inner product) that $\x - \tilde{P}\x \perp S$, so the previous result implies $P = \tilde{P}$.
\end{proof}
The fact that $P$ is a linear operator (and thus a proper projection, as earlier we showed $P^2 = P$) follows readily from this result.

%Another useful fact about the orthogonal projection operator is that the metric it induces is \term{non-expansive}, i.e. $1$-Lipschitz.
%\begin{proposition}
%For any $\x \in V$,
%\[\|P\x\| \leq \|\x\|\]
%Thus for any $\x, \xye \in V$,
%\[\|P\x - P\xye\| \leq \|\x-\xye\|\]
%\end{proposition}
%\begin{proof}
%Suppose $\x \in V$.
%Then
%\[\|P\x\|^2 = \inner{P\x}{P\x} = \inner{\x}{P^2\x} = \inner{\x}{P\x} \leq \|\x\|\|P\x\|\]
%using respectively the self-adjointness of $P$, the fact that $P^2 = P$, and the Cauchy-Schwarz inequality.
%If $\|P\x\| = 0$, the inequality holds vacuously; otherwise we can divide both sides by $\|P\x\|$ to obtain $\|P\x\| \leq \|\x\|$.
%
%The second statement follows immediately from the first by linearity of $P$.
%\end{proof}

--------------------------------------------------------------------------------
/cs189-probability.tex:
--------------------------------------------------------------------------------
Probability theory provides powerful tools for modeling and dealing with uncertainty.

\subsection{Basics}
Suppose we have some sort of randomized experiment (e.g. a coin toss, die roll) that has a fixed set of possible outcomes.
This set is called the \term{sample space} and denoted $\Omega$.

We would like to define probabilities for some \term{events}, which are subsets of $\Omega$.
The set of events is denoted $\calF$.\footnote{
$\calF$ is required to be a $\sigma$-algebra for technical reasons; see \cite{rigorousprob}.
}
The \term{complement} of the event $A$ is another event, $A\comp = \Omega \setminus A$.

Then we can define a \term{probability measure} $\P : \calF \to [0,1]$ which must satisfy
\begin{enumerate}[(i)]
\item $\pr{\Omega} = 1$
\item \term{Countable additivity}: for any countable collection of disjoint sets $\{A_i\} \subseteq \calF$,
\[\prbigg{\bigcup_i A_i} = \sum_i \pr{A_i}\]
\end{enumerate}
The triple $(\Omega, \calF, \P)$ is called a \term{probability space}.\footnote{
Note that a probability space is simply a measure space in which the measure of the whole space equals 1.
}

If $\pr{A} = 1$, we say that $A$ occurs \term{almost surely} (often abbreviated a.s.),\footnote{
This is a probabilist's version of the measure-theoretic term \textit{almost everywhere}.
} and conversely $A$ occurs \term{almost never} if $\pr{A} = 0$.

From these axioms, a number of useful rules can be derived.
\begin{proposition}
Let $A$ be an event. Then
\begin{enumerate}[(i)]
\item $\pr{A\comp} = 1 - \pr{A}$.
\item If $B$ is an event and $B \subseteq A$, then $\pr{B} \leq \pr{A}$.
\item $0 = \pr{\varnothing} \leq \pr{A} \leq \pr{\Omega} = 1$
\end{enumerate}
\end{proposition}
\begin{proof}
(i) Using the countable additivity of $\P$, we have
\[\pr{A} + \pr{A\comp} = \pr{A \dotcup A\comp} = \pr{\Omega} = 1\]

To show (ii), suppose $B \in \calF$ and $B \subseteq A$. Then
\[\pr{A} = \pr{B \dotcup (A \setminus B)} = \pr{B} + \pr{A \setminus B} \geq \pr{B}\]
as claimed.

For (iii): the middle inequality follows from (ii) since $\varnothing \subseteq A \subseteq \Omega$.
We also have
\[\pr{\varnothing} = \pr{\varnothing \dotcup \varnothing} = \pr{\varnothing} + \pr{\varnothing}\]
by countable additivity, which shows $\pr{\varnothing} = 0$.
\end{proof}

\begin{proposition}
If $A$ and $B$ are events, then $\pr{A \cup B} = \pr{A} + \pr{B} - \pr{A \cap B}$.
\end{proposition}
\begin{proof}
The key is to break the events up into their various overlapping and non-overlapping parts.
\begin{align*}
\pr{A \cup B} &= \pr{(A \cap B) \dotcup (A \setminus B) \dotcup (B \setminus A)} \\
&= \pr{A \cap B} + \pr{A \setminus B} + \pr{B \setminus A} \\
&= \pr{A \cap B} + \pr{A} - \pr{A \cap B} + \pr{B} - \pr{A \cap B} \\
&= \pr{A} + \pr{B} - \pr{A \cap B}
\end{align*}
\end{proof}

\begin{proposition}
If $\{A_i\} \subseteq \calF$ is a countable set of events, disjoint or not, then
\[\prbigg{\bigcup_i A_i} \leq \sum_i \pr{A_i}\]
\end{proposition}
This inequality is sometimes referred to as \term{Boole's inequality} or the \term{union bound}.
\begin{proof}
Define $B_1 = A_1$ and $B_i = A_i \setminus (\bigcup_{j < i} A_j)$ for $i > 1$, noting that $\bigcup_{j \leq i} B_j = \bigcup_{j \leq i} A_j$ for all $i$ and the $B_i$ are disjoint.
Then
\[\prbigg{\bigcup_i A_i} = \prbigg{\bigcup_i B_i} = \sum_i \pr{B_i} \leq \sum_i \pr{A_i}\]
where the last inequality follows by monotonicity since $B_i \subseteq A_i$ for all $i$.
\end{proof}

\subsubsection{Conditional probability}
The \term{conditional probability} of event $A$ given that event $B$ has occurred is written $\pr{A \given B}$ and defined as
\[\pr{A \given B} = \frac{\pr{A \cap B}}{\pr{B}}\]
assuming $\pr{B} > 0$.\footnote{
In some cases it is possible to define conditional probability on events of probability zero, but this is significantly more technical so we omit it.
}

\subsubsection{Chain rule}
Another very useful tool, the \term{chain rule}, follows immediately from this definition:
\[\pr{A \cap B} = \pr{A \given B}\pr{B} = \pr{B \given A}\pr{A}\]

\subsubsection{Bayes' rule}
Taking the equality from above one step further, we arrive at the simple but crucial \term{Bayes' rule}:
\[\pr{A \given B} = \frac{\pr{B \given A}\pr{A}}{\pr{B}}\]
It is sometimes beneficial to omit the normalizing constant and write
\[\pr{A \given B} \propto \pr{A}\pr{B \given A}\]
Under this formulation, $\pr{A}$ is often referred to as the \term{prior}, $\pr{A \given B}$ as the \term{posterior}, and $\pr{B \given A}$ as the \term{likelihood}.

In the context of machine learning, we can use Bayes' rule to update our ``beliefs'' (e.g. values of our model parameters) given some data that we've observed.

\subsection{Random variables}
A \term{random variable} is some uncertain quantity with an associated probability distribution over the values it can assume.

Formally, a random variable on a probability space $(\Omega, \calF, \P)$ is a function\footnote{
The function must be measurable.
} $X: \Omega \to \R$.\footnote{
More generally, the codomain can be any measurable space, but $\R$ is the most common case by far and sufficient for our purposes.
}

We denote the range of $X$ by $X(\Omega) = \{X(\omega) : \omega \in \Omega\}$.
To give a concrete example (taken from \cite{pitman}), suppose $X$ is the number of heads in two tosses of a fair coin.
The sample space is
\[\Omega = \{hh, tt, ht, th\}\]
and $X$ is determined completely by the outcome $\omega$, i.e. $X = X(\omega)$.
For example, the event $X = 1$ is the set of outcomes $\{ht, th\}$.

It is common to talk about the values of a random variable without directly referencing its sample space.
The two are related by the following definition: the event that the value of $X$ lies in some set $S \subseteq \R$ is
\[X \in S = \{\omega \in \Omega : X(\omega) \in S\}\]
Note that special cases of this definition include $X$ being equal to, less than, or greater than some specified value.
For example,
\[\pr{X = x} = \pr{\{\omega \in \Omega : X(\omega) = x\}}\]

A word on notation: we write $p(X)$ to denote the entire probability distribution of $X$ and $p(x)$ for the evaluation of the function $p$ at a particular value $x \in X(\Omega)$.
Hopefully this (reasonably standard) abuse of notation is not too distracting.
If $p$ is parameterized by some parameters $\vec{\theta}$, we write $p(X; \vec{\theta})$ or $p(x; \vec{\theta})$, unless we are in a Bayesian setting where the parameters are considered a random variable, in which case we condition on the parameters.

\subsubsection{The cumulative distribution function}
The \term{cumulative distribution function} (c.d.f.) gives the probability that a random variable is at most a certain value:
\[F(x) = \pr{X \leq x}\]
The c.d.f. can be used to give the probability that a variable lies within a certain range:
\[\pr{a < X \leq b} = F(b) - F(a)\]

\subsubsection{Discrete random variables}
A \term{discrete random variable} is a random variable that has a countable range and assumes each value in this range with positive probability.
Discrete random variables are completely specified by their \term{probability mass function} (p.m.f.)
$p : X(\Omega) \to [0,1]$ which satisfies
\[\sum_{x \in X(\Omega)} p(x) = 1\]
For a discrete $X$, the probability of a particular value is given exactly by its p.m.f.:
\[\pr{X = x} = p(x)\]

\subsubsection{Continuous random variables}
A \term{continuous random variable} is a random variable that has an uncountable range and assumes each value in this range with probability zero.
Most of the continuous random variables that one would encounter in practice are \term{absolutely continuous random variables}\footnote{
Random variables that are continuous but not absolutely continuous are called \term{singular random variables}.
We will not discuss them, assuming rather that all continuous random variables admit a density function.
}, which means that there exists a function $p : \R \to [0,\infty)$ that satisfies
\[F(x) \equiv \int_{-\infty}^x p(z)\dd{z}\]
The function $p$ is called a \term{probability density function} (abbreviated p.d.f.) and must satisfy
\[\int_{-\infty}^\infty p(x)\dd{x} = 1\]
The values of this function are not themselves probabilities, since they could exceed 1.
However, they do have a couple of reasonable interpretations.
One is as relative probabilities; even though the probability of each particular value being picked is technically zero, some points are still in a sense more likely than others.

One can also think of the density as determining the probability that the variable will lie in a small range about a given value.
This is because, for small $\epsilon > 0$,
\[\pr{x-\epsilon \leq X \leq x+\epsilon} = \int_{x-\epsilon}^{x+\epsilon} p(z)\dd{z} \approx 2\epsilon p(x)\]
using a midpoint approximation to the integral.
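This small-interval approximation is easy to check for a concrete density. Taking a standard normal (whose c.d.f. can be written exactly with the error function), the approximation $2\epsilon p(x)$ agrees with the true interval probability to high precision:

```python
import math

# standard normal density and c.d.f. (the latter via the error function)
p = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
F = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

x, eps = 0.7, 1e-3
exact = F(x + eps) - F(x - eps)   # P(x - eps <= X <= x + eps)
approx = 2 * eps * p(x)           # midpoint approximation

print(abs(exact - approx) < 1e-8)  # True: agreement to high precision
```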

Here are some useful identities that follow from the definitions above:
\begin{align*}
\pr{a \leq X \leq b} &= \int_a^b p(x)\dd{x} \\
p(x) &= F'(x)
\end{align*}

\subsubsection{Other kinds of random variables}
There are random variables that are neither discrete nor continuous.
For example, consider a random variable determined as follows:
flip a fair coin; if it comes up heads, the value is zero, and otherwise draw a number uniformly at random from $[1,2]$.
Such a random variable can take on uncountably many values, but only finitely many of these with positive probability.
We will not discuss such random variables because they are rather pathological and require measure theory to analyze.

\subsection{Joint distributions}
Often we have several random variables and we would like to get a distribution over some combination of them.
A \term{joint distribution} is exactly this.
For some random variables $X_1, \dots, X_n$, the joint distribution is written $p(X_1, \dots, X_n)$ and gives probabilities over entire assignments to all the $X_i$ simultaneously.

\subsubsection{Independence of random variables}
We say that two variables $X$ and $Y$ are \term{independent} if their joint distribution factors into their respective distributions, i.e.
\[p(X, Y) = p(X)p(Y)\]
We can also define independence for more than two random variables, although it is more complicated.
Let $\{X_i\}_{i \in I}$ be a collection of random variables indexed by $I$, which may be infinite.
Then $\{X_i\}$ are independent if for every finite subset of indices $i_1, \dots, i_k \in I$ we have
\[p(X_{i_1}, \dots, X_{i_k}) = \prod_{j=1}^k p(X_{i_j})\]
For example, in the case of three random variables $X, Y, Z$, we require that $p(X,Y,Z) = p(X)p(Y)p(Z)$ as well as $p(X,Y) = p(X)p(Y)$, $p(X,Z) = p(X)p(Z)$, and $p(Y,Z) = p(Y)p(Z)$.
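As a concrete instance of the factorization $p(X, Y) = p(X)p(Y)$, consider two independent fair dice: every joint probability is $\frac{1}{36} = \frac{1}{6} \cdot \frac{1}{6}$, and summing the joint over one die recovers the other's distribution (exact arithmetic via `fractions`):

```python
from fractions import Fraction

p_x = {i: Fraction(1, 6) for i in range(1, 7)}
p_y = {j: Fraction(1, 6) for j in range(1, 7)}
# joint distribution of two independent fair dice
joint = {(i, j): Fraction(1, 36) for i in range(1, 7) for j in range(1, 7)}

# the joint factors into the two marginals
assert all(joint[i, j] == p_x[i] * p_y[j]
           for i in range(1, 7) for j in range(1, 7))
# "summing out" the second die recovers p(X)
assert all(sum(joint[i, j] for j in range(1, 7)) == p_x[i]
           for i in range(1, 7))
print("independent")
```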

It is often convenient (though perhaps questionable) to assume that a bunch of random variables are \term{independent and identically distributed} (i.i.d.) so that their joint distribution can be factored entirely:
\[p(X_1, \dots, X_n) = \prod_{i=1}^n p(X_i)\]
where $X_1, \dots, X_n$ all share the same p.m.f./p.d.f.

\subsubsection{Marginal distributions}
If we have a joint distribution over some set of random variables, it is possible to obtain a distribution for a subset of them by ``summing out'' (or ``integrating out'' in the continuous case) the variables we don't care about:
\[p(X) = \sum_{y} p(X, y)\]

\subsection{Great Expectations}
If we have some random variable $X$, we might be interested in knowing what the ``average'' value of $X$ is.
This concept is captured by the \term{expected value} (or \term{mean}) $\ev{X}$, which is defined as
\[\ev{X} = \sum_{x \in X(\Omega)} xp(x)\]
for discrete $X$ and as
\[\ev{X} = \int_{-\infty}^\infty xp(x)\dd{x}\]
for continuous $X$.

In words, we are taking a weighted sum of the values that $X$ can take on, where the weights are the probabilities of those respective values.
The expected value has a physical interpretation as the ``center of mass'' of the distribution.

\subsubsection{Properties of expected value}
A very useful property of expectation is that of linearity:
\[\bigev{\sum_{i=1}^n \alpha_i X_i + \beta} = \sum_{i=1}^n \alpha_i \ev{X_i} + \beta\]
Note that this holds even if the $X_i$ are not independent!

But if they are independent, the product rule also holds:
\[\bigev{\prod_{i=1}^n X_i} = \prod_{i=1}^n \ev{X_i}\]

\subsection{Variance}
Expectation provides a measure of the ``center'' of a distribution, but frequently we are also interested in what the ``spread'' is about that center.
We define the variance $\var{X}$ of a random variable $X$ by
\[\var{X} = \bigev{\left(X - \ev{X}\right)^2}\]
In words, this is the average squared deviation of the values of $X$ from the mean of $X$.
Using a little algebra and the linearity of expectation, it is straightforward to show that
\[\var{X} = \ev{X^2} - \ev{X}^2\]

\subsubsection{Properties of variance}
Variance is not linear (because of the squaring in the definition), but one can show the following:
\[\var{\alpha X + \beta} = \alpha^2 \var{X}\]
Basically, multiplicative constants become squared when they are pulled out, and additive constants disappear (since the variance contributed by a constant is zero).

Furthermore, if $X_1, \dots, X_n$ are uncorrelated\footnote{
We haven't defined this yet; see the Correlation section below.
}, then
\[\var{X_1 + \dots + X_n} = \var{X_1} + \dots + \var{X_n}\]

\subsubsection{Standard deviation}
Variance is a useful notion, but it suffers from the fact that the units of variance are not the same as the units of the random variable (again because of the squaring).
To overcome this problem we can use the \term{standard deviation}, which is defined as $\sqrt{\var{X}}$.
The standard deviation of $X$ has the same units as $X$.

\subsection{Covariance}
Covariance is a measure of the linear relationship between two random variables.
We denote the covariance between $X$ and $Y$ by $\cov{X}{Y}$, and it is defined to be
\[\cov{X}{Y} = \ev{(X-\ev{X})(Y-\ev{Y})}\]
Note that the outer expectation must be taken over the joint distribution of $X$ and $Y$.

Again, the linearity of expectation allows us to rewrite this as
\[\cov{X}{Y} = \ev{XY} - \ev{X}\ev{Y}\]
Comparing these formulas to the ones for variance, it is not hard to see that $\var{X} = \cov{X}{X}$.
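The identity $\var{X} = \ev{X^2} - \ev{X}^2$ can be confirmed in exact arithmetic on a fair six-sided die (a standard worked example: $\ev{X} = \frac{7}{2}$ and $\var{X} = \frac{35}{12}$):

```python
from fractions import Fraction

vals = range(1, 7)          # faces of a fair die
p = Fraction(1, 6)          # uniform p.m.f.
E = lambda f: sum(p * f(x) for x in vals)   # expectation of f(X)

mean = E(lambda x: x)
var_def = E(lambda x: (x - mean) ** 2)      # definition of variance
var_alt = E(lambda x: x * x) - mean ** 2    # E[X^2] - E[X]^2

assert mean == Fraction(7, 2)
assert var_def == var_alt == Fraction(35, 12)
print(var_def)  # 35/12
```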

A useful property of covariance is \term{bilinearity}:
\begin{align*}
\cov{\alpha X + \beta Y}{Z} &= \alpha\cov{X}{Z} + \beta\cov{Y}{Z} \\
\cov{X}{\alpha Y + \beta Z} &= \alpha\cov{X}{Y} + \beta\cov{X}{Z}
\end{align*}

\subsubsection{Correlation}
Normalizing the covariance gives the \term{correlation}:
\[\rho(X, Y) = \frac{\cov{X}{Y}}{\sqrt{\var{X}\var{Y}}}\]
Correlation also measures the linear relationship between two variables, but unlike covariance it always lies between $-1$ and $1$.

Two variables are said to be \term{uncorrelated} if $\cov{X}{Y} = 0$; note that $\cov{X}{Y} = 0$ implies $\rho(X, Y) = 0$.
If two variables are independent, then they are uncorrelated, but the converse does not hold in general.

\subsection{Random vectors}
So far we have been talking about \term{univariate distributions}, that is, distributions of single variables.
But we can also talk about \term{multivariate distributions}, which give distributions of \term{random vectors}:
\[\bX = \matlit{X_1 \\ \vdots \\ X_n}\]
The summarizing quantities we have discussed for single variables have natural generalizations to the multivariate case.

Expectation of a random vector is simply the expectation applied to each component:
\[\ev{\bX} = \matlit{\ev{X_1} \\ \vdots \\ \ev{X_n}}\]

The variance is generalized by the \term{covariance matrix}:
\[\mat{\Sigma} = \ev{(\bX - \ev{\bX})(\bX - \ev{\bX})\tran} = \matlit{
\var{X_1} & \cov{X_1}{X_2} & \hdots & \cov{X_1}{X_n} \\
\cov{X_2}{X_1} & \var{X_2} & \hdots & \cov{X_2}{X_n} \\
\vdots & \vdots & \ddots & \vdots \\
\cov{X_n}{X_1} & \cov{X_n}{X_2} & \hdots & \var{X_n}
}\]
That is, $\Sigma_{ij} = \cov{X_i}{X_j}$.
Since covariance is symmetric in its arguments, the covariance matrix is also symmetric.
It's also positive semi-definite: for any $\x$,
\[\x\tran\mat{\Sigma}\x = \x\tran\ev{(\bX - \ev{\bX})(\bX - \ev{\bX})\tran}\x = \ev{\x\tran(\bX - \ev{\bX})(\bX - \ev{\bX})\tran\x} = \ev{((\bX - \ev{\bX})\tran\x)^2} \geq 0\]
The inverse of the covariance matrix, $\mat{\Sigma}\inv$, is sometimes called the \term{precision matrix}.

\subsection{Estimation of Parameters}
Now we get into some basic topics from statistics.
We make some assumptions about our problem by prescribing a \term{parametric} model (e.g. a distribution that describes how the data were generated), then we fit the parameters of the model to the data.
How do we choose the values of the parameters?

\subsubsection{Maximum likelihood estimation}
A common way to fit parameters is \term{maximum likelihood estimation} (MLE).
The basic principle of MLE is to choose values that ``explain'' the data best by maximizing the probability/density of the data we've seen as a function of the parameters.
Suppose we have random variables $X_1, \dots, X_n$ and corresponding observations $x_1, \dots, x_n$.
Then
\[\hat{\vec{\theta}}_\textsc{mle} = \argmax_\vec{\theta} \calL(\vec{\theta})\]
where $\calL$ is the \term{likelihood function}
\[\calL(\vec{\theta}) = p(x_1, \dots, x_n; \vec{\theta})\]
Often, we assume that $X_1, \dots, X_n$ are i.i.d. Then we can write
\[p(x_1, \dots, x_n; \vec{\theta}) = \prod_{i=1}^n p(x_i; \vec{\theta})\]
At this point, it is usually convenient to take logs, giving rise to the \term{log-likelihood}
\[\log\calL(\vec{\theta}) = \sum_{i=1}^n \log p(x_i; \vec{\theta})\]
This is a valid operation because the probabilities/densities are assumed to be positive, and since log is a monotonically increasing function, it preserves ordering.
In other words, any maximizer of $\log\calL$ will also maximize $\calL$.
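As a small worked example (the Bernoulli model and the made-up observations are illustrative choices, not anything from the text): for i.i.d. Bernoulli data, the log-likelihood is $\sum_i \left[x_i \log\theta + (1-x_i)\log(1-\theta)\right]$, and setting its derivative to zero gives $\hat\theta_\textsc{mle} = \frac{1}{n}\sum_i x_i$. A brute-force grid search agrees:

```python
import math

xs = [1, 0, 1, 1, 0, 1, 1, 0]   # made-up Bernoulli observations

def log_lik(theta):
    # log-likelihood of i.i.d. Bernoulli(theta) data
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta)
               for x in xs)

grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)

print(theta_hat, sum(xs) / len(xs))  # 0.625 0.625: grid argmax = sample mean
```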
295 | 296 | For some distributions, it is possible to analytically solve for the maximum likelihood estimator. 297 | If $\log\calL$ is differentiable, setting the derivatives to zero and trying to solve for $\vec{\theta}$ is a good place to start. 298 | 299 | \subsubsection{Maximum a posteriori estimation} 300 | A more Bayesian way to fit parameters is through \term{maximum a posteriori estimation} (MAP). 301 | In this technique we assume that the parameters are themselves random variables, and we specify a prior distribution $p(\vec{\theta})$. 302 | Then we can employ Bayes' rule to compute the posterior distribution of the parameters given the observed data: 303 | \[p(\vec{\theta} \given x_1, \dots, x_n) \propto p(\vec{\theta})p(x_1, \dots, x_n \given \vec{\theta})\] 304 | Computing the normalizing constant is often intractable, because it involves integrating over the parameter space, which may be very high-dimensional. 305 | Fortunately, if we just want the MAP estimate, we don't care about the normalizing constant! 306 | It does not affect which values of $\vec{\theta}$ maximize the posterior. 307 | So we have 308 | \[\hat{\vec{\theta}}_\textsc{map} = \argmax_\vec{\theta} p(\vec{\theta})p(x_1, \dots, x_n \given \vec{\theta})\] 309 | Again, if we assume the observations are i.i.d., then we can express this in the equivalent, and possibly friendlier, form 310 | \[\hat{\vec{\theta}}_\textsc{map} = \argmax_\vec{\theta} \left(\log p(\vec{\theta}) + \sum_{i=1}^n \log p(x_i \given \vec{\theta})\right)\] 311 | A particularly nice case is when the prior is chosen carefully such that the posterior comes from the same family as the prior. 312 | In this case the prior is called a \term{conjugate prior}. 313 | For example, if the likelihood is binomial and the prior is beta, the posterior is also beta.
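The beta-binomial case can be checked numerically. In this hedged sketch (Python; the particular counts and the Beta$(2,2)$ prior are made-up illustrations), the MAP estimate of a coin's heads probability $\theta$ under a Beta$(a, b)$ prior has the closed form $(k + a - 1)/(n + a + b - 2)$, which we compare against a brute-force grid maximization of the unnormalized log-posterior:

```python
import math

def log_posterior(theta, k, n, a, b):
    """Unnormalized log posterior: Beta(a, b) prior times binomial
    likelihood for k heads in n flips (constants dropped)."""
    return ((a - 1 + k) * math.log(theta)
            + (b - 1 + n - k) * math.log(1 - theta))

k, n, a, b = 7, 10, 2.0, 2.0            # 7 heads in 10 flips, Beta(2, 2) prior
closed_form = (k + a - 1) / (n + a + b - 2)

grid = [i / 1000 for i in range(1, 1000)]
numerical = max(grid, key=lambda t: log_posterior(t, k, n, a, b))
assert abs(numerical - closed_form) < 1e-3
```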
314 | There are many conjugate priors; the reader may find this \href{https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions}{table of conjugate priors} useful. 315 | 316 | \subsection{The Gaussian distribution} 317 | There are many distributions, but one of particular importance is the \term{Gaussian distribution}, also known as the \term{normal distribution}. 318 | It is a continuous distribution, parameterized by its mean $\bm\mu \in \R^d$ and positive-definite covariance matrix $\mat{\Sigma} \in \R^{d \times d}$, with density 319 | \[p(\x; \bm\mu, \mat{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d \det(\mat{\Sigma})}}\exp\left(-\frac{1}{2}(\x - \bm\mu)\tran\mat{\Sigma}\inv(\x - \bm\mu)\right)\] 320 | Note that in the special case $d = 1$, the density is written in the more recognizable form 321 | \[p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\] 322 | We write $\vec{X} \sim \calN(\bm\mu, \mat{\Sigma})$ to denote that $\vec{X}$ is normally distributed with mean $\bm\mu$ and variance $\mat{\Sigma}$. 323 | 324 | \subsubsection{The geometry of multivariate Gaussians} 325 | The geometry of the multivariate Gaussian density is intimately related to the geometry of positive definite quadratic forms, so make sure the material in that section is well-understood before tackling this section. 326 | 327 | First observe that the p.d.f. of the multivariate Gaussian can be rewritten as 328 | \[p(\x; \bm\mu, \mat{\Sigma}) = g(\xye\tran\mat{\Sigma}\inv\xye)\] 329 | where $\xye = \x - \bm\mu$ and $g(z) = [(2\pi)^d \det(\mat{\Sigma})]^{-\frac{1}{2}}\exp\left(-\frac{z}{2}\right)$. 330 | Writing the density in this way, we see that after shifting by the mean $\bm\mu$, the density is really just a simple function of its precision matrix's quadratic form. 331 | 332 | Here is a key observation: this function $g$ is \term{strictly monotonically decreasing} in its argument. 333 | That is, $g(a) > g(b)$ whenever $a < b$. 
334 | Therefore, small values of $\xye\tran\mat{\Sigma}\inv\xye$ (which generally correspond to points where $\xye$ is closer to $\vec{0}$, i.e. $\x \approx \bm\mu$) have relatively high probability densities, and vice-versa. 335 | Furthermore, because $g$ is \textit{strictly} monotonic, it is injective, so the $c$-isocontours of $p(\x; \bm\mu, \mat{\Sigma})$ are the $g\inv(c)$-isocontours of the function $\x \mapsto \xye\tran\mat{\Sigma}\inv\xye$. 336 | That is, for any $c$, 337 | \[\{\x \in \R^d : p(\x; \bm\mu, \mat{\Sigma}) = c\} = \{\x \in \R^d : \xye\tran\mat{\Sigma}\inv\xye = g\inv(c)\}\] 338 | In words, these functions have the same isocontours but different isovalues. 339 | 340 | Recall the executive summary of the geometry of positive definite quadratic forms: the isocontours of $f(\x) = \x\tran\A\x$ are ellipsoids such that the axes point in the directions of the eigenvectors of $\A$, and the lengths of these axes are proportional to the inverse square roots of the corresponding eigenvalues. 341 | Therefore in this case, the isocontours of the density are ellipsoids (centered at $\bm\mu$) with axis lengths proportional to the inverse square roots of the eigenvalues of $\mat{\Sigma}\inv$, or equivalently, the square roots of the eigenvalues of $\mat{\Sigma}$. 342 | -------------------------------------------------------------------------------- /measure-probability.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | \title{Measure and Probability Theory} 3 | 4 | \input{common.tex} 5 | 6 | \newcommand{\carath}{Carath\'{e}odory} 7 | 8 | \begin{document} 9 | \maketitle 10 | 11 | \section{About} 12 | This document is part of a series of mathematical notes available at \url{https://gwthomas.github.io/math4ml}. 13 | You are free to distribute it as you wish. 14 | Please report any mistakes to \url{gwthomas@berkeley.edu}. 
15 | 16 | Measure theory is concerned with the problem of assigning a mathematically consistent notion of size to sets. 17 | We care about measure theory because of its use in the modern, rigorous formulation of probability given by Kolmogorov. 18 | 19 | \section{Collections of sets} 20 | We would like to assign measures to various subsets of $\R^n$ characterizing their size. 21 | Ideally our measure $\mu$ would satisfy 22 | \begin{enumerate}[(i)] 23 | \item For any countable collection of disjoint sets $E_1, E_2, \dots \subseteq \R^n$, 24 | \[\mu\bigg(\bigcup_i E_i\bigg) = \sum_i \mu(E_i)\] 25 | \item If two sets $E, F \subseteq \R^n$ are such that $E$ can be transformed into $F$ by rigid transformations, then $\mu(E) = \mu(F)$. 26 | \item The measure of the unit cube is 1. 27 | \end{enumerate} 28 | The first property, called \term{countable additivity}, just means that if you partition a set into countably many parts, the sum of the parts' measures equals the original set's measure. 29 | The requirement that additivity hold for countable collections (as opposed to just finite collections) is important for proving limit theorems. 30 | 31 | Unfortunately, one can show that these three properties are incompatible if we allow arbitrary subsets of $\R^n$. 32 | The solution in measure theory is to restrict ourselves to some ``reasonable'' collection of subsets. 33 | 34 | Recall that the \term{powerset} of a set $\Omega$ is the set of all subsets of $\Omega$, i.e. 35 | \[\calP(\Omega) = \{S : S \subseteq \Omega\}\] 36 | Note that in particular $\varnothing, \Omega \in \calP(\Omega)$ for any set $\Omega$. 37 | 38 | In the remainder we will consider collections of subsets of $\Omega$; in other words, these collections are subsets of $\calP(\Omega)$. 39 | We will make certain requirements of these collections so that we have some structure to work with. 
40 | In particular, we choose the collections so that the properties above hold, not for arbitrary subsets of $\Omega$, but for any sets in the collection. 41 | 42 | \subsection{Algebras and $\sigma$-algebras} 43 | Let $\Omega$ be a non-empty set. 44 | Then $\calA \subseteq \calP(\Omega)$ is an algebra on $\Omega$ if 45 | \begin{enumerate}[(i)] 46 | \item $\calA$ is non-empty. 47 | \item If $E \in \calA$, then $E\comp = \Omega \setminus E \in \calA$. 48 | \item If $E_1, \dots, E_n \in \calA$, then $\bigcup_{i=1}^n E_i \in \calA$. 49 | \end{enumerate} 50 | The second property states that $\calA$ is \term{closed under complements}. 51 | Using de Morgan's laws, properties 2 and 3 collectively imply that $\calA$ is closed under finite intersections as well, since 52 | \[\bigcap_{i=1}^n E_i = \bigg(\bigcup_{i=1}^n E_i\comp\bigg)\comp\] 53 | Then we must have $\varnothing \in \calA$; since $\calA$ is non-empty there exists some $E \in \calA$, so $E\comp \in \calA$, and hence $\varnothing = E \cap E\comp \in \calA$. 54 | 55 | In light of the desirability of countable additivity, we would like the collection of subsets we consider to be closed under unions of countably many sets, not just finitely many. 56 | Thus we need to strengthen condition 3, and arrive at the following definition: a \term{$\sigma$-algebra} is an algebra that is closed under countable unions. 57 | It follows by the same reasoning as above that a $\sigma$-algebra is also closed under countable intersections. 58 | 59 | Note that $\{\varnothing, \Omega\}$ and $\calP(\Omega)$ are $\sigma$-algebras for any $\Omega$, and moreover they are respectively the smallest and largest possible $\sigma$-algebras. 60 | 61 | If $\calC \subseteq \calP(\Omega)$ is any collection of subsets of $\Omega$, there exists a unique smallest $\sigma$-algebra containing $\calC$; this is called the \term{$\sigma$-algebra generated by $\calC$} and denoted $\sigma(\calC)$. 
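On a finite $\Omega$, where countable unions reduce to finite ones, the defining properties of a $\sigma$-algebra can be checked mechanically. A hedged sketch in Python (the helper `is_sigma_algebra` is our own illustration, not part of the notes):

```python
from itertools import chain, combinations

def is_sigma_algebra(omega, collection):
    """On a finite omega, check: non-empty, closed under complements,
    and closed under pairwise (hence all finite) unions."""
    A = {frozenset(s) for s in collection}
    if not A:
        return False
    if any((omega - s) not in A for s in A):
        return False
    return all((s | t) in A for s in A for t in A)

omega = frozenset({1, 2, 3})
powerset = {frozenset(c) for c in
            chain.from_iterable(combinations(omega, r) for r in range(4))}

assert is_sigma_algebra(omega, [frozenset(), omega])            # smallest
assert is_sigma_algebra(omega, powerset)                        # largest
assert not is_sigma_algebra(omega, [frozenset(), frozenset({1})])  # no complement of {1}
```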
62 | 63 | %The $\sigma$-algebra generated by the set of all open sets in $\Omega$ is called the \term{Borel $\sigma$-algebra on $\Omega$} and denoted $\calB(\Omega)$. 64 | %Its members, called the \term{Borel sets} of $\Omega$, are all the open sets, closed sets, and countable unions and intersections of these. 65 | 66 | \section{Measures} 67 | Let $\Omega$ be a non-empty set and $\calM \subseteq \calP(\Omega)$ a $\sigma$-algebra. 68 | The pair $(\Omega, \calM)$ is called a \term{measurable space}, and the elements of $\calM$ are its \term{measurable sets}. 69 | A \term{measure} on $(\Omega, \calM)$ is a function $\mu : \calM \to [0,\infty]$ such that 70 | \begin{enumerate}[(i)] 71 | \item $\mu(\varnothing) = 0$ 72 | \item For any countable collection of disjoint sets $\{E_i\} \subseteq \calM$, 73 | \[\mu\bigg(\bigdotcup_i E_i\bigg) = \sum_i \mu(E_i)\] 74 | \end{enumerate} 75 | The triple $(\Omega, \calM, \mu)$ is called a \term{measure space}. 76 | 77 | The simplest nontrivial example of a measure is the \term{counting measure}, given by 78 | \[E \mapsto \begin{cases} 79 | |E| & \text{$E$ finite} \\ 80 | \infty & \text{otherwise} 81 | \end{cases}\] 82 | 83 | We say that $\mu$ is \term{finite} if $\mu(\Omega) < \infty$, and it is \term{$\sigma$-finite} if $\Omega$ can be written as the union of countably many measurable sets of finite measure. 84 | 85 | A set $E \in \calM$ such that $\mu(E) = 0$ is called a \term{$\mu$-null set}, or usually just a \term{null set} if the measure is clear from context. 86 | A property is said to hold $\mu$-\term{almost everywhere} (often abbreviated a.e.) if the set of points for which it does not hold is $\mu$-null (again, one would usually just write \term{almost everywhere} unless there was ambiguity).\footnote{ 87 | In order for this notion to be interesting we needed to first introduce a measure with nontrivial null sets, so we wait to give an example in the Lebesgue measure section. 
88 | } 89 | %If every subset of every $\mu$-null set is measurable (i.e. also an element of $\calM$), then we say that $\mu$ is \term{complete}. 90 | 91 | We now give some basic properties of measures. 92 | \begin{proposition} 93 | If $E, F \in \calM$ and $E \subseteq F$, then $\mu(E) \leq \mu(F)$. 94 | \end{proposition} 95 | This property is called \term{monotonicity}. 96 | \begin{proof} 97 | Suppose $E, F \in \calM$ and $E \subseteq F$. Then 98 | \[\mu(F) = \mu(E \dotcup (F \setminus E)) = \mu(E) + \mu(F \setminus E) \geq \mu(E)\] 99 | as claimed. 100 | \end{proof} 101 | Note that monotonicity implies 102 | \[0 = \mu(\varnothing) \leq \mu(E) \leq \mu(\Omega)\] 103 | for every $E \in \calM$, since $\varnothing \subseteq E \subseteq \Omega$. 104 | 105 | \begin{proposition} 106 | For any countable collection of sets $\{E_i\} \subseteq \calM$ (disjoint or not), 107 | \[\mu\bigg(\bigcup_i E_i\bigg) \leq \sum_i \mu(E_i)\] 108 | \end{proposition} 109 | This property is called \term{sub-additivity}. 110 | \begin{proof} 111 | Define $F_1 = E_1$ and $F_i = E_i \setminus (\bigcup_{j < i} E_j)$ for $i > 1$, noting that $\bigcup_{j \leq i} F_j = \bigcup_{j \leq i} E_j$ for all $i$ and the $F_i$ are disjoint. 112 | Then 113 | \[\mu\bigg(\bigcup_i E_i\bigg) = \mu\bigg(\bigcup_i F_i\bigg) = \sum_i \mu(F_i) \leq \sum_i \mu(E_i)\] 114 | where the last inequality follows by monotonicity since $F_i \subseteq E_i$ for all $i$. 115 | \end{proof} 116 | 117 | \section{Lebesgue measure} 118 | \term{Lebesgue measure} is the measure that corresponds to our intuitive notion of physical size. 119 | For example, the Lebesgue measure of a measurable subset of $\R$ gives a number interpretable as the set's length. 120 | Lebesgue measure can also be defined in higher dimensions (we omit this generalization), where it represents the area, volume, etc. of the set. 
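Before turning to the construction of Lebesgue measure, the monotonicity and sub-additivity propositions above can be illustrated concretely with the counting measure on finite sets (a hedged Python sketch, using sets in place of a general $\sigma$-algebra):

```python
def counting_measure(E):
    """Counting measure of a finite set: its cardinality."""
    return len(E)

# Monotonicity: E subset of F implies mu(E) <= mu(F).
E, F = {1, 2}, {1, 2, 3, 4}
assert E <= F and counting_measure(E) <= counting_measure(F)

# Sub-additivity: the measure of a union is at most the sum of measures,
# with equality only when the sets are disjoint.
sets = [{1, 2}, {2, 3}, {3, 4}]
union = set().union(*sets)
assert counting_measure(union) <= sum(counting_measure(s) for s in sets)
```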
121 | 122 | The strategy for constructing Lebesgue measure is to define it first on intervals, which have an obvious measure (their length), and then use that definition to define the measure of more complicated sets. 123 | 124 | Let $\calI$ be the set of all intervals (open, closed, or semi-open) on $\R$. 125 | Define $\ell : \calI \to [0,\infty]$ by 126 | \[\ell([a,b]) = b - a\] 127 | with the same definition when $[a,b]$ is replaced by $(a,b)$, $[a,b)$, or $(a,b]$. 128 | For infinite intervals, use the ``obvious'' convention that $\infty-a = \infty$ and $b-(-\infty) = \infty$. 129 | 130 | The key tool in constructing Lebesgue measure is the \term{Lebesgue outer measure} $\lambda^* : \calP(\R) \to [0,\infty]$, which is given by 131 | \[\lambda^*(E) = \inf\left\{\sum_{k=1}^\infty \ell(I_k) : I_k \in \calI, E \subseteq \bigcup_{k=1}^\infty I_k\right\}\] 132 | A set $E \subseteq \R$ is said to be \term{Lebesgue measurable} if for every $A \subseteq \R$, 133 | \[\lambda^*(A) = \lambda^*(A \cap E) + \lambda^*(A \cap E\comp)\] 134 | It turns out that the set of Lebesgue measurable sets is very large and contains pretty much any reasonable set that one would encounter in practice. 135 | However, it is possible\footnote{ 136 | assuming the axiom of choice 137 | } to construct pathological subsets of $\R$ that are not Lebesgue measurable. 138 | 139 | The Lebesgue outer measure and Lebesgue measurable sets have a number of nice properties: 140 | \begin{enumerate}[(i)] 141 | \item The set of Lebesgue measurable sets, denoted $\calL$, is a $\sigma$-algebra. 142 | \item $\lambda^*|_\calL$ is a measure on $\calL$. 143 | \item $\lambda^*|_\calI = \ell$, so the measure agrees with our initial notion of interval length. 144 | \end{enumerate} 145 | Defining the Lebesgue measure $\lambda = \lambda^*|_\calL$, we have a measure space $(\R, \calL, \lambda)$.\footnote{ 146 | One can show that $\lambda$ is the unique measure on $(\R, \calL)$ that extends $\ell$. 
147 | It turns out that uniqueness stems from the fact that $\ell$ is $\sigma$-finite. 148 | } 149 | 150 | \subsubsection{Sets of measure zero} 151 | Consider the following intriguing property of Lebesgue measure. 152 | \begin{proposition} 153 | If $E \subseteq \R$ is countable, then $\lambda(E) = 0$. 154 | \end{proposition} 155 | \begin{proof} 156 | First note that for any $x \in \R$, we have 157 | \[\lambda(\{x\}) = \lambda([x,x]) = x - x = 0\] 158 | Now suppose $E \subseteq \R$ is countable. 159 | Then we can write $E = \bigdotcup_i \{x_i\}$, whence it follows that 160 | \[\lambda(E) = \sum_i \lambda(\{x_i\}) = \sum_i 0 = 0\] 161 | by the countable additivity of measures. 162 | \end{proof} 163 | Specifically, it may be surprising to consider that $\lambda(\Q) = 0$. 164 | It turns out that it's also possible to construct uncountable subsets of $\R$ that have Lebesgue measure zero, e.g. the Cantor set \cite{folland}. 165 | 166 | Now for the promised example of \textit{almost everywhere}: the absolute value function $x \mapsto |x|$ is differentiable almost everywhere, since it is only not differentiable at $x = 0$, and $\lambda(\{0\}) = 0$. 167 | Note that Lebesgue measure is in some sense the ``default'' measure on $\R$, in that if no measure is specified (as in the previous sentence), the author is generally speaking in reference to Lebesgue measure. 168 | 169 | \section{Lebesgue integration} 170 | In this section we consider the problem of defining the integral of functions on an abstract measure space $(\Omega, \calM, \mu)$. 171 | 172 | Just as not all sets are measurable, not all functions are measurable. 173 | A function $f : \Omega \to \R$ is \term{measurable} if 174 | \[\{\omega \in \Omega : f(\omega) \leq x\} \in \calM \tab \forall x \in \R\] 175 | We follow the common approach of defining the Lebesgue integral for increasingly complicated functions in terms of integrals of simpler functions. 
176 | The simplest functions to integrate are the \term{indicator functions}; if $E \in \calM$, then its indicator function is 177 | \[1_E(\omega) = \begin{cases} 178 | 1 & \omega \in E \\ 179 | 0 & \omega \not\in E 180 | \end{cases}\] 181 | The integral of an indicator function is defined as 182 | \[\int_\Omega 1_E\dd{\mu} = \mu(E)\] 183 | From indicator functions we can construct \term{non-negative simple functions}, which are finite linear combinations of indicator functions: 184 | \[\phi = \sum_{i=1}^n \alpha_i1_{E_i}\] 185 | where $\alpha_i \geq 0$ for all $i$. 186 | Here the integral is defined to be 187 | \[\int_\Omega \phi\dd{\mu} = \sum_{i=1}^n \alpha_i \int_\Omega 1_{E_i}\dd{\mu} = \sum_{i=1}^n \alpha_i \mu(E_i)\] 188 | and we use the convention that $0 \cdot \infty = 0$. 189 | Then we can define the integral of an arbitrary non-negative measurable function $f$ as follows: 190 | \[\int_\Omega f\dd{\mu} = \sup\left\{\int_\Omega \phi\dd{\mu} : 0 \leq \phi \leq f, \text{$\phi$ simple}\right\}\] 191 | Finally, we can extend the definition to arbitrary measurable functions by using the decomposition 192 | \[f = f^+ - f^-\] 193 | where 194 | \begin{align*} 195 | f^+(x) &= \max(f(x), 0) \\ 196 | f^-(x) &= \max(-f(x), 0) 197 | \end{align*} 198 | If at least one of $\int_\Omega f^+\dd{\mu}$ and $\int_\Omega f^-\dd{\mu}$ is finite, we define 199 | \[\int_\Omega f\dd{\mu} = \int_\Omega f^+\dd{\mu} - \int_\Omega f^-\dd{\mu}\] 200 | Furthermore if $\int_\Omega |f|\dd{\mu} < \infty$, we say that $f$ is \term{Lebesgue integrable}. 201 | Note that this is a slightly stronger condition than what is required for the previous definition; clearly $|f| = f^+ + f^-$, so $f$ is Lebesgue integrable iff the integrals of both $f^+$ and $f^-$ are finite.\footnote{ 202 | Why is the integral defined even for some functions that are not ``integrable''? 203 | I'm not sure, and would love to know if anyone has more info. 
204 | But all the sources I've consulted agree on these definitions. 205 | } 206 | 207 | \subsection{The Lebesgue integral on $\R$} 208 | We now consider the special case where $\Omega = \R$ and $\mu = \lambda$. 209 | In addition to being a very important special case of the general theory above, this scenario has a geometric interpretation that helps us better understand Lebesgue integration. 210 | 211 | \subsection{Comparison with the Riemann integral} 212 | In a nutshell, the Lebesgue integral is in many ways superior to the Riemann integral. 213 | 214 | First, any function that is Riemann integrable on a bounded interval is also Lebesgue integrable, and the values of the integrals agree. 215 | But there exist functions that are Lebesgue integrable but not Riemann integrable. 216 | For example, consider the rational indicator $1_\Q$ on $[0,1]$. 217 | We know that for the Lebesgue integral, 218 | \[\int_{[0,1]} 1_\Q\dd{\lambda} = \lambda(\Q \cap [0,1]) = 0\] 219 | However, it is easy to check that $1_\Q$ is not Riemann integrable: every non-trivial interval will contain at least one rational number and at least one irrational number, so no matter how the partition is chosen, the lower Darboux sum will be zero and the upper Darboux sum will be one. 220 | 221 | But this is a rather contrived example. 222 | Of more practical importance is the existence of stronger convergence theorems, such as the monotone convergence theorem and dominated convergence theorem. 223 | 224 | Another advantage of the Lebesgue integral, which admittedly is less important for our purposes, is that integration can be defined on spaces other than Euclidean space. 225 | The Riemann integral relies heavily on properties of the real line. 226 | 227 | \section{Probability} 228 | Suppose we have some sort of randomized experiment (e.g. a coin toss, die roll) that has a fixed set of possible outcomes. 229 | This set is called the \term{sample space} and denoted $\Omega$.
230 | 231 | We would like to define probabilities for some \term{events}, which are subsets of $\Omega$. 232 | The set of events is denoted $\calF$ and is required to be a $\sigma$-algebra. 233 | 234 | Then we can define a \term{probability measure} $\pm : \calF \to [0,1]$ which must satisfy $\pr{\Omega} = 1$ in addition to the axioms for general measures. 235 | The triple $(\Omega, \calF, \pm)$ is called a \term{probability space}.\footnote{ 236 | Note that a probability space is simply a measure space in which the measure of the whole space equals 1. 237 | } 238 | 239 | If $\pr{A} = 1$, we say that $A$ occurs \term{almost surely} (often abbreviated a.s.).\footnote{ 240 | This is a probabilist's version of the measure-theoretic term \textit{almost everywhere}. 241 | } 242 | Conversely, if $\pr{A} = 0$, we say that $A$ occurs \term{almost never}. 243 | 244 | From these axioms, a number of useful rules can be derived. 245 | \begin{proposition} 246 | If $A$ is an event, then $\pr{A\comp} = 1 - \pr{A}$. 247 | \end{proposition} 248 | \begin{proof} 249 | Using the countable additivity of $\pm$, we have 250 | \[\pr{A} + \pr{A\comp} = \pr{A \dotcup A\comp} = \pr{\Omega} = 1\] 251 | which proves the result. 252 | \end{proof} 253 | 254 | \begin{proposition} 255 | Let $A$ be an event. Then 256 | \begin{enumerate}[(i)] 257 | \item If $B$ is an event and $B \subseteq A$, then $\pr{B} \leq \pr{A}$. 258 | \item $0 = \pr{\varnothing} \leq \pr{A} \leq \pr{\Omega} = 1$ 259 | \end{enumerate} 260 | \end{proposition} 261 | \begin{proof} 262 | (i) follows immediately from the monotonicity of measures. 263 | For (ii): the middle inequality follows from (i) since $\varnothing \subseteq A \subseteq \Omega$. 264 | We also have $\pr{\varnothing} = 0$ by applying the previous proposition with $A = \Omega$. 265 | \end{proof} 266 | 267 | \begin{proposition} 268 | If $A$ and $B$ are events, then $\pr{A \cup B} = \pr{A} + \pr{B} - \pr{A \cap B}$.
269 | \end{proposition} 270 | \begin{proof} 271 | The key is to break the events up into their various overlapping and non-overlapping parts. 272 | \begin{align*} 273 | \pr{A \cup B} &= \pr{(A \cap B) \dotcup (A \setminus B) \dotcup (B \setminus A)} \\ 274 | &= \pr{A \cap B} + \pr{A \setminus B} + \pr{B \setminus A} \\ 275 | &= \pr{A \cap B} + \pr{A} - \pr{A \cap B} + \pr{B} - \pr{A \cap B} \\ 276 | &= \pr{A} + \pr{B} - \pr{A \cap B} 277 | \end{align*} 278 | \end{proof} 279 | 280 | \begin{proposition} 281 | If $\{A_i\} \subseteq \calF$ is a countable set of events, disjoint or not, then 282 | \[\prbigg{\bigcup_i A_i} \leq \sum_i \pr{A_i}\] 283 | \end{proposition} 284 | This inequality is sometimes referred to as \term{Boole's inequality} or the \term{union bound}. 285 | \begin{proof} 286 | Follows immediately from the sub-additivity of measures. 287 | \end{proof} 288 | 289 | \subsection{Random variables} 290 | Intuitively, a \term{random variable} is some uncertain quantity with an associated probability distribution over the values it can assume. 291 | 292 | Formally, a random variable on a probability space $(\Omega, \calF, \pm)$ is a measurable function $X: \Omega \to \R$.\footnote{ 293 | More generally, the codomain can be any measurable space, but $\R$ is the most common case by far and sufficient for our purposes. 294 | } 295 | 296 | We denote the range of $X$ by $X(\Omega) = \{X(\omega) : \omega \in \Omega\}$. 297 | To give a concrete example (taken from \cite{pitman}), suppose $X$ is the number of heads in two tosses of a fair coin. 298 | The sample space is 299 | \[\Omega = \{hh, tt, ht, th\}\] 300 | and $X$ is determined completely by the outcome $\omega$, i.e. $X = X(\omega)$. 301 | For example, the event $X = 1$ is the set of outcomes $\{ht, th\}$. 302 | 303 | It is common to talk about the values of a random variable without directly referencing its sample space. 
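The two-coin example lends itself to a direct computation. A hedged Python sketch (enumerating the finite sample space, with exact arithmetic via the standard-library `fractions` module; the helper `prob` is our own illustration of computing event probabilities through preimages):

```python
from fractions import Fraction

# Sample space for two tosses of a fair coin; each outcome has probability 1/4.
omega = ["hh", "ht", "th", "tt"]
P = {w: Fraction(1, 4) for w in omega}

def X(w):
    """The random variable: number of heads in the outcome w."""
    return w.count("h")

def prob(S):
    """P(X in S), computed through the preimage {w : X(w) in S}."""
    return sum(P[w] for w in omega if X(w) in S)

assert prob({1}) == Fraction(1, 2)   # the event X = 1 is {ht, th}
assert prob({0, 1, 2}) == 1          # X surely takes some value
```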
304 | The two are related by the following definition: the event that the value of $X$ lies in some set $S \subseteq \R$ is 305 | \[X \in S = X\inv(S) = \{\omega \in \Omega : X(\omega) \in S\}\] 306 | Here the $X\inv$ notation means the preimage of $S$ under $X$, not the inverse of $X$. 307 | 308 | Note that special cases of this definition include $X$ being equal to, less than, or greater than some specified value. 309 | For example 310 | \[\pr{X = x} = \pr{X\inv(\{x\})} = \pr{\{\omega \in \Omega : X(\omega) = x\}}\] 311 | 312 | \subsubsection{The cumulative distribution function} 313 | The \term{cumulative distribution function} (c.d.f.) gives the probability that a random variable is at most a certain value: 314 | \[F(x) = \pr{X \leq x}\] 315 | The c.d.f. can be used to give the probability that a variable lies within a certain range: 316 | \[\pr{a < X \leq b} = F(b) - F(a)\] 317 | 318 | \subsubsection{Discrete random variables} 319 | A \term{discrete random variable} is a random variable that has a countable range and assumes each value in this range with positive probability. 320 | Discrete random variables are completely specified by their \term{probability mass function} (p.m.f.) $p : X(\Omega) \to [0,1]$ which satisfies 321 | \[\sum_x p(x) = 1\] 322 | For a discrete $X$, the probability of a particular value is given exactly by its p.m.f.: 323 | \[\pr{X = x} = p(x)\] 324 | In fact, any nonnegative function that sums to one over a countable domain induces a discrete probability space. 325 | \begin{proposition} 326 | Suppose $\Omega$ is a non-empty countable set and $p : \Omega \to [0,1]$ is such that $\sum_{\omega \in \Omega} p(\omega) = 1$. 327 | Let $\calF = \calP(\Omega)$ and 328 | \[\pr{A} = \sum_{\omega \in A} p(\omega)\] 329 | for any event $A \in \calF$. 330 | Then 331 | \begin{enumerate}[(i)] 332 | \item $(\Omega, \calF, \pm)$ is a probability space. 
333 | \item If $S \subset \R$ with $|S| = |\Omega|$, then any bijection $X : \Omega \to S$ is a random variable on this space with probability mass function $p \circ X\inv$. 334 | \end{enumerate} 335 | \end{proposition} 336 | \begin{proof} 337 | $\calF$ is clearly a $\sigma$-algebra since it contains every subset of $\Omega$ and thus is closed under all complements and unions. 338 | Thus all that must be shown is that $\pm$ is a probability measure. 339 | We have $\pr{\Omega} = \sum_{\omega \in \Omega} p(\omega) = 1$ immediately by assumption. 340 | To show countable additivity, we see that if $\{A_i\} \subseteq \calF$ are disjoint, then 341 | \[\prbigg{\bigdotcup_i A_i} = \sum_{\omega \in \bigdotcup_i A_i} p(\omega) = \sum_i \sum_{\omega \in A_i} p(\omega) = \sum_i \pr{A_i}\] 342 | which proves (i). 343 | 344 | To show (ii), suppose $S \subset \R$ with $|S| = |\Omega|$ and let $X : \Omega \to S$ be a bijection. 345 | It is clear that $X$ is measurable, again because $\calF$ contains every subset of $\Omega$. 346 | We also have for any $x \in S$, 347 | \[\pr{X = x} = \pr{X\inv(\{x\})} = \pr{\{X\inv(x)\}} = p(X\inv(x)) = (p \circ X\inv)(x)\] 348 | so $p \circ X\inv$ is the probability mass function of $X$. 349 | \end{proof} 350 | 351 | \subsubsection{Continuous random variables} 352 | A \term{continuous random variable} is a random variable that has an uncountable range and assumes each value in this range with probability zero. 353 | Most of the continuous random variables that one would encounter in practice are \term{absolutely continuous random variables}\footnote{ 354 | Random variables that are continuous but not absolutely continuous are called \term{singular random variables}. 355 | We will not discuss them, assuming rather that all continuous random variables admit a density function. 
356 | }, which means that there exists a function $p : \R \to [0,\infty)$ that satisfies 357 | \[F(x) = \int_{-\infty}^x p(z)\dd{z}\] 358 | The function $p$ is called a \term{probability density function} (abbreviated p.d.f.) and must satisfy 359 | \[\int_{-\infty}^\infty p(x)\dd{x} = 1\] 360 | The values of this function are not themselves probabilities, since they could exceed 1. 361 | However, they do have a couple of reasonable interpretations. 362 | One is as relative probabilities; even though the probability of each particular value being picked is technically zero, some points are still in a sense more likely than others. 363 | 364 | One can also think of the density as determining the probability that the variable will lie in a small range about a given value. 365 | Recall that for small $\epsilon$, 366 | \[\pr{x-\nicefrac{\epsilon}{2} \leq X \leq x+\nicefrac{\epsilon}{2}} = \int_{x-\nicefrac{\epsilon}{2}}^{x+\nicefrac{\epsilon}{2}} p(z)\dd{z} \approx \epsilon p(x)\] 367 | using a midpoint approximation to the integral. 368 | 369 | Here are some useful identities that follow from the definitions above: 370 | \begin{align*} 371 | \pr{a \leq X \leq b} &= \int_a^b p(x)\dd{x} \\ 372 | p(x) &= F'(x) 373 | \end{align*} 374 | 375 | \subsubsection{Other kinds of random variables} 376 | There are random variables that are neither discrete nor continuous. 377 | For example, consider a random variable determined as follows: 378 | flip a fair coin, then the value is zero if it comes up heads, otherwise draw a number uniformly at random from $[1,2]$. 379 | Such a random variable can take on uncountably many values, but only finitely many of these with positive probability. 380 | We will not discuss such random variables. 
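The identities above, and the midpoint approximation, can be verified numerically for a standard normal random variable (a hedged Python sketch; we use the closed-form standard normal c.d.f. in terms of the error function `math.erf`, which is not derived in these notes):

```python
import math

def pdf(x):
    """Standard normal density p(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard normal c.d.f. F(x), via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# P(a < X <= b) = F(b) - F(a): about 68.27% of mass within one std. dev.
a, b = -1.0, 1.0
assert abs((cdf(b) - cdf(a)) - 0.6827) < 1e-3

# Midpoint approximation: P(x - eps/2 <= X <= x + eps/2) ~ eps * p(x).
x, eps = 0.5, 1e-3
exact = cdf(x + eps / 2) - cdf(x - eps / 2)
assert abs(exact - eps * pdf(x)) < 1e-8
```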
381 | 382 | \bibliography{measure-probability} 383 | \addcontentsline{toc}{section}{References} 384 | \bibliographystyle{ieeetr} 385 | \nocite{*} 386 | \end{document} 387 | -------------------------------------------------------------------------------- /cs189-linalg.tex: -------------------------------------------------------------------------------- 1 | In this section we present important classes of spaces in which our data will live and our operations will take place: vector spaces, metric spaces, normed spaces, and inner product spaces. 2 | Generally speaking, these are defined in such a way as to capture one or more important properties of Euclidean space but in a more general way. 3 | 4 | \subsection{Vector spaces} 5 | \term{Vector spaces} are the basic setting in which linear algebra happens. 6 | A vector space $V$ is a set (the elements of which are called \term{vectors}) on which two operations are defined: vectors can be added together, and vectors can be multiplied by real numbers\footnote{ 7 | More generally, vector spaces can be defined over any \term{field} $\F$. 8 | We take $\F = \R$ in this document to avoid an unnecessary diversion into abstract algebra. 9 | } called \term{scalars}. 
10 | $V$ must satisfy 11 | \begin{enumerate}[(i)] 12 | \item There exists an additive identity (written $\vec{0}$) in $V$ such that $\x+\vec{0} = \x$ for all $\x \in V$ 13 | \item For each $\x \in V$, there exists an additive inverse (written $\vec{-x}$) such that $\x+(\vec{-x}) = \vec{0}$ 14 | \item There exists a multiplicative identity (written $1$) in $\R$ such that $1\x = \x$ for all $\x \in V$ 15 | \item Commutativity: $\x+\y = \y+\x$ for all $\x, \y \in V$ 16 | \item Associativity: $(\x+\y)+\vec{z} = \x+(\y+\vec{z})$ and $\alpha(\beta\x) = (\alpha\beta)\x$ for all $\x, \y, \vec{z} \in V$ and $\alpha, \beta \in \R$ 17 | \item Distributivity: $\alpha(\x+\y) = \alpha\x + \alpha\y$ and $(\alpha+\beta)\x = \alpha\x + \beta\x$ for all $\x, \y \in V$ and $\alpha, \beta \in \R$ 18 | \end{enumerate} 19 | 20 | \subsubsection{Euclidean space} 21 | The quintessential vector space is \term{Euclidean space}, which we denote $\R^n$. 22 | The vectors in this space consist of $n$-tuples of real numbers: 23 | \[\x = (x_1, x_2, \dots, x_n)\] 24 | For our purposes, it will be useful to think of them as $n \times 1$ matrices, or \term{column vectors}: 25 | \[\x = \matlit{x_1 \\ x_2 \\ \vdots \\ x_n}\] 26 | Addition and scalar multiplication are defined component-wise on vectors in $\R^n$: 27 | \[\x + \y = \matlit{x_1 + y_1 \\ \vdots \\ x_n + y_n}, \tab \alpha\x = \matlit{\alpha x_1 \\ \vdots \\ \alpha x_n}\] 28 | Euclidean space is used to mathematically represent physical space, with notions such as distance, length, and angles. 29 | Although it becomes hard to visualize for $n > 3$, these concepts generalize mathematically in obvious ways. 30 | Even when you're working in more general settings than $\R^n$, it is often useful to visualize vector addition and scalar multiplication in terms of 2D vectors in the plane or 3D vectors in space. 31 | 32 | \subsubsection{Subspaces} 33 | Vector spaces can contain other vector spaces. 
34 | If $V$ is a vector space, then $S \subseteq V$ is said to be a \term{subspace} of $V$ if 35 | \begin{enumerate}[(i)] 36 | \item $\vec{0} \in S$ 37 | \item $S$ is closed under addition: $\x, \y \in S$ implies $\x+\y \in S$ 38 | \item $S$ is closed under scalar multiplication: $\x \in S, \alpha \in \R$ implies $\alpha\x \in S$ 39 | \end{enumerate} 40 | Note that $V$ is always a subspace of $V$, as is the trivial vector space which contains only $\vec{0}$. 41 | 42 | As a concrete example, a line passing through the origin is a subspace of Euclidean space. 43 | 44 | Some of the most important subspaces are those induced by linear maps. 45 | If $T : V \to W$ is a linear map, we define the \term{nullspace}\footnote{ 46 | It is sometimes called the \term{kernel} by algebraists, but we eschew this terminology because the word ``kernel'' has another meaning in machine learning. 47 | } of $T$ as 48 | \[\Null(T) = \{\x \in V \mid T\x = \vec{0}\}\] 49 | and the \term{range} (or the \term{columnspace} if we are considering the matrix form) of $T$ as 50 | \[\range(T) = \{\y \in W \mid \text{$\exists \x \in V$ such that $T\x = \y$}\}\] 51 | It is a good exercise to verify that the nullspace and range of a linear map are always subspaces of its domain and codomain, respectively. 52 | 53 | \subsection{Metric spaces} 54 | Metrics generalize the notion of distance from Euclidean space (although metric spaces need not be vector spaces). 55 | 56 | A \term{metric} on a set $S$ is a function $d : S \times S \to \R$ that satisfies 57 | \begin{enumerate}[(i)] 58 | \item $d(x,y) \geq 0$, with equality if and only if $x = y$ 59 | \item $d(x,y) = d(y,x)$ 60 | \item $d(x,z) \leq d(x,y) + d(y,z)$ (the so-called \term{triangle inequality}) 61 | \end{enumerate} 62 | for all $x, y, z \in S$. 63 | 64 | A key motivation for metrics is that they allow limits to be defined for mathematical objects other than real numbers. 
65 | We say that a sequence $\{x_n\} \subseteq S$ converges to the limit $x$ if for any $\epsilon > 0$, there exists $N \in \N$ such that $d(x_n, x) < \epsilon$ for all $n \geq N$. 66 | Note that the definition for limits of sequences of real numbers, which you have likely seen in a calculus class, is a special case of this definition when using the metric $d(x, y) = |x-y|$. 67 | 68 | \subsection{Normed spaces} 69 | Norms generalize the notion of length from Euclidean space. 70 | 71 | A \term{norm} on a real vector space $V$ is a function $\|\cdot\| : V \to \R$ that satisfies 72 | \begin{enumerate}[(i)] 73 | \item $\|\x\| \geq 0$, with equality if and only if $\x = \vec{0}$ 74 | \item $\|\alpha\x\| = |\alpha|\|\x\|$ 75 | \item $\|\x+\y\| \leq \|\x\| + \|\y\|$ (the \term{triangle inequality} again) 76 | \end{enumerate} 77 | for all $\x, \y \in V$ and all $\alpha \in \R$. 78 | A vector space endowed with a norm is called a \term{normed vector space}, or simply a \term{normed space}. 79 | 80 | Note that any norm on $V$ induces a distance metric on $V$: 81 | \[d(\x, \y) = \|\x-\y\|\] 82 | One can verify that the axioms for metrics are satisfied under this definition and follow directly from the axioms for norms. 83 | Therefore any normed space is also a metric space.\footnote{ 84 | If a normed space is complete with respect to the distance metric induced by its norm, we say that it is a \term{Banach space}. 85 | } 86 | 87 | We will typically only be concerned with a few specific norms on $\R^n$: 88 | \begin{align*} 89 | \|\x\|_1 &= \sum_{i=1}^n |x_i| \\ 90 | \|\x\|_2 &= \sqrt{\sum_{i=1}^n x_i^2} \\ 91 | \|\x\|_p &= \left(\sum_{i=1}^n |x_i|^p\right)^\frac{1}{p} \tab\tab (p \geq 1) \\ 92 | \|\x\|_\infty &= \max_{1 \leq i \leq n} |x_i| 93 | \end{align*} 94 | Note that the 1- and 2-norms are special cases of the $p$-norm, and the $\infty$-norm is the limit of the $p$-norm as $p$ tends to infinity. 
95 | We require $p \geq 1$ for the general definition of the $p$-norm because the triangle inequality fails to hold if $p < 1$. 96 | (Try to find a counterexample!) 97 | 98 | Here's a fun fact: for any given finite-dimensional vector space $V$, all norms on $V$ are equivalent in the sense that for two norms $\|\cdot\|_A, \|\cdot\|_B$, there exist constants $\alpha, \beta > 0$ such that 99 | \[\alpha\|\x\|_A \leq \|\x\|_B \leq \beta\|\x\|_A\] 100 | for all $\x \in V$. Therefore convergence in one norm implies convergence in any other norm. 101 | This rule may not apply in infinite-dimensional vector spaces such as function spaces, though. 102 | 103 | \subsection{Inner product spaces} 104 | An \term{inner product} on a real vector space $V$ is a function $\inner{\cdot}{\cdot} : V \times V \to \R$ satisfying 105 | \begin{enumerate}[(i)] 106 | \item $\inner{\x}{\x} \geq 0$, with equality if and only if $\x = \vec{0}$ 107 | \item $\inner{\alpha\x + \beta\y}{\vec{z}} = \alpha\inner{\x}{\vec{z}} + \beta\inner{\y}{\vec{z}}$ 108 | \item $\inner{\x}{\y} = \inner{\y}{\x}$ 109 | \end{enumerate} 110 | for all $\x, \y, \vec{z} \in V$ and all $\alpha,\beta \in \R$. 111 | A vector space endowed with an inner product is called an \term{inner product space}. 112 | 113 | Note that any inner product on $V$ induces a norm on $V$: 114 | \[\|\x\| = \sqrt{\inner{\x}{\x}}\] 115 | One can verify that the axioms for norms are satisfied under this definition and follow (almost) directly from the axioms for inner products. 116 | Therefore any inner product space is also a normed space (and hence also a metric space).\footnote{ 117 | If an inner product space is complete with respect to the distance metric induced by its inner product, we say that it is a \term{Hilbert space}. 118 | } 119 | 120 | Two vectors $\x$ and $\y$ are said to be \term{orthogonal} if $\inner{\x}{\y} = 0$; we write $\x \perp \y$ for shorthand. 
121 | Orthogonality generalizes the notion of perpendicularity from Euclidean space. 122 | If two orthogonal vectors $\x$ and $\y$ additionally have unit length (i.e. $\|\x\| = \|\y\| = 1$), then they are described as \term{orthonormal}. 123 | 124 | The standard inner product on $\R^n$ is given by 125 | \[\inner{\x}{\y} = \sum_{i=1}^n x_iy_i = \x\tran\y\] 126 | The matrix notation on the righthand side (see the Transposition section if it's unfamiliar) arises because this inner product is a special case of matrix multiplication where we regard the resulting $1 \times 1$ matrix as a scalar. 127 | The inner product on $\R^n$ is also often written $\x\cdot\y$ (hence the alternate name \term{dot product}). 128 | The reader can verify that the two-norm $\|\cdot\|_2$ on $\R^n$ is induced by this inner product. 129 | 130 | \subsubsection{Pythagorean Theorem} 131 | The well-known Pythagorean theorem generalizes naturally to arbitrary inner product spaces. 132 | \begin{theorem} 133 | If $\x \perp \y$, then 134 | \[\|\x+\y\|^2 = \|\x\|^2 + \|\y\|^2\] 135 | \end{theorem} 136 | \begin{proof} 137 | Suppose $\x \perp \y$, i.e. $\inner{\x}{\y} = 0$. Then 138 | \[\|\x+\y\|^2 = \inner{\x+\y}{\x+\y} = \inner{\x}{\x} + \inner{\y}{\x} + \inner{\x}{\y} + \inner{\y}{\y} = \|\x\|^2 + \|\y\|^2\] 139 | as claimed. 140 | \end{proof} 141 | 142 | \subsubsection{Cauchy-Schwarz inequality} 143 | This inequality is sometimes useful in proving bounds: 144 | \[|\inner{\x}{\y}| \leq \|\x\| \cdot \|\y\|\] 145 | for all $\x, \y \in V$. Equality holds exactly when $\x$ and $\y$ are scalar multiples of each other (or equivalently, when they are linearly dependent). 146 | 147 | \subsection{Transposition} 148 | If $\A \in \R^{m \times n}$, its \term{transpose} $\A\tran \in \R^{n \times m}$ is given by $(\A\tran)_{ij} = A_{ji}$ for each $(i, j)$. 149 | In other words, the columns of $\A$ become the rows of $\A\tran$, and the rows of $\A$ become the columns of $\A\tran$. 
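A small example makes the definition concrete: transposing a $2 \times 3$ matrix yields a $3 \times 2$ matrix,
\[\matlit{1 & 2 & 3 \\ 4 & 5 & 6}\tran = \matlit{1 & 4 \\ 2 & 5 \\ 3 & 6}\]
so the first row $(1, 2, 3)$ of the original matrix becomes the first column of its transpose.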
150 | 151 | The transpose has several nice algebraic properties that can be easily verified from the definition: 152 | \begin{enumerate}[(i)] 153 | \item $(\A\tran)\tran = \A$ 154 | \item $(\A+\mat{B})\tran = \A\tran + \mat{B}\tran$ 155 | \item $(\alpha \A)\tran = \alpha \A\tran$ 156 | \item $(\A\mat{B})\tran = \mat{B}\tran \A\tran$ 157 | \end{enumerate} 158 | 159 | \subsection{Eigenthings} 160 | For a square matrix $\A \in \R^{n \times n}$, there may be vectors which, when $\A$ is applied to them, are simply scaled by some constant. 161 | We say that a nonzero vector $\x \in \R^n$ is an \term{eigenvector} of $\A$ corresponding to \term{eigenvalue} $\lambda$ if 162 | \[\A\x = \lambda\x\] 163 | The zero vector is excluded from this definition because $\A\vec{0} = \vec{0} = \lambda\vec{0}$ for every $\lambda$. 164 | 165 | We now give some useful results about how eigenvalues change after various manipulations. 166 | \begin{proposition} 167 | Let $\x$ be an eigenvector of $\A$ with corresponding eigenvalue $\lambda$. 168 | Then 169 | \begin{enumerate}[(i)] 170 | \item For any $\gamma \in \R$, $\x$ is an eigenvector of $\A + \gamma\I$ with eigenvalue $\lambda + \gamma$. 171 | \item If $\A$ is invertible, then $\x$ is an eigenvector of $\A\inv$ with eigenvalue $\lambda\inv$. 172 | \item $\A^k\x = \lambda^k\x$ for any $k \in \Z$ (where $\A^0 = \I$ by definition). 173 | \end{enumerate} 174 | \end{proposition} 175 | \begin{proof} 176 | (i) follows readily: 177 | \[(\A + \gamma\I)\x = \A\x + \gamma\I\x = \lambda\x + \gamma\x = (\lambda + \gamma)\x\] 178 | 179 | (ii) Suppose $\A$ is invertible. Then 180 | \[\x = \A\inv\A\x = \A\inv(\lambda\x) = \lambda\A\inv\x\] 181 | Dividing by $\lambda$, which is valid because the invertibility of $\A$ implies $\lambda \neq 0$, gives $\lambda\inv\x = \A\inv\x$. 182 | 183 | (iii) The case $k \geq 0$ follows immediately by induction on $k$. 184 | Then the general case $k \in \Z$ follows by combining the $k \geq 0$ case with (ii). 
185 | \end{proof} 186 | 187 | \subsection{Trace} 188 | The \term{trace} of a square matrix is the sum of its diagonal entries: 189 | \[\tr(\A) = \sum_{i=1}^n A_{ii}\] 190 | The trace has several nice algebraic properties: 191 | \begin{enumerate}[(i)] 192 | \item $\tr(\A+\mat{B}) = \tr(\A) + \tr(\mat{B})$ 193 | \item $\tr(\alpha\A) = \alpha\tr(\A)$ 194 | \item $\tr(\A\tran) = \tr(\A)$ 195 | \item $\tr(\A\mat{B}\mat{C}\mat{D}) = \tr(\mat{B}\mat{C}\mat{D}\A) = \tr(\mat{C}\mat{D}\A\mat{B}) = \tr(\mat{D}\A\mat{B}\mat{C})$ 196 | \end{enumerate} 197 | The first three properties follow readily from the definition. 198 | The last is known as \term{invariance under cyclic permutations}. 199 | Note that the matrices cannot be reordered arbitrarily, for example $\tr(\A\mat{B}\mat{C}\mat{D}) \neq \tr(\mat{B}\A\mat{C}\mat{D})$ in general. 200 | Also, there is nothing special about the product of four matrices -- analogous rules hold for more or fewer matrices. 201 | 202 | Interestingly, the trace of a matrix is equal to the sum of its eigenvalues (repeated according to multiplicity): 203 | \[\tr(\A) = \sum_i \lambda_i(\A)\] 204 | 205 | \subsection{Determinant} 206 | The \term{determinant} of a square matrix can be defined in several different confusing ways, none of which are particularly important for our purposes; go look at an introductory linear algebra text (or Wikipedia) if you need a definition. 
207 | But it's good to know the properties: 208 | \begin{enumerate}[(i)] 209 | \item $\det(\I) = 1$ 210 | \item $\det(\A\tran) = \det(\A)$ 211 | \item $\det(\A\mat{B}) = \det(\A)\det(\mat{B})$ 212 | \item $\det(\A\inv) = \det(\A)\inv$ 213 | \item $\det(\alpha\A) = \alpha^n \det(\A)$ 214 | \end{enumerate} 215 | Interestingly, the determinant of a matrix is equal to the product of its eigenvalues (repeated according to multiplicity): 216 | \[\det(\A) = \prod_i \lambda_i(\A)\] 217 | 218 | \subsection{Orthogonal matrices} 219 | A matrix $\mat{Q} \in \R^{n \times n}$ is said to be \term{orthogonal} if its columns are pairwise orthonormal. 220 | This definition implies that 221 | \[\mat{Q}\tran \mat{Q} = \mat{Q}\mat{Q}\tran = \I\] 222 | or equivalently, $\mat{Q}\tran = \mat{Q}\inv$. A nice thing about orthogonal matrices is that they preserve inner products: 223 | \[(\mat{Q}\x)\tran(\mat{Q}\y) = \x\tran \mat{Q}\tran \mat{Q}\y = \x\tran \I\y = \x\tran\y\] 224 | A direct result of this fact is that they also preserve 2-norms: 225 | \[\|\mat{Q}\x\|_2 = \sqrt{(\mat{Q}\x)\tran(\mat{Q}\x)} = \sqrt{\x\tran\x} = \|\x\|_2\] 226 | Therefore multiplication by an orthogonal matrix can be considered as a transformation that preserves length, but may rotate or reflect the vector about the origin. 227 | 228 | \subsection{Symmetric matrices} 229 | A matrix $\A \in \R^{n \times n}$ is said to be \term{symmetric} if it is equal to its own transpose ($\A = \A\tran$), meaning that $A_{ij} = A_{ji}$ for all $(i,j)$. 230 | This definition seems harmless enough but turns out to have some strong implications. 231 | We summarize the most important of these as 232 | \begin{theorem} 233 | (Spectral Theorem) 234 | If $\A \in \R^{n \times n}$ is symmetric, then there exists an orthonormal basis for $\R^n$ consisting of eigenvectors of $\A$. 
235 | \end{theorem} 236 | The practical application of this theorem is a particular factorization of symmetric matrices, referred to as the \term{eigendecomposition} or \term{spectral decomposition}. 237 | Denote the orthonormal basis of eigenvectors $\q_1, \dots, \q_n$ and their eigenvalues $\lambda_1, \dots, \lambda_n$. 238 | Let $\mat{Q}$ be an orthogonal matrix with $\q_1, \dots, \q_n$ as its columns, and $\mat{\Lambda} = \diag(\lambda_1, \dots, \lambda_n)$. 239 | Since by definition $\A\q_i = \lambda_i\q_i$ for every $i$, the following relationship holds: 240 | \[\A\mat{Q} = \mat{Q}\mat{\Lambda}\] 241 | Right-multiplying by $\mat{Q}\tran$, we arrive at the decomposition 242 | \[\A = \mat{Q}\mat{\Lambda}\mat{Q}\tran\] 243 | 244 | \subsubsection{Rayleigh quotients} 245 | Let $\A \in \R^{n \times n}$ be a symmetric matrix. 246 | The expression $\x\tran\A\x$ is called a \term{quadratic form}. 247 | 248 | There turns out to be an interesting connection between the quadratic form of a symmetric matrix and its eigenvalues. 249 | This connection is provided by the \term{Rayleigh quotient} 250 | \[R_\A(\x) = \frac{\x\tran\A\x}{\x\tran\x}\] 251 | The Rayleigh quotient has a couple of important properties which the reader can (and should!) easily verify from the definition: 252 | \begin{enumerate}[(i)] 253 | \item \term{Scale invariance}: for any vector $\x \neq \vec{0}$ and any scalar $\alpha \neq 0$, $R_\A(\x) = R_\A(\alpha\x)$. 254 | \item If $\x$ is an eigenvector of $\A$ with eigenvalue $\lambda$, then $R_\A(\x) = \lambda$. 255 | \end{enumerate} 256 | We can further show that the Rayleigh quotient is bounded by the largest and smallest eigenvalues of $\A$. 257 | But first we will show a useful special case of the final result. 258 | \begin{proposition} 259 | For any $\x$ such that $\|\x\|_2 = 1$, 260 | \[\lambda_{\min}(\A) \leq \x\tran\A\x \leq \lambda_{\max}(\A)\] 261 | with equality if and only if $\x$ is a corresponding eigenvector. 
262 | \end{proposition} 263 | \begin{proof} 264 | We show only the $\max$ case because the argument for the $\min$ case is entirely analogous. 265 | 266 | Since $\A$ is symmetric, we can decompose it as $\A = \mat{Q}\mat{\Lambda}\mat{Q}\tran$. 267 | Then use the change of variable $\y = \mat{Q}\tran\x$, noting that the relationship between $\x$ and $\y$ is one-to-one and that $\|\y\|_2 = 1$ since $\mat{Q}$ is orthogonal. 268 | Hence 269 | \[\max_{\|\x\|_2 = 1} \x\tran\A\x = \max_{\|\y\|_2 = 1} \y\tran\mat{\Lambda}\y = \max_{y_1^2+\dots+y_n^2=1} \sum_{i=1}^n \lambda_i y_i^2\] 270 | Written this way, it is clear that $\y$ maximizes this expression if and only if it satisfies $\sum_{i \in I} y_i^2 = 1$ where $I = \{i : \lambda_i = \max_{j=1,\dots,n} \lambda_j = \lambda_{\max}(\A)\}$ and $y_j = 0$ for $j \not\in I$. 271 | That is, $I$ contains the index or indices of the largest eigenvalue. 272 | In this case, the maximal value of the expression is 273 | \[\sum_{i=1}^n \lambda_i y_i^2 = \sum_{i \in I} \lambda_i y_i^2 = \lambda_{\max}(\A) \sum_{i \in I} y_i^2 = \lambda_{\max}(\A)\] 274 | Then writing $\q_1, \dots, \q_n$ for the columns of $\mat{Q}$, we have 275 | \[\x = \mat{Q}\mat{Q}\tran\x = \mat{Q}\y = \sum_{i=1}^n y_i\q_i = \sum_{i \in I} y_i\q_i\] 276 | where we have used the matrix-vector product identity. 277 | 278 | Recall that $\q_1, \dots, \q_n$ are eigenvectors of $\A$ and form an orthonormal basis for $\R^n$. 279 | Therefore by construction, the set $\{\q_i : i \in I\}$ forms an orthonormal basis for the eigenspace of $\lambda_{\max}(\A)$. 280 | Hence $\x$, which is a linear combination of these, lies in that eigenspace and thus is an eigenvector of $\A$ corresponding to $\lambda_{\max}(\A)$. 281 | 282 | We have shown that $\max_{\|\x\|_2 = 1} \x\tran\A\x = \lambda_{\max}(\A)$, from which we have the general inequality $\x\tran\A\x \leq \lambda_{\max}(\A)$ for all unit-length $\x$.
283 | \end{proof} 284 | By the scale invariance of the Rayleigh quotient, we immediately have as a corollary (since $\x\tran\A\x = R_{\A}(\x)$ for unit $\x$) 285 | \begin{theorem} 286 | (Min-max theorem) 287 | For all $\x \neq \vec{0}$, 288 | \[\lambda_{\min}(\A) \leq R_\A(\x) \leq \lambda_{\max}(\A)\] 289 | with equality if and only if $\x$ is a corresponding eigenvector. 290 | \end{theorem} 291 | 292 | \subsection{Positive (semi-)definite matrices} 293 | A symmetric matrix $\A$ is \term{positive semi-definite} if for all $\x \in \R^n$, $\x\tran\A\x \geq 0$. 294 | Sometimes people write $\A \succeq 0$ to indicate that $\A$ is positive semi-definite. 295 | 296 | A symmetric matrix $\A$ is \term{positive definite} if for all nonzero $\x \in \R^n$, $\x\tran\A\x > 0$. 297 | Sometimes people write $\A \succ 0$ to indicate that $\A$ is positive definite. 298 | Note that positive definiteness is a strictly stronger property than positive semi-definiteness, in the sense that every positive definite matrix is positive semi-definite but not vice-versa. 299 | 300 | These properties are related to eigenvalues in the following way. 301 | \begin{proposition} 302 | A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are nonnegative, and positive definite if and only if all of its eigenvalues are positive. 303 | \end{proposition} 304 | \begin{proof} 305 | Suppose $\A$ is positive semi-definite, and let $\x$ be an eigenvector of $\A$ with eigenvalue $\lambda$. 306 | Then 307 | \[0 \leq \x\tran\A\x = \x\tran(\lambda\x) = \lambda\x\tran\x = \lambda\|\x\|_2^2\] 308 | Since $\x \neq \vec{0}$ (by the assumption that it is an eigenvector), we have $\|\x\|_2^2 > 0$, so we can divide both sides by $\|\x\|_2^2$ to arrive at $\lambda \geq 0$. 309 | If $\A$ is positive definite, the inequality above holds strictly, so $\lambda > 0$. 310 | This proves one direction.
311 | 312 | To simplify the proof of the other direction, we will use the machinery of Rayleigh quotients. 313 | Suppose that $\A$ is symmetric and all its eigenvalues are nonnegative. 314 | Then for all $\x \neq \vec{0}$, 315 | \[0 \leq \lambda_{\min}(\A) \leq R_\A(\x)\] 316 | Since $\x\tran\A\x$ matches $R_\A(\x)$ in sign, we conclude that $\A$ is positive semi-definite. 317 | If the eigenvalues of $\A$ are all strictly positive, then $0 < \lambda_{\min}(\A)$, whence it follows that $\A$ is positive definite. 318 | \end{proof} 319 | As an example of how these matrices arise, consider 320 | \begin{proposition} 321 | Suppose $\A \in \R^{m \times n}$. 322 | Then $\A\tran\A$ is positive semi-definite. 323 | If $\Null(\A) = \{\vec{0}\}$, then $\A\tran\A$ is positive definite. 324 | \end{proposition} 325 | \begin{proof} 326 | For any $\x \in \R^n$, 327 | \[\x\tran (\A\tran\A)\x = (\A\x)\tran(\A\x) = \|\A\x\|_2^2 \geq 0\] 328 | so $\A\tran\A$ is positive semi-definite. 329 | 330 | Note that $\|\A\x\|_2^2 = 0$ implies $\|\A\x\|_2 = 0$, which in turn implies $\A\x = \vec{0}$ (recall that this is a property of norms). 331 | If $\Null(\A) = \{\vec{0}\}$, $\A\x = \vec{0}$ implies $\x = \vec{0}$, so $\x\tran (\A\tran\A)\x = 0$ if and only if $\x = \vec{0}$, and thus $\A\tran\A$ is positive definite. 332 | \end{proof} 333 | Positive definite matrices are invertible (since their eigenvalues are nonzero), whereas positive semi-definite matrices might not be. 334 | However, if you already have a positive semi-definite matrix, it is possible to perturb its diagonal slightly to produce a positive definite matrix. 335 | \begin{proposition} 336 | If $\A$ is positive semi-definite and $\epsilon > 0$, then $\A + \epsilon\I$ is positive definite. 
337 | \end{proposition} 338 | \begin{proof} 339 | Assuming $\A$ is positive semi-definite and $\epsilon > 0$, we have for any $\x \neq \vec{0}$ that 340 | \[\x\tran(\A+\epsilon\I)\x = \x\tran\A\x + \epsilon\x\tran\I\x = \underbrace{\x\tran\A\x}_{\geq 0} + \underbrace{\epsilon\|\x\|_2^2}_{> 0} > 0\] 341 | as claimed. 342 | \end{proof} 343 | An obvious but frequently useful consequence of the two propositions we have just shown is that $\A\tran\A + \epsilon\I$ is positive definite (and in particular, invertible) for \textit{any} matrix $\A$ and any $\epsilon > 0$. 344 | 345 | \subsubsection{The geometry of positive definite quadratic forms} 346 | A useful way to understand quadratic forms is by the geometry of their level sets. 347 | A \term{level set} or \term{isocontour} of a function is the set of all inputs such that the function applied to those inputs yields a given output. 348 | Mathematically, the $c$-isocontour of $f$ is $\{\x \in \dom f : f(\x) = c\}$. 349 | 350 | Let us consider the special case $f(\x) = \x\tran\mat{A}\x$ where $\mat{A}$ is a positive definite matrix. 351 | Since $\mat{A}$ is positive definite, it has a unique positive definite square root $\A\halfpow = \mat{Q}\mat{\Lambda}\halfpow\mat{Q}\tran$, where $\mat{Q}\mat{\Lambda}\mat{Q}\tran$ is the eigendecomposition of $\A$ and $\mat{\Lambda}\halfpow = \diag(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n})$. 352 | It is easy to see that this matrix $\A\halfpow$ is positive definite (consider its eigenvalues) and satisfies $\A\halfpow\A\halfpow = \A$. 353 | Fixing a value $c \geq 0$, the $c$-isocontour of $f$ is the set of $\x \in \R^n$ such that 354 | \[c = \x\tran\A\x = \x\tran\A\halfpow\A\halfpow\x = \|\A\halfpow\x\|_2^2\] 355 | where we have used the symmetry of $\A\halfpow$. 356 | Making the change of variable $\z = \A\halfpow\x$, we have the condition $\|\z\|_2 = \sqrt{c}$. 357 | That is, the values $\z$ lie on a sphere of radius $\sqrt{c}$.
358 | These can be parameterized as $\z = \sqrt{c}\hat{\z}$ where $\hat{\z}$ has $\|\hat{\z}\|_2 = 1$. 359 | Then since $\A\neghalfpow = \mat{Q}\mat{\Lambda}\neghalfpow\mat{Q}\tran$, we have 360 | \[\x = \A\neghalfpow\z = \mat{Q}\mat{\Lambda}\neghalfpow\mat{Q}\tran\sqrt{c}\hat{\z} = \sqrt{c}\mat{Q}\mat{\Lambda}\neghalfpow\tilde{\z}\] 361 | where $\tilde{\z} = \mat{Q}\tran\hat{\z}$ also satisfies $\|\tilde{\z}\|_2 = 1$ since $\mat{Q}$ is orthogonal. 362 | Using this parameterization, we see that the solution set $\{\x \in \R^n : f(\x) = c\}$ is the image of the unit sphere $\{\tilde{\z} \in \R^n : \|\tilde{\z}\|_2 = 1\}$ under the invertible linear map $\x = \sqrt{c}\mat{Q}\mat{\Lambda}\neghalfpow\tilde{\z}$. 363 | 364 | What we have gained with all these manipulations is a clear algebraic understanding of the $c$-isocontour of $f$ in terms of a sequence of linear transformations applied to a well-understood set. 365 | We begin with the unit sphere, then scale every axis $i$ by $\lambda_i\neghalfpow$ (and the whole sphere uniformly by $\sqrt{c}$), resulting in an axis-aligned ellipsoid. 366 | Observe that the axis lengths of the ellipsoid are proportional to the inverse square roots of the eigenvalues of $\A$. 367 | Hence larger eigenvalues correspond to shorter axis lengths, and vice-versa. 368 | 369 | Then this axis-aligned ellipsoid undergoes a rigid transformation (i.e. one that preserves length and angles, such as a rotation/reflection) given by $\mat{Q}$. 370 | The result of this transformation is that the axes of the ellipsoid are no longer along the coordinate axes in general, but rather along the directions given by the corresponding eigenvectors. 371 | To see this, consider the unit vector $\vec{e}_i \in \R^n$ that has $[\vec{e}_i]_j = \delta_{ij}$. 372 | In the pre-transformed space, this vector points along the axis with length proportional to $\lambda_i\neghalfpow$.
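A small example makes these conventions concrete. Consider the symmetric matrix $\A = \diag(3, -2)$, which has eigenvalues $3$ and $-2$. One valid SVD is
\[\matlit{3 & 0 \\ 0 & -2} = \underbrace{\matlit{1 & 0 \\ 0 & -1}}_{\mat{U}} \underbrace{\matlit{3 & 0 \\ 0 & 2}}_{\mat{\Sigma}} \underbrace{\matlit{1 & 0 \\ 0 & 1}}_{\mat{V}\tran}\]
Note that the singular values $\sigma_1 = 3 \geq \sigma_2 = 2$ are nonnegative even though one of the eigenvalues is negative; the sign has been absorbed into $\mat{U}$. (The SVD is not unique; for instance, the signs of corresponding columns of $\mat{U}$ and $\mat{V}$ can be flipped simultaneously.)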
373 | But after applying the rigid transformation $\mat{Q}$, the resulting vector points in the direction of the corresponding eigenvector $\q_i$, since 374 | \[\mat{Q}\vec{e}_i = \sum_{j=1}^n [\vec{e}_i]_j\q_j = \q_i\] 375 | where we have used the matrix-vector product identity from earlier. 376 | 377 | In summary: the isocontours of $f(\x) = \x\tran\A\x$ are ellipsoids such that the axes point in the directions of the eigenvectors of $\A$, and the radii of these axes are proportional to the inverse square roots of the corresponding eigenvalues. 378 | 379 | \subsection{Singular value decomposition} 380 | Singular value decomposition (SVD) is a widely applicable tool in linear algebra. 381 | Its strength stems partially from the fact that \textit{every matrix} $\A \in \R^{m \times n}$ has an SVD (even non-square matrices)! 382 | The decomposition goes as follows: 383 | \[\A = \mat{U}\mat{\Sigma}\mat{V}\tran\] 384 | where $\mat{U} \in \R^{m \times m}$ and $\mat{V} \in \R^{n \times n}$ are orthogonal matrices and $\mat{\Sigma} \in \R^{m \times n}$ is a diagonal matrix with the \term{singular values} of $\A$ (denoted $\sigma_i$) on its diagonal. 385 | 386 | By convention, the singular values are given in non-increasing order, i.e. 387 | \[\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_{\min(m,n)} \geq 0\] 388 | Only the first $r$ singular values are nonzero, where $r$ is the rank of $\A$. 
389 | 390 | Observe that the SVD factors provide eigendecompositions for $\A\tran\A$ and $\A\A\tran$: 391 | \begin{align*} 392 | \A\tran\A &= (\mat{U}\mat{\Sigma}\mat{V}\tran)\tran\mat{U}\mat{\Sigma}\mat{V}\tran = \mat{V}\mat{\Sigma}\tran\mat{U}\tran\mat{U}\mat{\Sigma}\mat{V}\tran = \mat{V}\mat{\Sigma}\tran\mat{\Sigma}\mat{V}\tran \\ 393 | \A\A\tran &= \mat{U}\mat{\Sigma}\mat{V}\tran(\mat{U}\mat{\Sigma}\mat{V}\tran)\tran = \mat{U}\mat{\Sigma}\mat{V}\tran\mat{V}\mat{\Sigma}\tran\mat{U}\tran = \mat{U}\mat{\Sigma}\mat{\Sigma}\tran\mat{U}\tran 394 | \end{align*} 395 | It follows immediately that the columns of $\mat{V}$ (the \term{right-singular vectors} of $\A$) are eigenvectors of $\A\tran\A$, and the columns of $\mat{U}$ (the \term{left-singular vectors} of $\A$) are eigenvectors of $\A\A\tran$. 396 | 397 | The matrices $\mat{\Sigma}\tran\mat{\Sigma}$ and $\mat{\Sigma}\mat{\Sigma}\tran$ are not necessarily the same size, but both are diagonal with the squared singular values $\sigma_i^2$ on the diagonal (plus possibly some zeros). 398 | Thus the singular values of $\A$ are the square roots of the eigenvalues of $\A\tran\A$ (or equivalently, of $\A\A\tran$)\footnote{ 399 | Recall that $\A\tran\A$ and $\A\A\tran$ are positive semi-definite, so their eigenvalues are nonnegative, and thus taking square roots is always well-defined. 400 | }. 401 | 402 | \subsection{Some useful matrix identities} 403 | \subsubsection{Matrix-vector product as linear combination of matrix columns} 404 | \begin{proposition} 405 | Let $\x \in \R^n$ be a vector and $\A \in \R^{m \times n}$ a matrix with columns $\a_1, \dots, \a_n$. 406 | Then 407 | \[\A\x = \sum_{i=1}^n x_i\a_i\] 408 | \end{proposition} 409 | This identity is extremely useful in understanding linear operators in terms of their matrices' columns. 410 | The proof is very simple (consider each element of $\A\x$ individually and expand by definitions) but it is a good exercise to convince yourself. 
411 | 412 | \subsubsection{Sum of outer products as matrix-matrix product} 413 | An \term{outer product} is an expression of the form $\a\b\tran$, where $\a \in \R^m$ and $\b \in \R^n$. 414 | By inspection it is not hard to see that such an expression yields an $m \times n$ matrix such that 415 | \[[\a\b\tran]_{ij} = a_ib_j\] 416 | It is not immediately obvious, but the sum of outer products is actually equivalent to an appropriate matrix-matrix product! 417 | We formalize this statement as 418 | \begin{proposition} 419 | Let $\a_1, \dots, \a_k \in \R^m$ and $\b_1, \dots, \b_k \in \R^n$. Then 420 | \[\sum_{\ell=1}^k \a_\ell\b_\ell\tran = \mat{A}\mat{B}\tran\] 421 | where 422 | \[\mat{A} = \matlit{\a_1 & \cdots & \a_k}, \tab \mat{B} = \matlit{\b_1 & \cdots & \b_k}\] 423 | \end{proposition} 424 | \begin{proof} 425 | For each $(i,j)$, we have 426 | \[\left[\sum_{\ell=1}^k \a_\ell\b_\ell\tran\right]_{ij} = \sum_{\ell=1}^k [\a_\ell\b_\ell\tran]_{ij} = \sum_{\ell=1}^k [\a_\ell]_i[\b_\ell]_j = \sum_{\ell=1}^k A_{i\ell}B_{j\ell}\] 427 | This last expression should be recognized as an inner product between the $i$th row of $\A$ and the $j$th row of $\mat{B}$, or equivalently the $j$th column of $\mat{B}\tran$. 428 | Hence by the definition of matrix multiplication, it is equal to $[\mat{A}\mat{B}\tran]_{ij}$. 429 | \end{proof} 430 | 431 | \subsubsection{Quadratic forms} 432 | Let $\A \in \R^{n \times n}$ be a symmetric matrix, and recall that the expression $\x\tran\A\x$ is called a quadratic form of $\A$. 433 | It is in some cases helpful to rewrite the quadratic form in terms of the individual elements that make up $\A$ and $\x$: 434 | \[\x\tran\A\x = \sum_{i=1}^n\sum_{j=1}^n A_{ij}x_ix_j\] 435 | This identity is valid for any square matrix (need not be symmetric), although quadratic forms are usually only discussed in the context of symmetric matrices. --------------------------------------------------------------------------------