├── Anaconda_Setup.pdf
├── Conda Setup
│   ├── acp.png
│   ├── p2.png
│   ├── conda-update.png
│   ├── latexmkrc
│   ├── py2tex.py
│   ├── macros-Fall2018.tex
│   └── main.tex
├── Probability Review.pdf
├── Linear Algebra Review.pdf
├── Linux and Git Guide.pdf
├── Probability Review
│   ├── fig1.png
│   ├── fig2.png
│   ├── latexmkrc
│   ├── prob_slides.tex
│   ├── nips07submit_e.sty
│   └── Report.tex
├── Linear Algebra Review
│   ├── figures
│   │   └── figure.png
│   ├── Makefile
│   ├── latexmkrc
│   ├── linalg2.toc
│   └── linalg2.tex
├── Linux Git Guide
│   ├── Linux Git Guide
│   │   ├── Linux and Git Guide.pdf
│   │   ├── latexmkrc
│   │   └── main.tex
│   ├── latexmkrc
│   └── main.tex
├── Useful Links
│   ├── XCS234
│   │   └── README.md
│   └── README.md
├── README.md
└── .gitignore

/Anaconda_Setup.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Anaconda_Setup.pdf
--------------------------------------------------------------------------------
/Conda Setup/acp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Conda Setup/acp.png
--------------------------------------------------------------------------------
/Conda Setup/p2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Conda Setup/p2.png
--------------------------------------------------------------------------------
/Probability Review.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Probability Review.pdf
--------------------------------------------------------------------------------
/Linear Algebra Review.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linear Algebra Review.pdf
--------------------------------------------------------------------------------
/Linux and Git Guide.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linux and Git Guide.pdf
--------------------------------------------------------------------------------
/Probability Review/fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Probability Review/fig1.png
--------------------------------------------------------------------------------
/Probability Review/fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Probability Review/fig2.png
--------------------------------------------------------------------------------
/Conda Setup/conda-update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Conda Setup/conda-update.png
--------------------------------------------------------------------------------
/Linear Algebra Review/figures/figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linear Algebra Review/figures/figure.png
--------------------------------------------------------------------------------
/Linear Algebra Review/Makefile:
-------------------------------------------------------------------------------- 1 | default: 2 | latex linalg2.tex 3 | dvips -o linalg2.ps -t letter -Ppdf -G0 linalg2.dvi 4 | ps2pdf linalg2.ps 5 | #evince linalg.pdf 6 | -------------------------------------------------------------------------------- /Linux Git Guide/Linux Git Guide/Linux and Git Guide.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linux Git Guide/Linux Git Guide/Linux and Git Guide.pdf -------------------------------------------------------------------------------- /Useful Links/XCS234/README.md: -------------------------------------------------------------------------------- 1 | # Research Papers 2 | | Label(s) | Link | Description| 3 | | ----------- | ----------- | ----------- | 4 | | `robotics` | [QT-OPT](https://arxiv.org/abs/1806.10293) | Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation | 5 | 6 | # Code 7 | | Label(s) | Link | Description| 8 | | ----------- | ----------- | ----------- | 9 | 10 | 11 | # Blog Posts 12 | | Label(s) | Link | Description| 13 | | ----------- | ----------- | ----------- | 14 | 15 | 16 | # Talks/Presentations 17 | | Label(s) | Link | Description| 18 | | ----------- | ----------- | ----------- | 19 | 20 | 21 | # Podcasts 22 | | Label(s) | Link | Description| 23 | | ----------- | ----------- | ----------- | 24 | -------------------------------------------------------------------------------- /Conda Setup/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("main.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Anaconda_Setup"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Linux Git Guide/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("main.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Linux and Git Guide"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 
10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Probability Review/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("Report.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Probability Review"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Linear Algebra Review/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("linalg2.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Linear Algebra Review"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Linux Git Guide/Linux Git Guide/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("main.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Linux and Git Guide"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 
10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Conda Setup/py2tex.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse, sys, io, re 3 | 4 | PYTEX_PATTERN = r'(?s)🐍(.*?)🐍' 5 | 6 | def collect_stdout_from_executable(exec_string, local_scope={}, global_scope={}): 7 | old_stdout = sys.stdout 8 | redirected_output = sys.stdout = io.StringIO() 9 | try: 10 | exec(exec_string, global_scope, local_scope) 11 | except: raise 12 | finally: 13 | sys.stdout = old_stdout 14 | return redirected_output.getvalue() 15 | 16 | def pytex_to_tex(pytex): 17 | local_scope, global_scope = {}, {} 18 | tex = re.sub(PYTEX_PATTERN, 19 | lambda match: collect_stdout_from_executable(match.group(1), 20 | local_scope, 21 | global_scope), 22 | pytex) 23 | return tex 24 | 25 | if __name__ == '__main__': 26 | parser = argparse.ArgumentParser(description='Converts .pytex files into .tex files..') 27 | parser.add_argument('infile', help="The source pytex file.") 28 | parser.add_argument('outfile', help="The destination tex file.") 29 | 30 | infile = parser.parse_args().infile 31 | outfile = parser.parse_args().outfile 32 | 33 | # Read the infile 34 | with open(infile, 'r') as in_f: 35 | # Write to the outfile 36 | with open(outfile, 'w') as out_f: 37 | out_f.write(pytex_to_tex(in_f.read())) 38 | 39 | """ 40 | Highlight with the following .sublime-syntax file: 41 | 42 | %YAML 1.2 43 | --- 44 | file_extensions: [pytex] 45 | scope: pytex 46 | 47 | contexts: 48 | main: 49 | - include: Packages/LaTeX/LaTeX.sublime-syntax 50 | - match: 🐍 51 | embed: Packages/Python/Python.sublime-syntax 52 | escape: 🐍 53 | """ -------------------------------------------------------------------------------- /Useful Links/README.md: -------------------------------------------------------------------------------- 1 | # Useful Links 2 | Within this directory you will find child directories for various XCS classes, within each of these XCS class directories there is a README.md file that contains links to online resources which relate to the class materials. We categorise resources as follows: 3 | 4 | * Research Papers ([example](https://arxiv.org/abs/2109.04617)) 5 | * Code ([example](https://github.com/suraj-nair-1/lorel)) 6 | * Blog Posts ([example](https://ai.stanford.edu/blog/meta-exploration/)) 7 | * Talks/Presentations ([example](https://www.youtube.com/watch?v=733m6qBH-jI)) 8 | * Podcasts ([example](https://www.eye-on.ai/podcast-044)) 9 | 10 | The purpose of including these links is to help students navigate some of the available online materials that relate to the class content and to help instigate additional discussion of the class materials. We hope you find value in exploring the available links and we look forward to discussing their content with you. 11 | 12 | # Contribution Guide 13 | If you have a resource you believe would be of value to add to the current lists please open a pull request with changes that include a link to the resource you wish to add. There are a few conditions that your pull request needs to follow in order to be reviewed: 14 | 15 | 1. You must ensure that your resource link is included under the header that corresponds to its resource type 16 | 2. You must ensure that a link to the same resource doesn't already exist in the list 17 | 3. 
You must ensure that the link is openly accessible and not behind a paywall
18 | 4. Please add your changes at the top of the current list, as we wish to maintain chronological order
19 | 5. Please also include label(s) that relate to the resource you are including (e.g. [`AI` | `robotics`])
20 |
21 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Course Handouts
2 | This directory contains the source code for compiling course handouts. The documents in
3 | this directory **are not** the assignment handouts.
4 |
5 | Each subdirectory within this folder has, at a minimum, a file titled
6 | `latexmkrc`. This is the settings file for latexmk, which will handle
7 | juggling the various latex engines preferred by the course staff. A basic
8 | `latexmkrc` file (e.g., for `pdflatex`) might have the following contents
9 | ([`latexmk` documentation](https://mirror.las.iastate.edu/tex-archive/support/latexmk/latexmk.pdf)):
10 | ```
11 | @default_files = ("main.tex"); # Set the root tex file for the output document
12 | $pdf_mode = 1; # tex -> PDF
13 | $auto_rc_use = 1; # Do not read further latexmkrc files
14 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code.
15 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S";
16 | # Forces latexmk to stop and quit if it encounters an error
17 | $jobname = "output_name"; # This is the name of the output PDF file
18 | $silent = 1; # For quieter output on the terminal.
19 | ```
20 | Feel free to customize this as you desire, including adding more files,
21 | directories, and media. There is only one requirement:
22 |
23 | **IT MUST BE POSSIBLE TO COMPILE EACH DOCUMENT USING ONLY THE FOLLOWING COMMAND:**
24 | ```
25 | $ latexmk
26 | ```
27 | A properly set up `latexmkrc` file can handle any special compilation options you
28 | may require. Put those options in the `latexmkrc` file so that other course
29 | staff can compile your document with the command above.
30 |
31 | Other commands that might be helpful include:
32 | - `$ latexmk -pvc`: (preview continuously) This will run `latexmk`
33 | continuously, allowing you to immediately view changes to your output document
34 | as you save source files.
35 | - `$ latexmk -c`: This will remove all auxiliary files other than the final
36 | output PDF.
37 | - `$ latexmk -C`: This will remove all output files (including the final output
38 | PDF).
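
As a concrete example of such customization, the handout directories in this repository extend the basic file above with an output directory, a document-specific `$jobname`, and a custom dependency that converts `.pytex` files to `.tex` with `py2tex.py`. A sketch adapted from `Conda Setup/latexmkrc`:
```
$out_dir = "..";              # Create output files in the parent directory
$jobname = "Anaconda_Setup";  # This is the name of the output PDF file

# Convert .pytex files to .tex using the repository's py2tex.py script
add_cus_dep('pytex','tex',0,'py2tex');
sub py2tex {
    system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\"");
}
```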
-------------------------------------------------------------------------------- /Linear Algebra Review/linalg2.toc: -------------------------------------------------------------------------------- 1 | \contentsline {section}{\numberline {1}Basic Concepts and Notation}{2} 2 | \contentsline {subsection}{\numberline {1.1}Basic Notation}{2} 3 | \contentsline {section}{\numberline {2}Matrix Multiplication}{3} 4 | \contentsline {subsection}{\numberline {2.1}Vector-Vector Products}{3} 5 | \contentsline {subsection}{\numberline {2.2}Matrix-Vector Products}{4} 6 | \contentsline {subsection}{\numberline {2.3}Matrix-Matrix Products}{5} 7 | \contentsline {section}{\numberline {3}Operations and Properties}{7} 8 | \contentsline {subsection}{\numberline {3.1}The Identity Matrix and Diagonal Matrices}{8} 9 | \contentsline {subsection}{\numberline {3.2}The Transpose}{8} 10 | \contentsline {subsection}{\numberline {3.3}Symmetric Matrices}{8} 11 | \contentsline {subsection}{\numberline {3.4}The Trace}{9} 12 | \contentsline {subsection}{\numberline {3.5}Norms}{10} 13 | \contentsline {subsection}{\numberline {3.6}Linear Independence and Rank}{11} 14 | \contentsline {subsection}{\numberline {3.7}The Inverse of a Square Matrix}{11} 15 | \contentsline {subsection}{\numberline {3.8}Orthogonal Matrices}{12} 16 | \contentsline {subsection}{\numberline {3.9}Range and Nullspace of a Matrix}{13} 17 | \contentsline {subsection}{\numberline {3.10}The Determinant}{14} 18 | \contentsline {subsection}{\numberline {3.11}Quadratic Forms and Positive Semidefinite Matrices}{17} 19 | \contentsline {subsection}{\numberline {3.12}Eigenvalues and Eigenvectors}{18} 20 | \contentsline {subsection}{\numberline {3.13}Eigenvalues and Eigenvectors of Symmetric Matrices}{19} 21 | \contentsline {paragraph}{Background: representing vector w.r.t. another basis.}{20} 22 | \contentsline {paragraph}{``Diagonalizing'' matrix-vector multiplication.}{21} 23 | \contentsline {paragraph}{``Diagonalizing'' quadratic form.}{21} 24 | \contentsline {section}{\numberline {4}Matrix Calculus}{23} 25 | \contentsline {subsection}{\numberline {4.1}The Gradient}{23} 26 | \contentsline {subsection}{\numberline {4.2}The Hessian}{24} 27 | \contentsline {subsection}{\numberline {4.3}Gradients and Hessians of Quadratic and Linear Functions}{26} 28 | \contentsline {subsection}{\numberline {4.4}Least Squares}{27} 29 | \contentsline {subsection}{\numberline {4.5}Gradients of the Determinant}{28} 30 | \contentsline {subsection}{\numberline {4.6}Eigenvalues as Optimization}{28} 31 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # celery beat schedule file 95 | celerybeat-schedule 96 | 97 | # SageMath parsed files 98 | *.sage.py 99 | 100 | # Environments 101 | .env 102 | .venv 103 | env/ 104 | venv/ 105 | ENV/ 106 | env.bak/ 107 | venv.bak/ 108 | 109 | # Spyder project settings 110 | .spyderproject 111 | .spyproject 112 | 113 | # Rope project settings 114 | .ropeproject 115 | 116 | # mkdocs documentation 117 | /site 118 | 119 | # mypy 120 | .mypy_cache/ 121 | .dmypy.json 122 | dmypy.json 123 | 124 | # Pyre type checker 125 | .pyre/ 126 | 127 | # For .DS_Stores and main generated files 128 | .DS_Store 129 | *.aux 130 | *.log 131 | *.fdb_latexmk 132 | *.fls 133 | *.out 134 | *.synctex.gz 135 | main.pdf -------------------------------------------------------------------------------- /Probability Review/prob_slides.tex: -------------------------------------------------------------------------------- 1 | \documentclass{beamer} 2 | \usepackage[utf8]{inputenc} 3 | \usepackage{amsmath} 4 | \usepackage{amsfonts} 5 | \usepackage{amssymb} 6 | %\usepackage[margin=0.5in]{geometry} 7 | \usepackage[normalem]{ulem} 8 | \usepackage{graphicx} 9 | %\usetheme{Darmstadt} 10 | \title{CS229} 11 | \author{Probability Theory Review} 12 | \date{} 13 | \begin{document} 14 | \begin{frame} 15 | \titlepage 16 | \end{frame} 17 | \begin{frame} 18 | \frametitle{Elements of Probability} 19 | \pause 20 | \begin{enumerate} 21 | \itemsep1em 22 | \item[] \textbf{Sample Space} $\Omega$ 23 | \pause 24 | \item[] \indent $\qquad \{HH, HT, TH, TT\}$ 25 | \pause 26 | \item[] \textbf{Event} $A\subseteq \Omega$ 27 | \pause 28 | \item[] \indent $\qquad \{HH, HT\}$, \pause $\Omega$ 29 | \pause 30 | \item[] \textbf{Event Space} $\mathcal{F}$ 31 | \pause 32 | \item[] \textbf{Probability Measure} $P:\mathcal{F}\rightarrow\mathbb{R}$ 33 | \begin{enumerate} 34 | \itemsep1em 35 | \pause 36 | \item[] $P(A)\geq 0\quad \forall A\in\mathcal{F}$ 37 | \pause 38 | \item[] $P(\Omega)=1$ 39 | \pause 40 | \item[] If $A_1,A_2,...$ disjoint set of events ($A_i\cap A_j=\emptyset$ when $i\neq j$), then 41 | $$P\left(\bigcup_i A_i\right)=\sum_iP(A_i)$$ 42 | \end{enumerate} 43 | \end{enumerate} 44 | \end{frame} 45 | \begin{frame} 46 | \frametitle{Conditional Probability and Independence} 47 | \pause 48 | \begin{enumerate} 49 | \itemsep3em 50 | \item[] Let $B$ be any event such that $P(B)\neq 0$. 
51 | \pause 52 | \item[] $P(A|B):=\frac{P(A\cap B)}{P(B)}$ 53 | \pause 54 | \item[] $A\perp B$ if and only if $P(A\cap B)=P(A)P(B)$ 55 | \pause 56 | \item[] $A\perp B$ if and only if $P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(A)P(B)}{P(B)}=P(A)$ 57 | \end{enumerate} 58 | \end{frame} 59 | \begin{frame} 60 | \frametitle{Random Variables (RV)} 61 | \pause 62 | \begin{center}$\omega_0=HHHTHTTHTT$\end{center} 63 | \pause 64 | \begin{enumerate} 65 | \itemsep2em 66 | \item[] A \textbf{RV} is $X:\Omega\rightarrow\mathbb{R}$ 67 | \pause 68 | \item[] \indent $\qquad$ \# of heads: $X(\omega_0)=5$ 69 | \pause 70 | \item[] \indent $\qquad$ \# of tosses until tails: $X(\omega_0)=4$ 71 | \pause 72 | \item[] $Val(X):=X(\Omega)$ 73 | \pause 74 | \item[] \indent $\qquad Val(X)=\{0,1,...,10\}$ 75 | \end{enumerate} 76 | \end{frame} 77 | \begin{frame} 78 | \frametitle{Cumulative Distribution Function (CDF)} 79 | \pause 80 | \begin{enumerate} 81 | \itemsep3em 82 | \item[] $F_X:\mathbb{R}\rightarrow[0, 1]$ 83 | \pause 84 | \item[] \indent $\qquad F_X(x)=P(X\leq x)$\pause $:=P(\{\omega|X(\omega)\leq x\})$ 85 | \pause 86 | \item[] \includegraphics[width=2in]{cdf.png} 87 | \end{enumerate} 88 | \end{frame} 89 | \begin{frame} 90 | \frametitle{Discrete vs. Continuous RV} 91 | \pause 92 | \begin{enumerate} 93 | \itemsep0.5em 94 | \item[] \textbf{Discrete RV}: $Val(X)$ is countable 95 | \pause 96 | \item[] $\qquad P(X=k):=P(\{\omega|X(\omega)=k\})$ 97 | \pause 98 | \item[] $\qquad$Probability Mass Function (PMF): $p_X:Val(X)\rightarrow [0,1]$ 99 | \pause 100 | \item[] $\qquad\qquad p_X(x):=P(X=x)$ 101 | \pause 102 | \item[] $\qquad\qquad \underset{x\in Val(X)}{\sum}p_X(x)=1$ 103 | \pause 104 | \item[] \textbf{Continuous RV}: $Val(X)$ is uncountable 105 | \pause 106 | \item[] \indent $\qquad P(a\leq X\leq b):=P(\{\omega|a\leq X(\omega)\leq b\})$ 107 | \pause 108 | \item[] $\qquad$Probability Density Function (PDF): $f_X:\mathbb{R}\rightarrow\mathbb{R}$ 109 | \pause 110 | \item[] $\qquad\qquad f_X(x):=\frac{d}{dx}F_X(x)$ 111 | \pause 112 | \item[] $\qquad\qquad f_X(x)\neq P(X=x)$ 113 | \pause 114 | \item[] $\qquad\qquad \int_{-\infty}^{\infty}\underbrace{f_X(x)dx\pause}_{P(x\leq X\leq x+dx)}=1$ 115 | \end{enumerate} 116 | \end{frame} 117 | 118 | \begin{frame} 119 | \frametitle{Expected Value and Variance} 120 | \pause 121 | \begin{enumerate} 122 | \itemsep1em 123 | \item[] $g:\mathbb{R}\rightarrow\mathbb{R}$ 124 | \pause 125 | \item[] \textbf{Expected Value} 126 | \pause 127 | \item[] $\qquad$ Let $X$ be a discrete RV with PMF $p_X$. 128 | \pause 129 | \item[] $\qquad\qquad \mathbb{E}[g(X)]:=\underset{x\in Val(X)}{\sum}g(x)p_X(x)$ 130 | \pause 131 | \item[] $\qquad$ Let $X$ be a continuous RV with PDF $f_X$. 132 | \pause 133 | \item[] $\qquad\qquad \mathbb{E}[g(X)]:=\int_{-\infty}^{\infty}g(x)f_X(x)dx$ 134 | \pause 135 | \item[] \textbf{Variance} 136 | \pause 137 | \item[] $\qquad Var(X):=\mathbb{E}[(X-\mathbb{E}[X])^2]$\pause $ =\mathbb{E}[X^2]-\mathbb{E}[X]^2$ 138 | \end{enumerate} 139 | \end{frame} 140 | 141 | \begin{frame} 142 | \frametitle{Example Distributions} 143 | \begin{center} 144 | \small 145 | $$\begin{array}{|l|l|c|c|} 146 | \hline 147 | \text{Distribution} & \text{PDF or PMF} & \text{Mean} & \text{Variance} \\ 148 | \hline 149 | Bernoulli(p) & \left\{\begin{array}{ll}p,&\text{if }x=1\\1-p,&\text{if }x=0.\end{array}\right. & p & p(1-p) \\ 150 | \hline 151 | Binomial(n,p) & \binom{n}{k}p^k(1-p)^{n-k}\text{ for }k=0,1,...,n & np & np(1-p) \\ 152 | \hline 153 | Geometric(p) & p(1-p)^{k-1}\text{ for }k=1,2,... 
& \frac1p & \frac{1-p}{p^2} \\ 154 | \hline 155 | Poisson(\lambda) & \frac{e^{-\lambda}\lambda^k}{k!}\text{ for }k=0,1,... & \lambda & \lambda \\ 156 | \hline 157 | Uniform(a,b) & \frac{1}{b-a}\text{ for all } x\in(a,b) & \frac{a+b}{2} & \frac{(b-a)^2}{12} \\ 158 | \hline 159 | Gaussian(\mu,\sigma^2) & \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\text{ for all } x\in(-\infty, \infty) & \mu & \sigma^2 \\ 160 | \hline 161 | Exponential(\lambda) & \lambda e^{-\lambda x}\text{ for all } x\geq 0, \lambda\geq 0 & \frac{1}{\lambda} & \frac{1}{\lambda^2}\\ 162 | \hline 163 | \end{array} $$ 164 | \end{center} 165 | \end{frame} 166 | \begin{frame} 167 | \frametitle{Two Random Variables} 168 | \begin{enumerate} 169 | \itemsep2em 170 | \item $F_{XY}(x,y)=P(X\leq x,Y\leq y)$ 171 | \item $p_{XY}(x,y)=P(X=x,Y=y)$ 172 | \item $p_X(x)=\sum_yp_{XY}(x,y)$ 173 | \item $f_{XY}(x,y)=\frac{\partial^2 F_{XY}(x,y)}{\partial x\partial y}$ 174 | \item $f_X(x)=\int_{-\infty}^{\infty}f_{XY}(x,y)dy$ 175 | \end{enumerate} 176 | \end{frame} 177 | 178 | \end{document} -------------------------------------------------------------------------------- /Linux Git Guide/main.tex: -------------------------------------------------------------------------------- 1 | \documentclass[10pt,landscape]{article} 2 | \usepackage{multicol} 3 | \usepackage{calc} 4 | \usepackage{ifthen} 5 | \usepackage[landscape]{geometry} 6 | \usepackage{amsmath,amsthm,amsfonts,amssymb} 7 | \usepackage{color,graphicx,overpic} 8 | \usepackage{hyperref} 9 | \usepackage{listings} 10 | \lstset{ 11 | basicstyle=\ttfamily, 12 | columns=fullflexible, 13 | frame=single, 14 | breaklines=true, 15 | postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space}, 16 | } 17 | 18 | \pdfinfo{ 19 | /Title (Linux and Git Guide.pdf) 20 | /Creator (TeX) 21 | /Producer (pdfTeX 1.40.0) 22 | /Author (Armando Banuelos) 23 | /Subject (Linux and Git Guide) 24 | /Keywords (pdflatex, latex,pdftex,tex)} 25 | 26 | % This sets page margins to .5 inch if using letter paper, and to 1cm 27 | % if using A4 paper. (This probably isn't strictly necessary.) 28 | % If using another size paper, use default 1cm margins. 
29 | \ifthenelse{\lengthtest { \paperwidth = 11in}}
30 | { \geometry{top=.5in,left=.25in,right=.25in,bottom=.5in} }
31 | {\ifthenelse{ \lengthtest{ \paperwidth = 297mm}}
32 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
33 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
34 | }
35 |
36 | % Turn off header and footer
37 | \pagestyle{empty}
38 |
39 | % Redefine section commands to use less space
40 | \makeatletter
41 | \renewcommand{\section}{\@startsection{section}{1}{0mm}%
42 | {-1ex plus -.5ex minus -.2ex}%
43 | {0.5ex plus .2ex}%x
44 | {\normalfont\large\bfseries}}
45 | \renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}%
46 | {-1explus -.5ex minus -.2ex}%
47 | {0.5ex plus .2ex}%
48 | {\normalfont\normalsize\bfseries}}
49 | \renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}%
50 | {-1ex plus -.5ex minus -.2ex}%
51 | {1ex plus .2ex}%
52 | {\normalfont\small\bfseries}}
53 | \makeatother
54 |
55 | % Define BibTeX command
56 | \def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
57 | T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
58 |
59 | % Don't print section numbers
60 | \setcounter{secnumdepth}{0}
61 |
62 |
63 | \setlength{\parindent}{0pt}
64 | \setlength{\parskip}{0pt plus 0.5ex}
65 |
66 | %My Environments
67 | \newtheorem{example}[section]{Example}
68 | % -----------------------------------------------------------------------
69 |
70 | \begin{document}
71 | % \raggedright
72 | % \footnotesize
73 | \begin{multicols}{2}
74 |
75 |
76 | % multicol parameters
77 | % These lengths are set only within the two main columns
78 | %\setlength{\columnseprule}{0.25pt}
79 | \setlength{\premulticols}{1pt}
80 | \setlength{\postmulticols}{1pt}
81 | \setlength{\multicolsep}{1pt}
82 | \setlength{\columnsep}{2pt}
83 |
84 | \begin{center}
85 | \Large{Linux and Git Guide} \\
86 | \end{center}
87 |
88 | \section{Git - Keeping Your Code Updated}
89 |
90 | Git is software used for tracking changes in files and for collaborative code development.
91 | GitHub is a provider of Internet hosting for software development and version control using Git. \\
92 |
93 | Below is the recommended Git flow you should follow for assignment development. \\
94 |
95 | This guide assumes you have a GitHub account with token-based or SSH-based authentication set up. If not, please follow this \href{https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent}{tutorial} to create an SSH key and link it to your GitHub account so you can perform Git actions. \\
96 |
97 | 1. First start off by cloning a git repo to your local machine.
98 | \begin{lstlisting}[language=SQL]
99 | git clone git@github.com:scpd-proed/<repo_name>.git
100 | \end{lstlisting}
101 |
102 | 2. Create a branch to do your own development
103 | \begin{lstlisting}[language=SQL]
104 | git checkout -b <branch_name>
105 | \end{lstlisting}
106 |
107 | 3. If CFs make updates to the assignment, commit your unsaved changes
108 | \begin{lstlisting}[language=SQL]
109 | git commit -am "<commit message>"
110 | \end{lstlisting}
111 |
112 | 4. Merge updates from the assignment to your local branch
113 | \begin{lstlisting}[language=SQL]
114 | git pull origin master
115 | \end{lstlisting}
116 |
117 | After performing step 4, you may experience conflicts. To see which files are in conflict, please run
118 | \begin{lstlisting}[language=SQL]
119 | git status
120 | \end{lstlisting}
121 |
122 | The files in conflict are those listed as ``Not staged for commit''.
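Inside each conflicted file, Git marks the competing versions with conflict markers. A conflicted region will look roughly like the following sketch (the label after the last marker depends on the branch or commit you pulled):
\begin{lstlisting}
<<<<<<< HEAD
your local version of these lines
=======
the incoming version from the update
>>>>>>> master
\end{lstlisting}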
Open these files in the editor of your choice and look for the lines marked \texttt{<<<<<<<}, \texttt{=======}, and \texttt{>>>>>>>}; keep the version you want and delete the marker lines.\\
123 |
124 | Once you have resolved the conflicts, run
125 | \begin{lstlisting}[language=SQL]
126 | git add -A .
127 | \end{lstlisting}
128 |
129 | And lastly, commit your updates to your branch
130 | \begin{lstlisting}[language=SQL]
131 | git commit -am "<commit message>"
132 | \end{lstlisting}
133 |
134 | Please DO NOT push your local branch changes to GitHub or create pull requests. This will expose your solution code from your development branch and is in violation of the honor code.
135 |
136 | \section{Linux - Commands You Should Know}
137 | Linux is an open-source Unix-like operating system. All our assignments assume you have some understanding of how to run commands on Linux machines. \\
138 |
139 | Below we will go through popular Linux commands you should know for assignment development.\\
140 |
141 | 1. Listing directory (ls) - see files in your current directory
142 | \begin{lstlisting}[language=SQL]
143 | ls
144 | \end{lstlisting}
145 |
146 | 2. Changing directory (cd) - move to another directory
147 | \begin{lstlisting}[language=SQL]
148 | cd /path/to/directory
149 | \end{lstlisting}
150 |
151 | 3. Print Working Directory (pwd) - display the pathname of the current working directory
152 | \begin{lstlisting}[language=SQL]
153 | pwd
154 | \end{lstlisting}
155 |
156 | 4. Display file contents (cat)
157 | \begin{lstlisting}[language=SQL]
158 | cat file.txt
159 | \end{lstlisting}
160 |
161 | 5. Get a clear console window (clear)
162 | \begin{lstlisting}[language=SQL]
163 | clear
164 | \end{lstlisting}
165 |
166 | 6. Copying files (cp)
167 | \begin{lstlisting}[language=SQL]
168 | cp source destination
169 | \end{lstlisting}
170 |
171 | 7. Remove a file (rm)
172 | \begin{lstlisting}[language=SQL]
173 | rm file.txt
174 | \end{lstlisting}
175 |
176 | 8. Zipping up files (zip)
177 | \begin{lstlisting}[language=SQL]
178 | zip myfile.zip filename.txt
179 | \end{lstlisting}
180 |
181 | 9. Remotely log into another machine (ssh) (will be useful for using Azure GPUs)
182 | \begin{lstlisting}[language=SQL]
183 | ssh user@machine
184 | \end{lstlisting}
185 |
186 | 10. Rename or move a file (mv)
187 | \begin{lstlisting}[language=SQL]
188 | mv source destination
189 | \end{lstlisting}
190 |
191 |
192 | \end{multicols}
193 | \end{document}
--------------------------------------------------------------------------------
/Linux Git Guide/Linux Git Guide/main.tex:
--------------------------------------------------------------------------------
1 | \documentclass[10pt,landscape]{article}
2 | \usepackage{multicol}
3 | \usepackage{calc}
4 | \usepackage{ifthen}
5 | \usepackage[landscape]{geometry}
6 | \usepackage{amsmath,amsthm,amsfonts,amssymb}
7 | \usepackage{color,graphicx,overpic}
8 | \usepackage{hyperref}
9 | \usepackage{listings}
10 | \lstset{
11 | basicstyle=\ttfamily,
12 | columns=fullflexible,
13 | frame=single,
14 | breaklines=true,
15 | postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space},
16 | }
17 |
18 | \pdfinfo{
19 | /Title (Linux and Git Guide.pdf)
20 | /Creator (TeX)
21 | /Producer (pdfTeX 1.40.0)
22 | /Author (Armando Banuelos)
23 | /Subject (Linux and Git Guide)
24 | /Keywords (pdflatex, latex,pdftex,tex)}
25 |
26 | % This sets page margins to .5 inch if using letter paper, and to 1cm
27 | % if using A4 paper. (This probably isn't strictly necessary.)
28 | % If using another size paper, use default 1cm margins.
29 | \ifthenelse{\lengthtest { \paperwidth = 11in}} 30 | { \geometry{top=.5in,left=.25in,right=.25in,bottom=.5in} } 31 | {\ifthenelse{ \lengthtest{ \paperwidth = 297mm}} 32 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} } 33 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} } 34 | } 35 | 36 | % Turn off header and footer 37 | \pagestyle{empty} 38 | 39 | % Redefine section commands to use less space 40 | \makeatletter 41 | \renewcommand{\section}{\@startsection{section}{1}{0mm}% 42 | {-1ex plus -.5ex minus -.2ex}% 43 | {0.5ex plus .2ex}%x 44 | {\normalfont\large\bfseries}} 45 | \renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}% 46 | {-1explus -.5ex minus -.2ex}% 47 | {0.5ex plus .2ex}% 48 | {\normalfont\normalsize\bfseries}} 49 | \renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}% 50 | {-1ex plus -.5ex minus -.2ex}% 51 | {1ex plus .2ex}% 52 | {\normalfont\small\bfseries}} 53 | \makeatother 54 | 55 | % Define BibTeX command 56 | \def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em 57 | T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}} 58 | 59 | % Don't print section numbers 60 | \setcounter{secnumdepth}{0} 61 | 62 | 63 | \setlength{\parindent}{0pt} 64 | \setlength{\parskip}{0pt plus 0.5ex} 65 | 66 | %My Environments 67 | \newtheorem{example}[section]{Example} 68 | % ----------------------------------------------------------------------- 69 | 70 | \begin{document} 71 | % \raggedright 72 | % \footnotesize 73 | \begin{multicols}{2} 74 | 75 | 76 | % multicol parameters 77 | % These lengths are set only within the two main columns 78 | %\setlength{\columnseprule}{0.25pt} 79 | \setlength{\premulticols}{1pt} 80 | \setlength{\postmulticols}{1pt} 81 | \setlength{\multicolsep}{1pt} 82 | \setlength{\columnsep}{2pt} 83 | 84 | \begin{center} 85 | \Large{Linux and Git Guide} \\ 86 | \end{center} 87 | 88 | \section{Git - Keeping Your Code Updated} 89 | 90 | Git is a software used for tracking changes in files and collaborative code development. 91 | GitHub is a provider of Internet hosting for software development and version control using Git. \\ 92 | 93 | Below is the recommended Git flow you should follow for assignment development. \\ 94 | 95 | This guide assumes you have a GitHub account with token based authentication. If not, please follow this \href{https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent}{tutorial} to create an ssh key and link it to your GitHub account to perform Git actions. \\ 96 | 97 | 1. First start off by cloning a git repo to your local machine. 98 | \begin{lstlisting}[language=SQL] 99 | git clone git@github.com:scpd-proed/.git 100 | \end{lstlisting} 101 | 102 | 2. Create a branch to do your own development 103 | \begin{lstlisting}[language=SQL] 104 | git checkout -b 105 | \end{lstlisting} 106 | 107 | 3. If CFs make updates to the assignment commit your unsaved changes 108 | \begin{lstlisting}[language=SQL] 109 | git commit -am "" 110 | \end{lstlisting} 111 | 112 | 4. Merge updates from the assignment to your local branch 113 | \begin{lstlisting}[language=SQL] 114 | git pull origin master 115 | \end{lstlisting} 116 | 117 | After performing step 4, you may experience conflicts. To see which files are in conflict, please run 118 | \begin{lstlisting}[language=SQL] 119 | git status 120 | \end{lstlisting} 121 | 122 | The files in conflict are those "Not staged for commit". 
Open these files in the editor of your choice and look for lines \texttt{<<<<} and \texttt{>>>>}.\\ 123 | 124 | Once you have resolved the conflicts, run 125 | \begin{lstlisting}[language=SQL] 126 | git add -A . 127 | \end{lstlisting} 128 | 129 | And lastly commit your updates to your branch 130 | \begin{lstlisting}[language=SQL] 131 | git commit -am "" 132 | \end{lstlisting} 133 | 134 | Please DO NOT push your local branch changes to GitHub or create pull requests. This will expose your solution code from your development branch and is in violation of the honor code. 135 | 136 | \section{Linux - Commands You Should Know} 137 | Linux is an open-source Unix-like operating system. All our assignments assume you have some understanding on how to run commands on Linux machines. \\ 138 | 139 | Below we will go through popular Linux Commands you should know for assignment development.\\ 140 | 141 | 1. Listing directory (ls) - see files in your current directory 142 | \begin{lstlisting}[language=SQL] 143 | ls 144 | \end{lstlisting} 145 | 146 | 2. Changing directory (cd) - move to another directory 147 | \begin{lstlisting}[language=SQL] 148 | cd /path/to/directory 149 | \end{lstlisting} 150 | 151 | 3. Path Working Directory (pwd) - display pathname of current working directory 152 | \begin{lstlisting}[language=SQL] 153 | pwd 154 | \end{lstlisting} 155 | 156 | 4. Display file contents (cat) 157 | \begin{lstlisting}[language=SQL] 158 | cat file.txt 159 | \end{lstlisting} 160 | 161 | 5. Get a clear console window (clear) 162 | \begin{lstlisting}[language=SQL] 163 | clear 164 | \end{lstlisting} 165 | 166 | 6. Copying files (cp) 167 | \begin{lstlisting}[language=SQL] 168 | cp source destination 169 | \end{lstlisting} 170 | 171 | 7. Remove a file (rm) 172 | \begin{lstlisting}[language=SQL] 173 | rm file.txt 174 | \end{lstlisting} 175 | 176 | 8. Zipping up files (zip) 177 | \begin{lstlisting}[language=SQL] 178 | zip myfile.zip filename.txt 179 | \end{lstlisting} 180 | 181 | 9. Remote log into another machine (ssh) (will be useful for using Azure GPUs) 182 | \begin{lstlisting}[language=SQL] 183 | ssh user@machine 184 | \end{lstlisting} 185 | 186 | 10. Rename or move a file (mv) 187 | \begin{lstlisting}[language=SQL] 188 | mv source destination 189 | \end{lstlisting} 190 | 191 | 192 | \end{multicols} 193 | \end{document} -------------------------------------------------------------------------------- /Probability Review/nips07submit_e.sty: -------------------------------------------------------------------------------- 1 | %%%% NIPS Macros (LaTex) 2 | %%%% Style File 3 | %%%% Dec 12, 1990 Rev Aug 14, 1991; Sept, 1995; April, 1997; April, 1999 4 | 5 | % This file can be used with Latex2e whether running in main mode, or 6 | % 2.09 compatibility mode. 7 | % 8 | % If using main mode, you need to include the commands 9 | % \documentclass{article} 10 | % \usepackage{nips06submit_e,times} 11 | % as the first lines in your document. Or, if you do not have Times 12 | % Roman font available, you can just use 13 | % \documentclass{article} 14 | % \usepackage{nips06submit_e} 15 | % instead. 16 | % 17 | % If using 2.09 compatibility mode, you need to include the command 18 | % \documentstyle[nips06submit_09,times]{article} 19 | % as the first line in your document. Or, if you do not have Times 20 | % Roman font available, you can include the command 21 | % \documentstyle[nips06submit_09]{article} 22 | % instead. 23 | 24 | 25 | % Change the overall width of the page. 
If these parameters are 26 | % changed, they will require corresponding changes in the 27 | % maketitle section. 28 | % 29 | \renewcommand{\topfraction}{0.95} % let figure take up nearly whole page 30 | \renewcommand{\textfraction}{0.05} % let figure take up nearly whole page 31 | 32 | % Specify the dimensions of each page 33 | 34 | \setlength{\paperheight}{11in} 35 | \setlength{\paperwidth}{8.5in} 36 | 37 | \oddsidemargin .5in % Note \oddsidemargin = \evensidemargin 38 | \evensidemargin .5in 39 | \marginparwidth 0.07 true in 40 | %\marginparwidth 0.75 true in 41 | %\topmargin 0 true pt % Nominal distance from top of page to top of 42 | %\topmargin 0.125in 43 | \topmargin -0.625in 44 | \addtolength{\headsep}{0.25in} 45 | \textheight 9.0 true in % Height of text (including footnotes & figures) 46 | \textwidth 5.5 true in % Width of text line. 47 | \widowpenalty=10000 48 | \clubpenalty=10000 49 | 50 | % \thispagestyle{empty} \pagestyle{empty} 51 | \flushbottom \sloppy 52 | 53 | % We're never going to need a table of contents, so just flush it to 54 | % save space --- suggested by drstrip@sandia-2 55 | \def\addcontentsline#1#2#3{} 56 | 57 | % Title stuff, taken from deproc. 58 | \def\maketitle{\par 59 | \begingroup 60 | \def\thefootnote{\fnsymbol{footnote}} 61 | \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author 62 | % name centering 63 | % The footnote-mark was overlapping the footnote-text, 64 | % added the following to fix this problem (MK) 65 | \long\def\@makefntext##1{\parindent 1em\noindent 66 | \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} 67 | \@maketitle \@thanks 68 | \endgroup 69 | \setcounter{footnote}{0} 70 | \let\maketitle\relax \let\@maketitle\relax 71 | \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax} 72 | 73 | % The toptitlebar has been raised to top-justify the first page 74 | % anonymized version: 75 | 76 | \def\makeanontitle{\par 77 | \begingroup 78 | \def\thefootnote{\fnsymbol{footnote}} 79 | \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author 80 | % name centering 81 | % The footnote-mark was overlapping the footnote-text, 82 | % added the following to fix this problem (MK) 83 | \long\def\@makefntext##1{\parindent 1em\noindent 84 | \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} 85 | \@makeanontitle \@thanks 86 | \endgroup 87 | \setcounter{footnote}{0} 88 | \let\makeanontitle\relax \let\@makeanontitle\relax 89 | \gdef\@thanks{}\gdef\@title{}\let\thanks\relax} 90 | 91 | 92 | 93 | 94 | % The non-anonmyous version 95 | \def\@maketitle{\vbox{\hsize\textwidth 96 | \linewidth\hsize \vskip 0.1in \toptitlebar \centering 97 | {\LARGE\bf \@title\par} \bottomtitlebar % \vskip 0.1in % minus 98 | \def\And{\end{tabular}\hfil\linebreak[0]\hfil 99 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 100 | \def\AND{\end{tabular}\hfil\linebreak[4]\hfil 101 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 102 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\@author\end{tabular}% 103 | \vskip 0.3in minus 0.1in}} 104 | 105 | % The anonymized version, doesn't matter what people list for authors 106 | \def\@makeanontitle{\vbox{\hsize\textwidth 107 | \linewidth\hsize \vskip 0.1in \toptitlebar \centering 108 | {\LARGE\bf \@title\par} \bottomtitlebar % \vskip 0.1in % minus 109 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt} 110 | Arian Maleki and Tom Do \\ 111 | Stanford University\\ 112 | \end{tabular}% 113 | \vskip 0.3in minus 0.1in}} 114 | % end anonymized version 115 | 116 | 
\renewenvironment{abstract}{\vskip.075in\centerline{\large\bf 117 | Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex} 118 | 119 | % sections with less space 120 | \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus 121 | -0.5ex minus -.2ex}{1.5ex plus 0.3ex 122 | minus0.2ex}{\large\bf\raggedright}} 123 | 124 | \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus 125 | -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}} 126 | \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex 127 | plus -0.5ex minus -.2ex}{0.5ex plus 128 | .2ex}{\normalsize\bf\raggedright}} 129 | \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus 130 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 131 | \def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus 132 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 133 | \def\subsubsubsection{\vskip 134 | 5pt{\noindent\normalsize\rm\raggedright}} 135 | 136 | 137 | % Footnotes 138 | \footnotesep 6.65pt % 139 | \skip\footins 9pt plus 4pt minus 2pt 140 | \def\footnoterule{\kern-3pt \hrule width 12pc \kern 2.6pt } 141 | \setcounter{footnote}{0} 142 | 143 | % Lists and paragraphs 144 | \parindent 0pt 145 | \topsep 4pt plus 1pt minus 2pt 146 | \partopsep 1pt plus 0.5pt minus 0.5pt 147 | \itemsep 2pt plus 1pt minus 0.5pt 148 | \parsep 2pt plus 1pt minus 0.5pt 149 | \parskip .5pc 150 | 151 | 152 | %\leftmargin2em 153 | \leftmargin3pc 154 | \leftmargini\leftmargin \leftmarginii 2em 155 | \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em 156 | 157 | %\labelsep \labelsep 5pt 158 | 159 | \def\@listi{\leftmargin\leftmargini} 160 | \def\@listii{\leftmargin\leftmarginii 161 | \labelwidth\leftmarginii\advance\labelwidth-\labelsep 162 | \topsep 2pt plus 1pt minus 0.5pt 163 | \parsep 1pt plus 0.5pt minus 0.5pt 164 | \itemsep \parsep} 165 | \def\@listiii{\leftmargin\leftmarginiii 166 | \labelwidth\leftmarginiii\advance\labelwidth-\labelsep 167 | \topsep 1pt plus 0.5pt minus 0.5pt 168 | \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt 169 | \itemsep \topsep} 170 | \def\@listiv{\leftmargin\leftmarginiv 171 | \labelwidth\leftmarginiv\advance\labelwidth-\labelsep} 172 | \def\@listv{\leftmargin\leftmarginv 173 | \labelwidth\leftmarginv\advance\labelwidth-\labelsep} 174 | \def\@listvi{\leftmargin\leftmarginvi 175 | \labelwidth\leftmarginvi\advance\labelwidth-\labelsep} 176 | 177 | \abovedisplayskip 7pt plus2pt minus5pt% 178 | \belowdisplayskip \abovedisplayskip 179 | \abovedisplayshortskip 0pt plus3pt% 180 | \belowdisplayshortskip 4pt plus3pt minus3pt% 181 | 182 | % Less leading in most fonts (due to the narrow columns) 183 | % The choices were between 1-pt and 1.5-pt leading 184 | %\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} % got rid of @ (MK) 185 | \def\normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} 186 | \def\small{\@setsize\small{10pt}\ixpt\@ixpt} 187 | \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt} 188 | \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt} 189 | \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt} 190 | \def\large{\@setsize\large{14pt}\xiipt\@xiipt} 191 | \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt} 192 | \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt} 193 | \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt} 194 | \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt} 195 | 196 | \def\toptitlebar{\hrule height4pt\vskip .25in\vskip-\parskip} 197 | 198 | \def\bottomtitlebar{\vskip .29in\vskip-\parskip\hrule height1pt\vskip 199 | .09in} % 200 | %Reduced second vskip to compensate for adding the strut in 
\@author 201 | -------------------------------------------------------------------------------- /Conda Setup/macros-Fall2018.tex: -------------------------------------------------------------------------------- 1 | \newcommand{\newsec}{\section} 2 | \newcommand{\denselist}{\itemsep 0pt\partopsep 0pt} 3 | \newcommand{\bitem}{\begin{itemize}\denselist} 4 | \newcommand{\eitem}{\end{itemize}} 5 | \newcommand{\benum}{\begin{enumerate}\denselist} 6 | \newcommand{\eenum}{\end{enumerate}} 7 | 8 | \newcommand{\fig}[1]{\private{\begin{center} 9 | {\Large\bf ({#1})} 10 | \end{center}}} 11 | 12 | \newcommand{\cpsf}[1]{{\centerline{\psfig{#1}}}} 13 | \newcommand{\mytitle}[1]{\centerline{\LARGE\bf #1}} 14 | 15 | \newcommand{\myw}{{\bf w}} 16 | 17 | \newcommand{\mypar}[1]{\vspace{1ex}\noindent{\bf {#1}}} 18 | 19 | \def\thmcolon{\hspace{-.85em} {\bf :} } 20 | 21 | \newtheorem{THEOREM}{Theorem}[section] 22 | \newenvironment{theorem}{\begin{THEOREM} \thmcolon }% 23 | {\end{THEOREM}} 24 | \newtheorem{LEMMA}[THEOREM]{Lemma} 25 | \newenvironment{lemma}{\begin{LEMMA} \thmcolon }% 26 | {\end{LEMMA}} 27 | \newtheorem{COROLLARY}[THEOREM]{Corollary} 28 | \newenvironment{corollary}{\begin{COROLLARY} \thmcolon }% 29 | {\end{COROLLARY}} 30 | \newtheorem{PROPOSITION}[THEOREM]{Proposition} 31 | \newenvironment{proposition}{\begin{PROPOSITION} \thmcolon }% 32 | {\end{PROPOSITION}} 33 | \newtheorem{DEFINITION}[THEOREM]{Definition} 34 | \newenvironment{definition}{\begin{DEFINITION} \thmcolon \rm}% 35 | {\end{DEFINITION}} 36 | \newtheorem{CLAIM}[THEOREM]{Claim} 37 | \newenvironment{claim}{\begin{CLAIM} \thmcolon \rm}% 38 | {\end{CLAIM}} 39 | \newtheorem{EXAMPLE}[THEOREM]{Example} 40 | \newenvironment{example}{\begin{EXAMPLE} \thmcolon \rm}% 41 | {\end{EXAMPLE}} 42 | \newtheorem{REMARK}[THEOREM]{Remark} 43 | \newenvironment{remark}{\begin{REMARK} \thmcolon \rm}% 44 | {\end{REMARK}} 45 | %\newenvironment{proof}{\noindent {\bf Proof:} \hspace{.677em}}% 46 | % {} 47 | 48 | %theorem 49 | \newcommand{\thm}{\begin{theorem}} 50 | %lemma 51 | \newcommand{\lem}{\begin{lemma}} 52 | %proposition 53 | \newcommand{\pro}{\begin{proposition}} 54 | %definition 55 | \newcommand{\dfn}{\begin{definition}} 56 | %remark 57 | \newcommand{\rem}{\begin{remark}} 58 | %example 59 | \newcommand{\xam}{\begin{example}} 60 | %corollary 61 | \newcommand{\cor}{\begin{corollary}} 62 | %proof 63 | \newcommand{\prf}{\noindent{\bf Proof:} } 64 | %end theorem 65 | \newcommand{\ethm}{\end{theorem}} 66 | %end lemma 67 | \newcommand{\elem}{\end{lemma}} 68 | %end proposition 69 | \newcommand{\epro}{\end{proposition}} 70 | %end definition 71 | \newcommand{\edfn}{\bbox\end{definition}} 72 | %end remark 73 | \newcommand{\erem}{\bbox\end{remark}} 74 | %end example 75 | \newcommand{\exam}{\bbox\end{example}} 76 | %end corollary 77 | \newcommand{\ecor}{\end{corollary}} 78 | %end proof 79 | \newcommand{\eprf}{\bbox\vspace{0.1in}} 80 | %begin equation 81 | \newcommand{\beqn}{\begin{equation}} 82 | %end equation 83 | \newcommand{\eeqn}{\end{equation}} 84 | 85 | %\newcommand{\eqref}[1]{Eq.~\ref{#1}} 86 | 87 | \newcommand{\KB}{\mbox{\it KB\/}} 88 | \newcommand{\infers}{\vdash} 89 | \newcommand{\sat}{\models} 90 | \newcommand{\bbox}{\vrule height7pt width4pt depth1pt} 91 | 92 | \newcommand{\act}[1]{\stackrel{{#1}}{\rightarrow}} 93 | \newcommand{\at}[1]{^{(#1)}} 94 | 95 | \newcommand{\argmax}{{\rm argmax}} 96 | \newcommand{\V}{{\cal V}} 97 | \newcommand{\C}{{\cal C}} 98 | \newcommand{\calL}{{\cal L}} 99 | 100 | \newcommand{\rimp}{\Rightarrow} 101 | 
\newcommand{\dimp}{\Leftrightarrow} 102 | 103 | \newcommand{\nf}{\bar{f}} 104 | \newcommand{\ns}{\bar{s}} 105 | \newcommand{\na}{\bar{a}} 106 | \newcommand{\nh}{\bar{h}} 107 | \newcommand{\nr}{\bar{r}} 108 | 109 | 110 | \newcommand{\bX}{\mbox{\boldmath $X$}} 111 | \newcommand{\bY}{\mbox{\boldmath $Y$}} 112 | \newcommand{\bZ}{\mbox{\boldmath $Z$}} 113 | \newcommand{\bU}{\mbox{\boldmath $U$}} 114 | \newcommand{\bE}{\mbox{\boldmath $E$}} 115 | \newcommand{\bx}{\mbox{\boldmath $x$}} 116 | \newcommand{\be}{\mbox{\boldmath $e$}} 117 | \newcommand{\by}{\mbox{\boldmath $y$}} 118 | \newcommand{\bz}{\mbox{\boldmath $z$}} 119 | \newcommand{\bu}{\mbox{\boldmath $u$}} 120 | \newcommand{\bd}{\mbox{\boldmath $d$}} 121 | \newcommand{\smbx}{\mbox{\boldmath $\scriptstyle x$}} 122 | \newcommand{\smbd}{\mbox{\boldmath $\scriptstyle d$}} 123 | \newcommand{\smby}{\mbox{\boldmath $\scriptstyle y$}} 124 | \newcommand{\smbe}{\mbox{\boldmath $\scriptstyle e$}} 125 | 126 | \newcommand{\Parents}{\mbox{\it Parents\/}} 127 | \newcommand{\B}{{\cal B}} 128 | 129 | \newcommand{\word}[1]{\mbox{\it #1\/}} 130 | \newcommand{\Action}{\word{Action}} 131 | \newcommand{\Proposition}{\word{Proposition}} 132 | \newcommand{\true}{\word{true}} 133 | \newcommand{\false}{\word{false}} 134 | \newcommand{\Pre}{\word{Pre}} 135 | \newcommand{\Add}{\word{Add}} 136 | \newcommand{\Del}{\word{Del}} 137 | \newcommand{\Result}{\word{Result}} 138 | \newcommand{\Regress}{\word{Regress}} 139 | \newcommand{\Maintain}{\word{Maintain}} 140 | 141 | \newcommand{\bor}{\bigvee} 142 | \newcommand{\invert}[1]{{#1}^{-1}} 143 | 144 | \newcommand{\commentout}[1]{} 145 | 146 | \newcommand{\bmu}{\mbox{\boldmath $\mu$}} 147 | \newcommand{\btheta}{\mbox{\boldmath $\theta$}} 148 | \newcommand{\IR}{\mbox{$I\!\!R$}} 149 | 150 | \newcommand{\tval}[1]{{#1}^{1}} 151 | \newcommand{\fval}[1]{{#1}^{0}} 152 | 153 | \newcommand{\tr}{{\rm tr}} 154 | \newcommand{\vecy}{{\vec{y}}} 155 | \renewcommand{\Re}{{\mathbb R}} 156 | 157 | \def\twofigbox#1#2{% 158 | \noindent\begin{minipage}{\textwidth}% 159 | \epsfxsize=0.35\maxfigwidth 160 | \noindent \epsffile{#1}\hfill 161 | \epsfxsize=0.35\maxfigwidth 162 | \epsffile{#2}\\ 163 | \makebox[0.35\textwidth]{(a)}\hfill\makebox[0.35\textwidth]{(b)}% 164 | \end{minipage}} 165 | 166 | \def\twofigboxcd#1#2{% 167 | \noindent\begin{minipage}{\textwidth}% 168 | \epsfxsize=0.35\maxfigwidth 169 | \noindent \epsffile{#1}\hfill 170 | \epsfxsize=0.35\maxfigwidth 171 | \epsffile{#2}\\ 172 | \makebox[0.35\textwidth]{(c)}\hfill\makebox[0.35\textwidth]{(d)}% 173 | \end{minipage}} 174 | 175 | \def\twofigboxnolabel#1#2{% 176 | \begin{figure}[h] 177 | \centering 178 | \begin{minipage}{.5\textwidth} 179 | \centering 180 | \includegraphics[width=0.75\linewidth]{#1} 181 | \end{minipage}% 182 | \begin{minipage}{.5\textwidth} 183 | \centering 184 | \includegraphics[width=0.75\linewidth]{#2} 185 | \end{minipage} 186 | \end{figure} 187 | } 188 | 189 | \def\threefigbox#1#2#3{% 190 | \noindent\begin{minipage}{\textwidth}% 191 | \epsfxsize=0.33\maxfigwidth 192 | \noindent \epsffile{#1}\hfill 193 | \epsfxsize=0.33\maxfigwidth 194 | \noindent \epsffile{#2}\hfill 195 | \epsfxsize=0.33\maxfigwidth 196 | \epsffile{#3}\\ 197 | \makebox[0.31\textwidth]{{\scriptsize (a)}}\hfill% 198 | \makebox[0.31\textwidth]{{\scriptsize (b)}}\hfill 199 | \makebox[0.31\textwidth]{{\scriptsize (c)}}% 200 | \smallskip 201 | \end{minipage}} 202 | 203 | 204 | \def\threefigbox#1#2#3{% 205 | \begin{figure}[H] 206 | \centering 207 | \begin{subfigure}{.33\textwidth} 208 | \centering 209 | 
\includegraphics[width=0.75\linewidth]{#1} 210 | \caption{} 211 | \end{subfigure}% 212 | \begin{subfigure}{.33\textwidth} 213 | \centering 214 | \includegraphics[width=0.75\linewidth]{#2} 215 | \caption{} 216 | \end{subfigure}% 217 | \begin{subfigure}{.33\textwidth} 218 | \centering 219 | \includegraphics[width=0.75\linewidth]{#3} 220 | \caption{} 221 | \end{subfigure} 222 | \end{figure} 223 | } 224 | 225 | 226 | \def\threefigboxnolabel#1#2#3{% 227 | \centering 228 | \begin{subfigure}{.33\textwidth} 229 | \centering 230 | \includegraphics[width=0.75\linewidth]{#1} 231 | \end{subfigure}% 232 | \begin{subfigure}{.33\textwidth} 233 | \centering 234 | \includegraphics[width=0.75\linewidth]{#2} 235 | \end{subfigure}% 236 | \begin{subfigure}{.33\textwidth} 237 | \centering 238 | \includegraphics[width=0.75\linewidth]{#3} 239 | \end{subfigure} 240 | \end{figure} 241 | } 242 | 243 | \newlength{\maxfigwidth} 244 | \setlength{\maxfigwidth}{\textwidth} 245 | %\def\captionsize {\footnotesize} 246 | \def\captionsize {} 247 | 248 | \newcommand{\xsi}{{x^{(i)}}} 249 | \newcommand{\ysi}{{y^{(i)}}} 250 | \newcommand{\wsi}{{w^{(i)}}} 251 | \newcommand{\esi}{{\epsilon^{(i)}}} 252 | \newcommand{\calN}{{\cal N}} 253 | \newcommand{\calX}{{\cal X}} 254 | \newcommand{\calY}{{\cal Y}} 255 | \newcommand{\ytil}{{\tilde{y}}} 256 | 257 | \newcommand{\beas}{\begin{eqnarray*}} 258 | \newcommand{\eeas}{\end{eqnarray*}} 259 | 260 | \newcommand{\Ber}{{\rm Bernoulli}} 261 | \newcommand{\Bernoulli}{{\rm Bernoulli}} 262 | \newcommand{\E}{{\rm E}} 263 | 264 | \setlength{\parindent}{0pt} 265 | \setlength{\parskip}{4pt} 266 | -------------------------------------------------------------------------------- /Conda Setup/main.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | %\documentstyle[11pt,handout,psfig]{article} 3 | 4 | \usepackage{fullpage,amssymb,amsmath,tikz,forest,float,subcaption,braket} 5 | \usetikzlibrary{arrows.meta} 6 | \usepackage{graphicx} 7 | \usepackage{hyperref} 8 | \hypersetup{ 9 | colorlinks=true, 10 | linkcolor=blue, 11 | filecolor=magenta, 12 | urlcolor=cyan, 13 | } 14 | 15 | \forestset{ 16 | default preamble={ 17 | for tree={ 18 | font=\tiny, 19 | base=bottom, 20 | child anchor=north, 21 | align=center, 22 | s sep+=1.3cm, 23 | straight edge/.style={ 24 | edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 25 | (!u.parent anchor) -- (.child anchor);} 26 | }, 27 | if n children={0} 28 | {tier=word, draw, thick, rectangle} 29 | { 30 | if n children={1} 31 | {draw, thick, rectangle, parent anchor=south} 32 | {draw, diamond, thick, aspect=2} 33 | }, 34 | if n=1{% 35 | if n'=1 36 | {edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 37 | (!u.parent anchor) -| (.child anchor);}} 38 | {edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 39 | (!u.parent anchor) -| (.child anchor) node[pos=.2, above] {Y};}} 40 | }{ 41 | edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 42 | (!u.parent anchor) -| (.child anchor) node[pos=.2, above] {N};} 43 | } 44 | } 45 | } 46 | } 47 | \usepackage{xcolor} 48 | \usepackage{listings} 49 | \lstset{basicstyle=\ttfamily, 50 | showstringspaces=false, 51 | commentstyle=\color{red}, 52 | keywordstyle=\color{blue} 53 | } 54 | 55 | \usepackage[12pt]{extsizes} 56 | \usepackage{gensymb} 57 | \usepackage{graphicx} 58 | 59 | 60 | 61 | 62 | %These give really tight margins: 63 | %\setlength{\topmargin}{-0.3in} 64 | %\setlength{\textheight}{8.10in} 65 | %\setlength{\textwidth}{5.8in} 66 | 
%\setlength{\baselineskip}{0.1875in} 67 | %\addtolength{\leftmargin}{-2.775in} 68 | %\setlength{\footskip}{0.45in} 69 | %\setlength{\oddsidemargin}{0.5in} 70 | %\setlength{\evensidemargin}{0.5in} 71 | %%\setlength{\headsep}{0pt} 72 | %%\setlength{\headheight}{0pt} 73 | 74 | %\setlength{\topmargin}{-0.5in} 75 | \setlength{\textheight}{8in} 76 | %\setlength{\textwidth}{5.0in} 77 | %\setlength{\baselineskip}{0.1875in} 78 | %\addtolength{\leftmargin}{-2.775in} 79 | %\setlength{\footskip}{0.45in} 80 | %\setlength{\oddsidemargin}{0.5in} 81 | %\setlength{\evensidemargin}{0.5in} 82 | %%\setlength{\headsep}{0pt} 83 | %%\setlength{\headheight}{0pt} 84 | 85 | 86 | \markright{} 87 | \pagestyle{myheadings} 88 | 89 | \input{macros-Fall2018} 90 | \begin{document} 91 | \title{Conda Setup for XCS Courses} 92 | %\author{} 93 | \date{} 94 | \maketitle 95 | 96 | 97 | \part*{What is Conda?} 98 | 99 | Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language. 100 | 101 | Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment. 102 | 103 | \section{Installation} 104 | 105 | The fastest way to obtain conda is to install Miniconda, a mini version of Anaconda that includes only conda and its dependencies. If you want to take advantage of a less common CPU architecture (aarch64, including Apple Silicon, or ppc64le), the best option is to install Miniforge. If you prefer to have conda plus over 7,500 open-source packages, install Anaconda. 106 | 107 | Here are the system requirements for installation: 108 | \begin{itemize} 109 | \item 32- or 64-bit computer 110 | \item For Miniconda/Miniforge - 400 MB of disk space 111 | \item For Anaconda - minimum 3 GB of disk space to download and install 112 | \item Windows, macOS, or Linux 113 | \end{itemize} 114 | 115 | For platform-specific download instructions, \href{https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html}{please refer to this link}. 116 | 117 | 118 | \textbf{For Windows}: After installing Miniconda, open the Windows search menu and locate the Anaconda Command Prompt to run the commands below. 119 | \begin{center} 120 | \includegraphics[scale=0.5]{acp.png} 121 | \end{center} 122 | 123 | 124 | To verify the Conda installation, please run: 125 | \begin{lstlisting}[language=bash] 126 | conda info 127 | \end{lstlisting} 128 | If you have previously installed Miniconda, be sure to update it before moving on to Section 2: 129 | \begin{lstlisting}[language=bash] 130 | # Update to the latest version of Conda 131 | conda update -n base conda 132 | # Update all packages to the latest version of Anaconda 133 | conda update anaconda 134 | \end{lstlisting} 135 | 136 | \section{Creating and activating environment from YAML file} 137 | 138 | After you have installed Conda, we have provided a file called \textbf{environment.yml}. Note that the \textbf{environment.yml} will be provided on the first day of the course, with the release of the first assignment. This YAML file creates a virtual environment with a specified name (used for activation) and installs the dependencies needed for the coding assignments. Here is what an example environment.yml file looks like. 139 | \begin{center} 140 | \includegraphics[scale=0.75]{conda-update.png} 141 | \end{center} 142 | \textbf{Note: }Although there are multiple ways of completing the homework assignments, the libraries we provide in the YAML file should be the only ones you use in the coding assignments. 143 | \textbf{For the remainder of this setup we will be using \textit{your-env-name} to specify the name of your environment (e.g., XCS229, XCS221, etc.). You can find the name of your virtual environment by looking for the name parameter within the environment.yml file.} 144 | 145 | To create your environment from the YAML file, please run: 146 | \begin{lstlisting}[language=bash] 147 | conda env create --file environment.yml 148 | \end{lstlisting} 149 | Now, to verify that the environment was installed correctly, please run the command below; you should see an environment name that matches the name parameter in the environment.yml file: 150 | \begin{lstlisting}[language=bash] 151 | conda env list 152 | \end{lstlisting} 153 | 154 | Next, you will activate the created environment. To do so, you will need to call the following command: 155 | \begin{lstlisting}[language=bash] 156 | conda activate your-env-name 157 | \end{lstlisting} 158 | 159 | Now in this environment, you will have all the relevant dependencies needed to complete the assignments. If you want to deactivate your environment, please run: 160 | \begin{lstlisting}[language=bash] 161 | conda deactivate 162 | \end{lstlisting} 163 | 164 | \section{Converting Environment to Jupyter Notebook} 165 | When completing the assignments, it may be helpful to use a Jupyter Notebook as a scratchpad to help you debug or formulate potential solutions. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. The first step to creating your Jupyter Notebook for the assignments is to activate your environment as follows: 166 | \begin{lstlisting}[language=bash] 167 | conda activate your-env-name 168 | \end{lstlisting} 169 | 170 | Next, in the active environment, type: 171 | \begin{lstlisting}[language=bash] 172 | # Install ipykernel 173 | conda install -c anaconda ipykernel 174 | # Register the environment as a new kernel 175 | ipython kernel install --user --name=your-env-name 176 | \end{lstlisting} 177 | 178 | Last but not least, be sure to run the following command to launch Jupyter Notebook from the active environment: 179 | \begin{lstlisting}[language=bash] 180 | jupyter notebook 181 | \end{lstlisting} 182 | 183 | That command will take you to your Jupyter Notebook directory. When you hover over the \textbf{New} button in the top right-hand corner, you should see your environment listed as an option. 184 | \begin{center} 185 | \includegraphics[]{p2.png} 186 | \end{center} 187 | Select it, and it will create an empty Python notebook where you can copy and paste assignment code or use it to formulate your thoughts.
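
As an optional sanity check (a small sketch, not part of the required setup), you can confirm from a cell of the new notebook that the kernel is running the interpreter from your Conda environment and that a package from your environment.yml imports correctly. The package name below (numpy) is only an illustration and may differ from your course's dependencies:
\begin{lstlisting}[language=Python]
# Optional sanity check -- run this in a cell of the new notebook.
# "numpy" is only an example; substitute a package listed in your
# course's environment.yml.
import sys
print(sys.executable)  # path should point inside your conda environment

import importlib.util
spec = importlib.util.find_spec("numpy")
print("numpy is", "available" if spec is not None else "NOT available")
\end{lstlisting}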
188 | 189 | In the event you would like to remove your environment from Jupyter, simply run the following command: 190 | \begin{lstlisting}[language=bash] 191 | jupyter kernelspec uninstall your-env-name 192 | \end{lstlisting} 193 | 194 | You are now equipped with all the tools to complete the coding assignments. For a list of the most useful Conda functions, \href{https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf}{please refer to this document}. 195 | 196 | \section{Updating Existing Conda Environment} 197 | If you plan on reusing an existing Conda environment from a previous SCPD-AI course, we strongly recommend that you either update it when the course starts or replace it with the new Conda environment as described above. Below are the steps to update your existing Conda environment in preparation for this course. 198 | 199 | \begin{lstlisting}[language=bash] 200 | conda activate your-env-name 201 | conda env update --file environment.yml 202 | \end{lstlisting} 203 | 204 | \end{document} 205 | -------------------------------------------------------------------------------- /Probability Review/Report.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | \usepackage{nips07submit_e,times} 3 | \usepackage{graphicx} 4 | %\documentstyle[nips07submit_09,times]{article} 5 | 6 | \title{Review of Probability Theory} 7 | 8 | 9 | \author{ 10 | Arian Maleki \\ 11 | Department of Electrical Engineering\\ 12 | Stanford University\\ 13 | Stanford, CA 94305 \\ 14 | \texttt{arianm@stanford.edu} \\ 15 | } 16 | 17 | % The \author macro works with any number of authors. There are two commands 18 | % used to separate the names and addresses of multiple authors: \And and \AND. 19 | % 20 | % Using \And between authors leaves it to \LaTeX{} to determine where to break 21 | % the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{} 22 | % puts 3 of 4 authors names on the first line, and the last on the second 23 | % line, try using \AND instead of \And before the third author name. 24 | \usepackage{amssymb} 25 | \usepackage{graphicx} 26 | \usepackage{amsmath} 27 | \usepackage{color} 28 | \usepackage{dsfont} 29 | \usepackage{amsthm} 30 | 31 | \DeclareMathOperator*{\argmax}{arg\,max} 32 | 33 | \newtheorem{thm}{Theorem}[section] 34 | \newtheorem*{definition}{Definition} 35 | \newtheorem{cor}[thm]{Corollary} 36 | \newtheorem{lem}[thm]{Lemma} 37 | \begin{document} 38 | 39 | \makeanontitle 40 | 41 | Probability theory is the study of uncertainty. Through this class, we will be relying on concepts from probability 42 | theory for deriving machine learning algorithms. These notes attempt to cover the basics of probability theory at a 43 | level appropriate for CS 229. The mathematical theory of probability is very sophisticated, and 44 | delves into a branch of analysis known as \textbf{measure theory}. In 45 | these notes, we provide a basic treatment of probability that does 46 | not address these finer details. 47 | 48 | \section{Elements of probability} 49 | 50 | In order to define a probability on a set we need a few basic elements, 51 | \begin{itemize} 52 | \item \textbf{Sample space} $\Omega$: The set of all the outcomes of a random 53 | experiment. Here, each outcome $\omega \in \Omega$ can be thought 54 | of as a complete description of the state of the real world at the 55 | end of the experiment.
56 | \item \textbf{Set of events} (or \textbf{event space}) $\mathcal{F}$: 57 | A set whose 58 | elements $A \in \mathcal{F}$ (called \textbf{events}) are subsets of $\Omega$ 59 | (i.e., $A \subseteq \Omega$ is a collection of possible outcomes of an 60 | experiment).\footnote{ 61 | $\mathcal{F}$ should satisfy three properties: 62 | (1) $\emptyset \in \mathcal{F}$; 63 | (2) $A \in \mathcal{F} \Longrightarrow \Omega \setminus A \in \mathcal{F}$; and 64 | (3) $A_1, A_2 , \ldots \in \mathcal{F} \Longrightarrow \cup_i A_i \in \mathcal{F}$. 65 | } 66 | \item \textbf{Probability measure}: A function $P : \mathcal{F} \rightarrow \mathbb{R}$ that satisfies the following properties, 67 | \begin{description} 68 | \item[-] $P(A) \geq 0$, for all $A \in \mathcal{F}$ 69 | \item[-] $P(\Omega)=1$ 70 | \item[-] If $A_1,A_2, \ldots$ are disjoint events (i.e., $A_i \cap A_j= \emptyset$ whenever $i \neq j$), then 71 | \begin{equation*} 72 | P(\cup_{i} A_i)= \sum_i P(A_i) 73 | \end{equation*} 74 | \end{description} 75 | \end{itemize} 76 | These three properties are called the \textbf{Axioms of Probability}. 77 | 78 | \textbf{Example}: Consider the experiment of tossing a six-sided die. The 79 | sample space is $\Omega= \{1,2,3,4,5,6\}$. We can define different 80 | event spaces on this sample space. For example, the simplest event 81 | space is the trivial event space $\mathcal{F}=\{\emptyset, 82 | \Omega\}$. Another event space is the set of all subsets of 83 | $\Omega$. For the first event space, the unique probability measure 84 | satisfying the requirements above is 85 | given by $P(\emptyset)=0, P(\Omega)=1$. For the second event space, 86 | one valid probability measure is to assign the probability of each set in the event space to be $\frac{i}{6}$ 87 | where $i$ is the number of elements of that set; for example, 88 | $P(\{1,2,3,4\})= \frac{4}{6}$ and $P(\{1,2,3\})= \frac{3}{6}$. \\ 89 | 90 | \textbf{Properties}: 91 | \begin{description} 92 | \item[-] {If $ {A \subseteq B \Longrightarrow P(A) \leq P(B)}$.} 93 | \item[-] ${P(A \cap B) \leq \min(P(A), P(B))}$. 94 | \item[-] (Union Bound) ${P(A \cup B) \leq P(A)+P(B)}$. 95 | \item[-] ${P(\Omega \setminus A)=1-P(A)}$. 96 | \item[-] {(Law of Total Probability) If $A_1,\ldots,A_k$ are a set of disjoint events such that $\cup_{i=1}^k A_i = \Omega$, then $\sum_{i=1}^k P(A_i) = 1$.} 97 | \end{description} 98 | 99 | \subsection{Conditional probability and independence} 100 | 101 | Let $B$ be an event with non-zero probability. The conditional probability of any event $A$ given $B$ is defined as, 102 | \begin{equation*} 103 | P(A | B)\triangleq \frac{P(A \cap B)}{P(B)} 104 | \end{equation*} 105 | In other words, $P(A| B)$ is the probability measure of the event $A$ 106 | after observing the occurrence of event $B$. Two events are called 107 | independent if and only if $P(A \cap B)= P(A)P(B)$ (or equivalently, $P(A|B)=P(A)$). Therefore, 108 | independence is equivalent to saying that observing $B$ does not have 109 | any effect on the probability of $A$. 110 | 111 | \section{Random variables} 112 | 113 | Consider an experiment in which we flip 10 coins, and we want to know the 114 | number of coins that come up heads. Here, the elements of the sample 115 | space $\Omega$ are 10-length sequences of heads and tails. For example, 116 | we might have $\omega_0 = \langle H, H, T, H, T, H, H, T, T, T \rangle \in 117 | \Omega$. However, in practice, we usually do not care about the 118 | probability of obtaining any particular sequence of heads and tails. 119 | Instead we usually care about real-valued functions of outcomes, such as the 120 | number of heads that appear among our 10 tosses, or the 121 | length of the longest run of tails. These functions, under some 122 | technical conditions, are known as \textbf{random variables}. 123 | 124 | More formally, a random variable $X$ is a function $X:\Omega 125 | \longrightarrow \mathbb{R}$.\footnote{ Technically speaking, not every 126 | function is acceptable as a random variable. From a 127 | measure-theoretic perspective, random variables must be 128 | Borel-measurable functions. Intuitively, this restriction ensures 129 | that given a random variable and its underlying outcome space, one 130 | can implicitly define each of the events of the event space as 131 | being sets of outcomes $\omega \in \Omega$ for which $X(\omega)$ 132 | satisfies some property (e.g., the event $\{\omega : X(\omega) \ge 3 \}$). 133 | } Typically, we will denote random 134 | variables using upper case letters $X(\omega)$ or more simply $X$ 135 | (where the dependence on the random outcome $\omega$ is implied). We 136 | will denote the value that a random variable may take on using lower 137 | case letters $x$. 138 | 139 | \textbf{Example}: In our experiment above, suppose that $X(\omega)$ is 140 | the number of heads which occur in the sequence of tosses $\omega$. 141 | Given that only 10 coins are tossed, $X(\omega)$ can take only a 142 | finite number of values, so it is known as a \textbf{discrete random 143 | variable}. Here, the probability of the set associated with a random 144 | variable $X$ taking on some specific value $k$ is 145 | \begin{eqnarray*} 146 | P(X = k) := P(\{\omega : X(\omega) = k\}). 147 | \end{eqnarray*} 148 | 149 | \textbf{Example}: Suppose that $X(\omega)$ is a random variable 150 | indicating the amount of time it takes for a radioactive particle to 151 | decay. In this case, $X(\omega)$ takes on an infinite number of 152 | possible values, so it is called a \textbf{continuous random 153 | variable}. We denote the probability that $X$ takes on a value between 154 | two real constants $a$ and $b$ (where $a < b$) as 155 | \begin{eqnarray*} 156 | P(a \le X \le b) := P(\{\omega : a \le X(\omega) \le b\}). 157 | \end{eqnarray*} 158 | 159 | \subsection{Cumulative distribution functions} 160 | 161 | In order to specify the probability measures used when dealing with 162 | random variables, it is often convenient to specify alternative 163 | functions (CDFs, PDFs, and PMFs) from which the probability measure 164 | governing an experiment immediately follows. In this section and the 165 | next two sections, we describe each of these types of functions in 166 | turn. 167 | 168 | A \textbf{cumulative distribution function (CDF)} is a function $F_X:\mathbb{R} \rightarrow [0,1]$ which specifies a probability measure as, 169 | \begin{equation} 170 | F_X(x) \triangleq P(X \leq x). 171 | \end{equation} 172 | By using this function one can calculate the probability of any event in $\mathcal{F}$.\footnote{This is a remarkable fact and is actually a theorem that is proved in more advanced courses.} %How? First of all they define a $\pi$-system. $\mathcal{P}$ is a $\pi$-system, if $I,J \in \mathcal{P} \Longrightarrow I \cap J \in \mathcal{P}$. Then there is a theorem that says that if two probability measures are equal on a $\pi$-system then they are equal to each other on the $\sigma$-algebra generated by that $\pi$-system.
Now the set of [-\infty, a] generates the Borel sigma algebra and is a $\pi$-system and therefore defines the probability uniquely!!! 173 | Figure \ref{fig1} shows a sample CDF function. 174 | 175 | \textbf{Properties}: 176 | \begin{description} 177 | \item[-] $0 \leq F_X(x) \leq 1$. 178 | \item[-] $\lim_{x \rightarrow -\infty} F_X(x)=0$. 179 | \item[-] $\lim_{x \rightarrow \infty} F_X(x)=1$. 180 | \item[-] $x \leq y \Longrightarrow F_X(x)\leq F_X(y)$. 181 | \end{description} 182 | 183 | \begin{figure} 184 | \begin{center} 185 | \includegraphics[width=7cm]{fig1}\\ 186 | \caption{A cumulative distribution function (CDF).}\label{fig1} 187 | \end{center} 188 | \end{figure} 189 | 190 | \subsection{Probability mass functions} 191 | 192 | When a random variable $X$ takes on a finite set of possible values (i.e., $X$ is a discrete random variable), a simpler way to represent the probability measure associated with 193 | a random variable is to directly specify the probability of each value that the random variable can assume. In particular, a \emph{probability 194 | mass function (PMF)} is a function $p_X : \Omega \rightarrow \mathbb{R}$ such that 195 | \begin{align*} 196 | p_X(x)\triangleq P(X=x). 197 | \end{align*} 198 | In the case of discrete random variable, we use the notation $Val(X)$ for the set of possible values that the random variable $X$ may assume. For example, if $X(\omega)$ is a random 199 | variable indicating the number of heads out of ten tosses of coin, then $Val(X) = \{0,1,2,\ldots,10\}$. 200 | 201 | \textbf{Properties}: 202 | \begin{description} 203 | \item [-] $0 \leq p_X(x) \leq 1$. 204 | \item [-] $\sum_{x \in Val(X)} p_X(x)=1$. 205 | \item [-] $\sum_{x \in A} p_X(x) = P(X \in A)$. 206 | \end{description} 207 | 208 | \subsection{Probability density functions} 209 | 210 | For some continuous random variables, the cumulative distribution function $F_X(x)$ is differentiable everywhere. In these cases, we define the \textbf{Probability Density Function} or \textbf{PDF} 211 | as the derivative of the CDF, i.e., 212 | \begin{equation} 213 | f_X(x)\triangleq \frac{dF_X(x)}{dx}. 214 | \end{equation} 215 | Note here, that the PDF for a continuous random variable may not always exist (i.e., if $F_X(x)$ is not differentiable everywhere). 216 | 217 | According to the properties of differentiation, for very small $\Delta x$, 218 | \begin{equation} 219 | P(x \leq X \leq x+ \Delta x) \approx f_X(x)\Delta x. 220 | \end{equation} 221 | Both CDFs and PDFs (when they exist!) can be used for 222 | calculating the probabilities of different events. But it should be 223 | emphasized that the value of PDF at any given point $x$ is not the probability of 224 | that event, i.e., $f_X(x) \neq P(X = x)$. For example, $f_X(x)$ can take on 225 | values larger than one (but the integral of $f_X(x)$ over any subset of $\mathbb{R}$ will 226 | be at most one). 227 | 228 | \textbf{Properties}: 229 | \begin{description} 230 | \item [-] $\small{f_X(x) \geq 0}$ . 231 | \item [-] $\small {\int_{-\infty}^\infty f_X(x) =1}$. 232 | \item [-] $\int_{x \in A} f_X(x) dx = P(X \in A)$. 233 | \end{description} 234 | 235 | \subsection{Expectation} 236 | Suppose that $X$ is a discrete random variable with PMF $p_X(x)$ and $g: \mathbb{R} \longrightarrow \mathbb{R}$ is an arbitrary function. In this case, 237 | $g(X)$ can be considered a random variable, and we define the \textbf{expectation} or \textbf{expected value} of $g(X)$ as 238 | \begin{align*} 239 | E[g(X)] \triangleq \sum_{x \in Val(X)} g(x) p_X(x). 
240 | \end{align*} 241 | If $X$ is a continuous random variable with PDF $f_X(x)$, then the expected value of $g(X)$ is defined as, 242 | \begin{align*} 243 | E[g(X)]\triangleq \int_{-\infty}^{\infty} g(x) f_X(x) dx. 244 | \end{align*} 245 | Intuitively, the expectation of $g(X)$ can be thought of as a ``weighted average'' of 246 | the values that $g(x)$ can take on for different values of $x$, where the weights are 247 | given by $p_X(x)$ or $f_X(x)$. 248 | As a special case of the above, note that the expectation, $E[X]$, of a random variable itself 249 | is found by letting $g(x) = x$; this is also known as the \textbf{mean} of the random variable $X$. 250 | 251 | \textbf{Properties}: 252 | \begin{description} 253 | \item[-] $E[a] = a$ for any constant $a \in \mathbb{R}$. 254 | \item[-] $E[af(X)] = aE[f(X)]$ for any constant $a \in \mathbb{R}$. 255 | \item[-] (Linearity of Expectation) $E[f(X) + g(X)] = E[f(X)] + E[g(X)]$. 256 | \item[-] For a discrete random variable $X$, $E[1\{X = k \}] = P(X = k)$. 257 | \end{description} 258 | 259 | \subsection{Variance} 260 | 261 | The \textbf{variance} of a random variable $X$ is a measure of how concentrated the distribution of a random variable $X$ 262 | is around its mean. Formally, the variance of a random variable $X$ is defined as 263 | \begin{align*} 264 | Var[X] \triangleq E[(X-E[X])^2] 265 | \end{align*} 266 | Using the properties in the previous section, we can derive an alternate expression for the variance: 267 | \begin{eqnarray*} 268 | E[(X - E[X])^2] &=& E[X^2 - 2E[X] X + E[X]^2] \\ &=& E[X^2] - 2 E[X] E[X] + E[X]^2 \\ &=& E[X^2] - E[X]^2, 269 | \end{eqnarray*} 270 | where the second equality follows from linearity of expectations and the fact that $E[X]$ is actually a constant with 271 | respect to the outer expectation. 272 | 273 | \textbf{Properties}: 274 | \begin{description} 275 | \item[-] $Var[a] = 0$ for any constant $a \in \mathbb{R}$. 276 | \item[-] $Var[a f(X)] = a^2 Var[f(X)]$ for any constant $a \in \mathbb{R}$. 277 | \end{description} 278 | 279 | \textbf{Example}: Calculate the mean and the variance of the uniform random variable $X$ with PDF $f_X(x)=1, \ \ \forall x \in [0,1]$, 0 elsewhere. 280 | \begin{align*} 281 | E[X]= \int_{-\infty}^{\infty} xf_X(x) dx= \int_{0}^{1}xdx=\frac{1}{2}. \nonumber 282 | \end{align*} 283 | \begin{align*} 284 | E[X^2]=\int_{-\infty}^{\infty} x^2f_X(x) dx= \int_{0}^{1}x^2dx=\frac{1}{3}. \nonumber 285 | \end{align*} 286 | \begin{align*} 287 | Var[X]=E[X^2]-E[X]^2= \frac{1}{3}- \frac{1}{4}=\frac{1}{12}. \nonumber 288 | \end{align*} 289 | 290 | \textbf{Example:} Suppose that $g(x)= 1\{x \in A\}$ for some subset $A \subseteq \Omega$. 291 | What is $E[g(X)]$? 292 | 293 | Discrete case: 294 | \begin{align*} 295 | E[g(X)]= \sum_{x \in Val(X)} 1\{x \in A\} p_X(x)= \sum_{x \in A} p_X(x)= P(X \in A). \nonumber 296 | \end{align*} 297 | Continuous case: 298 | \begin{align*} 299 | E[g(X)]= \int_{-\infty}^{\infty} 1\{x \in A\} f_X(x)dx= \int_{x \in A} f_X(x)dx= P(X \in A). \nonumber 300 | \end{align*} 301 | 302 | \subsection{Some common random variables} 303 | 304 | \textbf{Discrete random variables} 305 | \begin{itemize} 306 | \item $X \sim Bernoulli(p)$ (where $0 \le p \le 1$): one if a coin with heads probability $p$ comes up heads, zero otherwise. 307 | \begin{eqnarray*} 308 | p(x) = \begin{cases} 309 | p & \text{if $x = 1$} \\ 310 | 1-p & \text{if $x = 0$} 311 | \end{cases} 312 | \end{eqnarray*} 313 | \item $X \sim Binomial(n, p)$ (where $0 \le p \le 1$): the number of heads in $n$ independent flips of a coin with heads probability $p$. 314 | \begin{eqnarray*} 315 | p(x) = {n \choose x} p^x(1-p)^{n-x} 316 | \end{eqnarray*} 317 | \item $X \sim Geometric(p)$ (where $p > 0$): the number of flips of a coin with heads probability $p$ until the first heads. 318 | \begin{eqnarray*} 319 | p(x) = p(1 - p)^{x-1} 320 | \end{eqnarray*} 321 | \item $X \sim Poisson(\lambda)$ (where $\lambda > 0$): a probability distribution over the nonnegative integers used for modeling the 322 | frequency of rare events. 323 | \begin{eqnarray*} 324 | p(x) = e^{-\lambda} \frac{\lambda^x}{x!} 325 | \end{eqnarray*} 326 | \end{itemize} 327 | 328 | \textbf{Continuous random variables} 329 | \begin{itemize} 330 | \item $X \sim Uniform(a,b)$ (where $a < b$): equal probability density to every value between $a$ and $b$ on the real line. 331 | \begin{eqnarray*} 332 | f(x) = \begin{cases} 333 | \frac{1}{b - a} & \text{if $a \le x \le b$} \\ 334 | 0 & \text{otherwise} 335 | \end{cases} 336 | \end{eqnarray*} 337 | \item $X \sim Exponential(\lambda)$ (where $\lambda > 0$): decaying probability density over the nonnegative reals. 338 | \begin{eqnarray*} 339 | f(x) = \begin{cases} 340 | \lambda e^{-\lambda x} & \text{if $x \ge 0$} \\ 341 | 0 & \text{otherwise} 342 | \end{cases} 343 | \end{eqnarray*} 344 | \item $X \sim Normal(\mu, \sigma^2)$: also known as the Gaussian distribution 345 | \begin{eqnarray*} 346 | f(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{1}{2\sigma^2} (x - \mu)^2} 347 | \end{eqnarray*} 348 | \end{itemize} 349 | The shapes of the PDFs and CDFs of some of these random variables are shown in Figure \ref{fig2}. 350 | 351 | The following table summarizes some of the properties of these distributions. 352 | \begin{center} 353 | \begin{tabular}{|l|l|l|l|} 354 | \hline 355 | % after \\: \hline or \cline{col1-col2} \cline{col3-col4} ... 356 | Distribution & PDF or PMF & Mean & Variance \\ 357 | \hline 358 | $Bernoulli(p)$ & $\left\{ \begin{array}{ll} 359 | p, & \text{if $x=1$} \\ 360 | 1-p, & \text{if $x=0$.} 361 | \end{array} 362 | \right.$ & $p$ & $p(1-p)$\\ 363 | \hline 364 | $Binomial(n,p)$ & ${n \choose k} p^k(1-p)^{n-k}$ for $ 0 \leq k \leq n$ & $np$ & $np(1-p)$ \\ 365 | \hline 366 | $Geometric(p)$ & $p(1-p)^{k-1}$ \ \ for $k=1,2, \ldots$ & $\frac{1}{p}$ & $\frac{1-p}{p^2}$ \\ 367 | \hline 368 | $Poisson(\lambda)$ & $e^{-\lambda} \lambda^x / x!$ \ \ for $x=0,1,2,\ldots$ & $\lambda$ & $\lambda$ \\ 369 | \hline 370 | $Uniform(a,b)$ & $\frac{1}{b-a} \ \ \forall x \in (a,b)$ & $\frac{a+b}{2}$ & $\frac{(b-a)^2}{12}$ \\ 371 | \hline 372 | $Gaussian(\mu,\sigma^2)$ & $\frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ & $\mu$ & $\sigma^2$ \\ 373 | \hline 374 | $Exponential(\lambda)$ & $\lambda e^{-\lambda x} \ \ x \geq 0, \lambda >0$ & $\frac{1}{\lambda}$ & $ \frac{1}{\lambda ^2}$ \\ 375 | \hline 376 | \end{tabular} 377 | \end{center} 378 | 379 | 380 | \begin{figure} 381 | \begin{center} 382 | \includegraphics[width=9cm]{fig2.png}\\ 383 | \caption{PDF and CDF of a couple of random variables.}\label{fig2} 384 | \end{center} 385 | \end{figure} 386 | 387 | \section{Two random variables} 388 | 389 | Thus far, we have considered single random variables.
In many 390 | situations, however, there may be more than one quantity that we are 391 | interested in knowing during a random experiment. For instance, in an 392 | experiment where we flip a coin ten times, we may care about both $X(\omega) = 393 | \text{the number of heads that come up}$ as well as $Y(\omega) = \text{the 394 | length of the longest run of consecutive heads}$. In this section, we 395 | consider the setting of two random variables. 396 | 397 | \subsection{Joint and marginal distributions} 398 | 399 | Suppose that we have two random variables $X$ and $Y$. One way to work 400 | with these two random variables is to consider each of them 401 | separately. If we do that, we will only need $F_X(x)$ and $F_Y(y)$. 402 | But if we want to know about the values that $X$ and $Y$ assume 403 | simultaneously during outcomes of a random experiment, we require 404 | a more complicated structure known as the 405 | \textbf{joint cumulative distribution function} of $X$ and $Y$, defined by 406 | \begin{equation*} 407 | F_{XY}(x,y)=P(X \leq x, Y \leq y) 408 | \end{equation*} 409 | It can be shown that by knowing the joint 410 | cumulative distribution function, 411 | the probability of any event involving $X$ and $Y$ can be calculated. 412 | 413 | The joint CDF $F_{XY}(x,y)$ and the distribution functions 414 | $F_X(x)$ and $F_Y(y)$ of each variable separately are related by 415 | \begin{eqnarray*} 416 | F_X(x) &=& \lim_{y \rightarrow \infty} F_{XY}(x,y) \\ 417 | F_Y(y) &=& \lim_{x \rightarrow \infty} F_{XY}(x,y). 418 | \end{eqnarray*} 419 | Here, we call $F_X(x)$ and $F_Y(y)$ the \textbf{marginal cumulative distribution functions} 420 | of $F_{XY}(x,y)$. 421 | 422 | \textbf{Properties}: 423 | \begin{description} 424 | \item[-] $0 \leq F_{XY}(x,y) \leq 1$. 425 | \item[-] $\lim_{x,y \rightarrow \infty} F_{XY}(x,y)=1$. 426 | \item[-] $\lim_{x,y \rightarrow -\infty} F_{XY}(x,y)=0$. 427 | \item[-] $ F_X(x)= \lim_{y \rightarrow \infty} F_{XY}(x,y)$. 428 | \end{description} 429 | 430 | \subsection{Joint and marginal probability mass functions} 431 | 432 | If $X$ and $Y$ are discrete random variables, then the \textbf{joint probability mass function} $p_{XY} : \mathbb{R} \times \mathbb{R} \rightarrow [0,1]$ is 433 | defined by 434 | \begin{eqnarray*} 435 | p_{XY}(x,y) = P(X = x, Y = y). 436 | \end{eqnarray*} 437 | Here, $0 \le p_{XY}(x,y) \le 1$ for all $x, y$, and $\sum_{x \in Val(X)} \sum_{y \in Val(Y)} p_{XY}(x,y) = 1$. 438 | 439 | How does the joint PMF over two variables relate to the probability mass function for each variable separately? 440 | It turns out that 441 | \begin{eqnarray*} 442 | p_X(x) = \sum_y p_{XY} (x,y), 443 | \end{eqnarray*} 444 | and similarly for $p_Y(y)$. In this case, we refer to $p_X(x)$ as the 445 | \textbf{marginal probability mass function} of $X$. In statistics, the process of forming the 446 | marginal distribution with respect to one variable by summing out the 447 | other variable is often known as ``marginalization.'' 448 | 449 | \subsection{Joint and marginal probability density functions} 450 | 451 | Let $X$ and $Y$ be two continuous random variables with joint distribution function $F_{XY}$. In the case that 452 | $F_{XY}(x,y)$ is everywhere differentiable in both $x$ and $y$, we can define the \textbf{joint probability density function}, 453 | \begin{equation*} 454 | f_{XY}(x,y)= \frac{ \partial^2 F_{XY}(x,y)}{\partial x \partial y}. 455 | \end{equation*} 456 | Like in the single-dimensional case, $f_{XY}(x,y) \neq P(X = x, Y = y)$, but rather 457 | \begin{equation*} 458 | \iint_{x \in A} f_{XY} (x,y) dx dy = P((X,Y) \in A). 459 | \end{equation*} 460 | Note that the values of the probability density function $f_{XY}(x,y)$ are always nonnegative, but they may be greater than 1. Nonetheless, it must be the 461 | case that $\int_{-\infty}^\infty \int_{-\infty}^\infty f_{XY}(x,y) \, dx \, dy = 1$. 462 | 463 | Analogous to the discrete case, we define 464 | \begin{eqnarray*} 465 | f_X(x) = \int_{-\infty}^{\infty}f_{XY}(x,y)dy, 466 | \end{eqnarray*} 467 | as the \textbf{marginal probability density function} (or \textbf{marginal 468 | density}) of $X$, and similarly for $f_Y(y)$. 469 | 470 | \subsection{Conditional distributions} 471 | 472 | Conditional distributions seek to answer the question, what is the probability distribution over $Y$, when we know 473 | that $X$ must take on a certain value $x$? 474 | In the discrete case, the conditional probability mass 475 | function of $Y$ given $X$ is simply 476 | \begin{equation*} 477 | p_{Y|X}(y|x) = \frac{p_{XY}(x,y)}{p_X(x)}, 478 | \end{equation*} 479 | assuming that $p_X(x) \neq 0$. 480 | 481 | In the continuous case, the situation is technically a little more complicated 482 | because the probability that a continuous random variable $X$ takes on a specific value $x$ is equal to zero\footnote{ 483 | To get around this, a more reasonable way to calculate the 484 | conditional CDF is, 485 | \begin{equation*} 486 | F_{Y|X}(y,x)= \lim_{\Delta x \rightarrow 0} P(Y \leq y | x \leq X \leq x + \Delta x). 487 | \end{equation*} 488 | It can be easily seen that if $F(x,y)$ is differentiable in both $x,y$ then, 489 | 490 | \begin{equation*} 491 | F_{Y|X}(y,x)=\int_{-\infty}^{y} \frac{f_{X,Y}(x,\alpha)}{f_X(x)}d\alpha 492 | \end{equation*} 493 | and therefore we define the conditional PDF of $Y$ given $X=x$ in the following way, 494 | \begin{equation*} 495 | f_{Y|X}(y|x)= \frac{f_{XY}(x,y)}{f_X(x)} 496 | \end{equation*} 497 | }. Ignoring this technical point, we simply define, by analogy to the discrete case, 498 | the \emph{conditional probability density} of $Y$ given $X = x$ to be 499 | \begin{equation*} 500 | f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}, 501 | \end{equation*} 502 | provided $f_X(x) \neq 0$. 503 | 504 | \subsection{Bayes's rule} 505 | A useful formula that often arises when trying to derive an expression for the conditional probability 506 | of one variable given another is \textbf{Bayes's rule}. 507 | 508 | In the case of discrete random variables $X$ and $Y$, 509 | \begin{equation*} 510 | P_{Y|X}(y|x)=\frac{P_{XY}(x,y)}{P_X(x)}=\frac{P_{X|Y}(x|y)P_Y(y)}{\sum_{y' \in Val(Y)} P_{X|Y}(x|y')P_Y(y')}. 511 | \end{equation*} 512 | 513 | If the random variables $X$ and $Y$ are continuous, 514 | \begin{equation*} 515 | f_{Y|X}(y|x)=\frac{f_{XY}(x,y)}{f_X(x)}=\frac{f_{X|Y}(x|y)f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y')f_Y(y')dy'}. 516 | \end{equation*} 517 | 518 | \subsection{Independence} 519 | 520 | Two random variables $X$ and $Y$ are \textbf{independent} if $F_{XY}(x,y) = F_X(x) F_Y(y)$ for all values of 521 | $x$ and $y$. Equivalently, 522 | \begin{itemize} 523 | \item For discrete random variables, $p_{XY}(x,y) = p_X(x) p_Y(y)$ for all $x \in Val(X)$, $y \in Val(Y)$. 524 | \item For discrete random variables, $p_{Y|X}(y|x) = p_Y(y)$ whenever $p_X(x) \neq 0$ for all $y \in Val(Y)$.
525 | \item For continuous random variables, $f_{XY}(x,y) = f_X(x) f_Y(y)$ for all $x,y \in \mathbb{R}$. 526 | \item For continuous random variables, $f_{Y|X}(y|x) = f_Y(y)$ whenever $f_X(x) \neq 0$ for all $y \in \mathbb{R}$. 527 | \end{itemize} 528 | 529 | Informally, two random variables $X$ and $Y$ are \textbf{independent} 530 | if ``knowing'' the value of one variable will never have any effect on 531 | the conditional probability distribution of the other variable, that 532 | is, you know all the information about the pair $(X,Y)$ by just 533 | knowing $f(x)$ and $f(y)$. The following lemma formalizes this 534 | observation: 535 | \begin{lem} 536 | If $X$ and $Y$ are independent then for any subsets $A, B \subseteq \mathbb{R}$, we have, 537 | \begin{eqnarray*} 538 | P(X \in A, Y \in B)=P(X \in A) P(Y \in B) 539 | \end{eqnarray*} 540 | \end{lem} 541 | By using the above lemma one can prove that if $X$ is independent of $Y$ then any function of $X$ is independent of any function of $Y$. 542 | 543 | \subsection{Expectation and covariance} 544 | 545 | Suppose that we have two discrete random variables $X, Y$ and $g: \mathbf{R}^2 \longrightarrow \mathbf{R}$ is a function of these two random variables. Then the expected value of $g$ is defined in the following way, 546 | \begin{equation*} 547 | E[g(X,Y)] \triangleq \sum_{x \in Val(X)} \sum_{y \in Val(Y)} g(x,y) p_{XY}(x,y). 548 | \end{equation*} 549 | For continuous random variables $X,Y$, the analogous expression is 550 | \begin{equation*} 551 | E[g(X,Y)]= \int_{-\infty}^{\infty} \int_{- \infty}^{\infty} g(x,y) f_{XY}(x,y) dxdy. 552 | \end{equation*} 553 | 554 | We can use the concept of expectation to study the relationship of two random variables with each other. In particular, the \textbf{covariance} of two random variables $X$ and $Y$ is defined as 555 | \begin{eqnarray*} 556 | Cov[X,Y] 557 | &\triangleq& E[(X-E[X])(Y-E[Y])] \\ 558 | \end{eqnarray*} 559 | Using an argument similar to that for variance, we can rewrite this as, 560 | \begin{eqnarray*} 561 | Cov[X,Y] 562 | &=& E[(X-E[X])(Y-E[Y])] \\ 563 | &=& E[XY - X E[Y] - Y E[X] + E[X] E[Y]] \\ 564 | &=& E[XY] - E[X] E[Y] - E[Y] E[X] + E[X] E[Y]] \\ 565 | &=& E[XY] - E[X] E[Y]. 566 | \end{eqnarray*} 567 | Here, the key step in showing the equality of the two forms of covariance is in the third equality, where we use the fact that $E[X]$ and $E[Y]$ are actually constants which 568 | can be pulled out of the expectation. When $Cov[X,Y] = 0$, we say that $X$ and $Y$ are \textbf{uncorrelated}\footnote{ 569 | However, this is not the same thing as stating that $X$ and $Y$ are independent! For example, if $X \sim Uniform(-1,1)$ and $Y = X^2$, then one can show that $X$ and $Y$ are 570 | uncorrelated, even though they are not independent. 571 | }. 572 | 573 | \textbf{Properties}: 574 | \begin{description} 575 | \item[-] (Linearity of expectation) $E[f(X,Y) + g(X,Y)] = E[f(X,Y)] + E[g(X,Y)]$. 576 | \item[-] $Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]$. 577 | \item[-] If $X$ and $Y$ are independent, then $Cov[X, Y] = 0$. 578 | \item[-] If $X$ and $Y$ are independent, then $E[f(X)g(Y)] = E[f(X)] E[g(Y)]$. 579 | \end{description} 580 | 581 | \section{Multiple random variables} 582 | 583 | The notions and ideas introduced in the previous section can be 584 | generalized to more than two random variables. In particular, suppose that we have $n$ continuous 585 | random variables, $X_1(\omega),X_2(\omega), \ldots X_n(\omega)$. 
In this section, for simplicity of presentation, 586 | we focus only on the continuous case, but the generalization to discrete random variables works similarly. 587 | 588 | \subsection{Basic properties} 589 | 590 | We can define the \textbf{joint distribution function} of $X_1,X_2,\ldots,X_n$, 591 | the \textbf{joint probability density function} of $X_1,X_2,\ldots,X_n$, 592 | the \textbf{marginal probability density function} of $X_1$, and 593 | the \textbf{conditional probability density function} of $X_1$ given $X_2,\ldots,X_n$, as 594 | 595 | \begin{eqnarray*} 596 | F_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n) &=& P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n) \\ 597 | f_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n) &=& \frac{\partial^n F_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n)}{\partial x_1 \ldots \partial x_n} \\ 598 | f_{X_1}(x_1) &=& \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty f_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n) dx_2 \ldots dx_n \\ 599 | f_{X_1 | X_2, \ldots, X_n}(x_1|x_2, \ldots x_n) &=& \frac{f_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n)}{f_{X_2, \ldots, X_n}(x_2, \ldots x_n)} 600 | \end{eqnarray*} 601 | 602 | To calculate the probability of an event $A \subseteq \mathbb{R}^n$ we have, 603 | \begin{equation} 604 | P((x_1,x_2, \ldots x_n) \in A)= \int_{(x_1,x_2, \ldots x_n) \in A} f_{X_1,X_2,\ldots,X_n}(x_1,x_2, \ldots x_n)dx_1dx_2 \ldots dx_n 605 | \end{equation} 606 | 607 | \textbf{Chain rule}: From the 608 | definition of conditional probabilities for multiple random variables, one can show that 609 | \begin{eqnarray*} 610 | f(x_1,x_2,\ldots,x_n) &=& f(x_n|x_1,x_2\ldots,x_{n-1}) f(x_1,x_2\ldots,x_{n-1}) \\ 611 | &=& f(x_n|x_1,x_2\ldots,x_{n-1}) f(x_{n-1} | x_1,x_2\ldots,x_{n-2}) f(x_1,x_2\ldots,x_{n-2}) \\ 612 | &=& \ldots\;\; =\;\; f(x_1) \prod_{i=2}^n f(x_i | x_1,\ldots,x_{i-1}). 613 | \end{eqnarray*} 614 | 615 | \textbf{Independence}: 616 | For multiple events, $A_1,\ldots,A_k$, 617 | we say that $A_1,\ldots,A_k$ are \textbf{mutually independent} if for any 618 | subset $S \subseteq \{1,2,\ldots,k\}$, we have 619 | \begin{eqnarray*} 620 | P(\cap_{i \in S} A_i) = \prod_{i \in S} P(A_i). 621 | \end{eqnarray*} 622 | Likewise, we say that random variables $X_1,\ldots,X_n$ are independent if 623 | \begin{eqnarray*} 624 | f(x_1,\ldots,x_n) = f(x_1)f(x_2) \cdots f(x_n). 625 | \end{eqnarray*} 626 | Here, the definition of mutual independence is simply the natural generalization of independence of two random variables to multiple random variables. 627 | 628 | Independent random variables arise often in machine learning 629 | algorithms where we assume that the training examples belonging to the 630 | training set represent independent samples from some unknown 631 | probability distribution. To make the significance of independence clear, consider a ``bad'' 632 | training set in which we first sample a single training example 633 | $(x^{(1)},y^{(1)})$ from some unknown distribution, and then add 634 | $m-1$ copies of the exact same training example to the training set. 635 | In this case, we have (with some abuse of notation) 636 | \begin{equation*} 637 | P( (x^{(1)},y^{(1)}), \ldots, (x^{(m)},y^{(m)}) ) \neq \prod_{i=1}^m P(x^{(i)},y^{(i)}). 638 | \end{equation*} 639 | Despite the fact that the training set has size $m$, the examples are 640 | not independent!
While clearly the procedure described here is not a 641 | sensible method for building a training set for a machine learning 642 | algorithm, it turns out that in practice, non-independence of samples 643 | does come up often, and it has the effect of reducing the ``effective 644 | size'' of the training set. 645 | 646 | \subsection{Random vectors} 647 | Suppose that we have $n$ random variables. When working with all these 648 | random variables together, we will often find it convenient to put 649 | them in a vector $X=[X_1\; X_2\; \ldots\; X_n]^T$. We call the 650 | resulting vector a \textbf{random vector} (more formally, a random 651 | vector is a mapping from $\Omega$ to $\mathbb{R}^n$). It should be 652 | clear that random vectors are simply an alternative notation for dealing with $n$ 653 | random variables, so the notions of joint PDF and CDF will apply to 654 | random vectors as well. 655 | 656 | \textbf{Expectation}: 657 | Consider an 658 | arbitrary function from $g:\mathbb{R}^n \rightarrow \mathbb{R} $. The expected value of this function is defined as 659 | \begin{equation} 660 | E[g(X)]=\int_{\mathbb{R}^n} g(x_1,x_2, \ldots, x_n)f_{X_1,X_2,\ldots,X_n} (x_1,x_2, \ldots x_n) dx_1 dx_2 \ldots dx_n, 661 | \end{equation} 662 | where $\int_{\mathbb{R}^n}$ is $n$ consecutive integrations from $-\infty$ to $\infty$. If $g$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$, then the expected value of $g$ is the element-wise expected values of the output vector, i.e., if $g$ is 663 | \begin{eqnarray*} 664 | g\mathbf(x) = 665 | \begin{bmatrix} 666 | g_1\mathbf(x) \\ 667 | g_2\mathbf(x) \\ 668 | \vdots\\ 669 | g_m\mathbf(x) 670 | \end{bmatrix}, 671 | \end{eqnarray*} 672 | Then, 673 | \begin{eqnarray*} 674 | E[g\mathbf(X)] = 675 | \begin{bmatrix} 676 | E[g_1\mathbf(X)] \\ 677 | E[g_2\mathbf(X)] \\ 678 | \vdots\\ 679 | E[g_m\mathbf(X)] 680 | \end{bmatrix}. 681 | \end{eqnarray*} 682 | 683 | \textbf{Covariance matrix}: For a given random vector 684 | $X : \Omega \rightarrow \mathbb{R}^n$, its covariance matrix $\Sigma$ 685 | is the $n \times n$ square matrix whose entries are given by $\Sigma_{ij} = 686 | Cov[X_i,X_j]$. 687 | 688 | From the definition of covariance, we have 689 | \begin{eqnarray*} 690 | \Sigma 691 | &=& 692 | \begin{bmatrix} 693 | Cov[X_1,X_1] & \cdots & Cov[X_1,X_n] \\ 694 | \vdots & \ddots & \vdots \\ 695 | Cov[X_n,X_1] & \cdots & Cov[X_n,X_n] \\ 696 | \end{bmatrix} \\ 697 | &=& 698 | \begin{bmatrix} 699 | E[X_1^2] - E[X_1]E[X_1] & \cdots & E[X_1X_n] - E[X_1]E[X_n] \\ 700 | \vdots & \ddots & \vdots \\ 701 | E[X_nX_1] - E[X_n]E[X_1] & \cdots & E[X_n^2] - E[X_n]E[X_n] \\ 702 | \end{bmatrix} \\ 703 | &=& 704 | \begin{bmatrix} 705 | E[X_1^2] & \cdots & E[X_1X_n] \\ 706 | \vdots & \ddots & \vdots \\ 707 | E[X_nX_1] & \cdots & E[X_n^2] \\ 708 | \end{bmatrix} - 709 | \begin{bmatrix} 710 | E[X_1]E[X_1] & \cdots & E[X_1]E[X_n] \\ 711 | \vdots & \ddots & \vdots \\ 712 | E[X_n]E[X_1] & \cdots & E[X_n]E[X_n] \\ 713 | \end{bmatrix} \\ 714 | &=& E[XX^T] - E[X]E[X]^T = \ldots = E[(X - E[X])(X - E[X])^T]. 715 | \end{eqnarray*} 716 | where the matrix expectation is defined in the obvious way. 717 | 718 | The covariance matrix has a number of useful properties: 719 | \begin{description} 720 | \item[-] $\Sigma \succeq 0$; that is, $\Sigma$ is positive semidefinite. 721 | \item[-] $\Sigma = \Sigma^T$; that is, $\Sigma$ is symmetric. 
722 | \end{description} 723 | 724 | \subsection{The multivariate Gaussian distribution} 725 | 726 | One particularly important example of a probability distribution over 727 | random vectors $X$ is called the \textbf{multivariate Gaussian} or 728 | \textbf{multivariate normal} distribution. A random vector $X \in 729 | \mathbb{R}^n$ is said to have a multivariate normal (or Gaussian) 730 | distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix 731 | $\Sigma \in \mathbb{S}^{n}_{++}$ (where $\mathbb{S}^{n}_{++}$ refers 732 | to the space of symmetric positive definite $n \times n$ matrices) if its probability density function is given by 733 | \begin{align*} 734 | f_{X_1,X_2,\ldots,X_n}(x_1,x_2,\ldots,x_n;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp \left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right). 735 | \end{align*} 736 | We write this as $X \sim \mathcal{N}(\mu,\Sigma)$. 737 | Notice that in the case $n=1$, this reduces to the regular definition of 738 | a normal distribution with mean parameter $\mu_1$ and variance 739 | $\Sigma_{11}$. 740 | 741 | Generally speaking, Gaussian random variables are extremely useful in 742 | machine learning and statistics for two main reasons. First, they are 743 | extremely common when modeling ``noise'' in statistical algorithms. 744 | Quite often, noise can be considered to be the accumulation of a large 745 | number of small independent random perturbations affecting the 746 | measurement process; by the Central Limit Theorem, summations of 747 | independent random variables will tend to ``look Gaussian.'' Second, 748 | Gaussian random variables are convenient for many analytical 749 | manipulations, because many of the integrals involving Gaussian 750 | distributions that arise in practice have simple closed form 751 | solutions. We will encounter this later in the course. 752 | 753 | \section{Other resources} 754 | 755 | A good textbook on probability at the level needed for CS229 is the book, 756 | \textit{A First Course in Probability} by Sheldon Ross. 757 | 758 | \end{document} 759 | -------------------------------------------------------------------------------- /Linear Algebra Review/linalg2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{article} 2 | \usepackage{geometry} 3 | \geometry{letterpaper, height=8.5in,width=6.5in} 4 | \usepackage{graphicx} 5 | \usepackage{amsmath} 6 | \usepackage{amssymb} 7 | \usepackage{pstricks} 8 | 9 | 10 | 11 | \def\shownotes{0} %set 1 to show author notes 12 | \ifnum\shownotes=1 13 | \newcommand{\authnote}[2]{$\ll$\textsf{\footnotesize #1 notes: #2}$\gg$} 14 | \else 15 | \newcommand{\authnote}[2]{} 16 | \fi 17 | \newcommand{\Tnote}[1]{{\color{blue}\authnote{Tengyu}{#1}}} 18 | 19 | \newcommand{\commentout}[1]{} 20 | 21 | \title{Linear Algebra Review and Reference} 22 | \author{Zico Kolter (updated by Chuong Do and Tengyu Ma)} 23 | 24 | \begin{document} 25 | 26 | \maketitle 27 | \tableofcontents 28 | 29 | \section{Basic Concepts and Notation} 30 | Linear algebra provides a way of compactly representing and operating 31 | on sets of linear equations. For example, consider the following 32 | system of equations: 33 | \[\begin{array}{rcrcl} 4 x_1 & - & 5 x_2 & = & -13 \\ 34 | -2 x_1 & + & 3 x_2 & = & 9.
\end{array}\\ \] 35 | 36 | This is two equations and two variables, so as you know from high 37 | school algebra, you can find a unique solution for $x_1$ and $x_2$ (unless 38 | the equations are somehow degenerate, for example if the second 39 | equation is simply a multiple of the first, but in the case above 40 | there is in fact a unique solution). In matrix notation, we can write 41 | the system more compactly as 42 | \[Ax = b\] 43 | with 44 | \[A = \left [ \begin{array}{cc} 4 & -5 \\ -2 & 3 45 | \end{array} \right ], \quad b = \left [ \begin{array}{c} -13 \\ 46 | 9 \end{array} \right ]. \] 47 | 48 | As we will see shortly, there are many advantages (including the 49 | obvious space savings) to analyzing linear equations in this form. 50 | 51 | \subsection{Basic Notation} 52 | 53 | We use the following notation: 54 | 55 | \begin{itemize} 56 | 57 | \item By $A \in \mathbb{R}^{m \times n}$ we denote a matrix with $m$ rows 58 | and $n$ columns, where the entries of $A$ are real numbers. 59 | 60 | \item By $x \in \mathbb{R}^n$, we denote a vector with $n$ entries. 61 | By convention, an $n$-dimensional vector is often thought of as a 62 | matrix with $n$ rows and 1 column, known as a \textbf{\textit{column vector}}. 63 | If we want to explicitly 64 | represent a \textbf{\textit{row vector}} --- a matrix with 1 row and 65 | $n$ columns --- we typically write $x^T$ (here $x^T$ denotes the 66 | transpose of $x$, which we will define shortly). 67 | 68 | \item The $i$th element of a vector $x$ is denoted $x_i$: \[ x = \left 69 | [ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right 70 | ]. \] 71 | 72 | \item We use the notation $a_{ij}$ (or $A_{ij}$, 73 | $A_{i,j}$, etc) to denote the entry of $A$ in the $i$th row 74 | and $j$th column: \[A = \left [ \begin{array}{cccc} a_{11} & 75 | a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ 76 | \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & 77 | a_{mn} \end{array} \right ].\] 78 | 79 | \item We denote the $j$th column of $A$ by $a^j$ or $A_{:,j}$: \[ A = 80 | \left [ \begin{array}{cccc} | & | & & 81 | | \\ a^1 & a^2 & \cdots & a^n \\ | & | & & | 82 | \end{array} \right ]. \] 83 | 84 | \item We denote the $i$th row of $A$ by $a^T_i$ or $A_{i,:}$: \[ A = \left 85 | [ \begin{array}{ccc} \mbox{---} & a^T_1 & 86 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 87 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ]. \] 88 | 89 | \item Viewing a matrix as a collection of column or row vectors is very important and convenient in many cases. In general, it would be mathematically (and conceptually) cleaner to operate on the level of vectors instead of scalars. There is no universal convention for denoting the columns or rows of a matrix, and thus you can feel free to change the notations as long as it's explicitly defined. %We also note that the notation here for the columns or rows of a matrix is not necessarily universally adopted, therefore 90 | 91 | \end{itemize} 92 | 93 | \section{Matrix Multiplication} 94 | 95 | The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in 96 | \mathbb{R}^{n \times p}$ is the matrix \[C = AB \in \mathbb{R}^{m 97 | \times p},\] where \[C_{ij} = \sum_{k=1}^n A_{ik}B_{kj}.\] Note 98 | that in order for the matrix product to exist, the number of columns 99 | in $A$ must equal the number of rows in $B$. 
There are 100 | many other ways of looking at matrix multiplication that may be more convenient and insightful than the standard definition above, and we'll start by 101 | examining a few special cases. 102 | 103 | \subsection{Vector-Vector Products} 104 | 105 | Given two vectors $x, y \in \mathbb{R}^n$, the quantity $x^T y$, 106 | sometimes called the \textbf{\textit{inner product}} or 107 | \textbf{\textit{dot product}} of 108 | the vectors, is a real number given by 109 | \[x^T y \in \mathbb{R} = 110 | \left [ \begin{array}{cccc} x_1 & x_2 & \cdots & x_n \end{array} \right ] 111 | \left [ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right ] 112 | = \sum_{i=1}^n x_i y_i.\] 113 | Observe that inner products are really just special case of matrix multiplication. 114 | Note that it is always the case that $x^T y = y^T x$. 115 | 116 | Given vectors $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$ (not necessarily 117 | of the same size), $x y^T \in \mathbb{R}^{m \times n}$ is called the \textbf{\textit{outer 118 | product}} of the vectors. It is a matrix whose entries are given by 119 | $(x y^T)_{ij} = x_i y_j$, i.e., 120 | \[ x y^T \in \mathbb{R}^{m \times n} 121 | = \left [ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_m \end{array} \right ] 122 | \left [ \begin{array}{cccc} y_1 & y_2 & \cdots & y_n \end{array} \right ] 123 | = \left [ \begin{array}{cccc}x_1 124 | y_1 & x_1 y_2 & \cdots & x_1 125 | y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & 126 | \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n 127 | \end{array} \right ]. \] 128 | As an example of how the outer product can be useful, let $\mathbf{1} \in \mathbb{R}^n$ 129 | denote an $n$-dimensional vector whose entries are all equal to 1. Furthermore, 130 | consider the matrix $A \in \mathbb{R}^{m \times n}$ whose columns are all 131 | equal to some vector $x \in \mathbb{R}^m$. Using outer products, we can 132 | represent $A$ compactly as, 133 | \[A = \left [ \begin{array}{cccc} | & | & & 134 | | \\ x & x & \cdots & x \\ | & | & & | 135 | \end{array} \right ] 136 | = \left [ \begin{array}{cccc}x_1 137 | & x_1 & \cdots & x_1 138 | \\ x_2 & x_2 & \cdots & x_2 \\ \vdots & \vdots & 139 | \ddots & \vdots \\ x_m & x_m & \cdots & x_m 140 | \end{array} \right ] 141 | = \left [ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_m \end{array} \right ] 142 | \left [ \begin{array}{cccc} 1 & 1 & \cdots & 1 \end{array} \right ] 143 | = x \mathbf{1}^T. \] 144 | 145 | 146 | \subsection{Matrix-Vector Products} 147 | 148 | Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in 149 | \mathbb{R}^n$, their product is a vector $y = Ax \in \mathbb{R}^m$. 150 | There are a couple ways of looking at matrix-vector multiplication, 151 | and we will look at each of them in turn. 152 | 153 | If we write $A$ by rows, then we can express $Ax$ as, 154 | \[ y = Ax = \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & 155 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 156 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] x = \left [ 157 | \begin{array}{c} a^T_1 x \\ a^T_2 x \\ \vdots \\ a^T_m x 158 | \end{array} \right ]. \] 159 | In other words, the $i$th entry of $y$ is equal to the inner 160 | product of the $i$th \textit{row} of $A$ and $x$, $y_i = a_i^T x$. 161 | 162 | Alternatively, let's write $A$ in column form. 
In this case we see 163 | that, 164 | \begin{align} 165 | y = Ax = \left [ \begin{array}{cccc} | & | & & 166 | | \\ a^1 & a^2 & \cdots & a^n \\ | & | & & | 167 | \end{array} \right ] \left [ \begin{array}{c} x_1 \\ x_2 \\ \vdots 168 | \\ x_n \end{array} \right ] = \left [ \begin{array}{c} \\ a^1 169 | \\ \\ \end{array} \right ] x_1 + \left [ \begin{array}{c} \\ a^2 170 | \\ \\ \end{array} \right ] x_2 + \ldots + \left [ \begin{array}{c} 171 | \\ a^n \\ \\ \end{array} \right ] x_n \;\;. \label{eqn:2} 172 | \end{align} 173 | 174 | 175 | In other words, y is a \textbf{\textit{linear combination}} of the 176 | \textit{columns} of $A$, where the coefficients of the linear 177 | combination are given by the entries of $x$. 178 | 179 | So far we have been multiplying on the right by a column vector, but 180 | it is also possible to multiply on the left by a row vector. This is 181 | written, $y^T = x^T A$ for $A \in \mathbb{R}^{m \times n}$, $x \in 182 | \mathbb{R}^m$, and $y \in \mathbb{R}^n$. As before, we can express 183 | $y^T$ in two obvious ways, depending on whether we express $A$ in 184 | terms on its rows or columns. In the first case we express $A$ in 185 | terms of its columns, which gives 186 | \[y^T = x^T A = x^T \left [ \begin{array}{cccc} | & | & & 187 | | \\ a^1 & a^2 & \cdots & a^n \\ | & | & & | 188 | \end{array} \right ] = \left [ 189 | \begin{array}{cccc} x^T a^1 & x^T a^2 & \cdots & x^T a^n \end{array} 190 | \right ] \] 191 | which demonstrates that the $i$th entry of $y^T$ is equal to the inner 192 | product of $x$ and the $i$th \textit{column} of $A$. 193 | 194 | Finally, expressing $A$ in terms of rows we get the final 195 | representation of the vector-matrix product, 196 | \begin{eqnarray*} 197 | y^T & = & x^T A \\ 198 | & = & \left [ \begin{array}{cccc}x_1 & x_2 & \cdots & x_m 199 | \end{array} \right ] \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & 200 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 201 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] \\ & & \\ 202 | & = & x_1 \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & \mbox{---} 203 | \end{array} \right ] + x_2 \left [ \begin{array}{ccc} \mbox{---} & 204 | a^T_2 & \mbox{---} \end{array} \right ] + ... + x_m \left [ 205 | \begin{array}{ccc} \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] 206 | \end{eqnarray*} 207 | so we see that $y^T$ is a linear combination of the \textit{rows} of 208 | $A$, where the coefficients for the linear combination are given by 209 | the entries of $x$. 210 | 211 | \subsection{Matrix-Matrix Products}\label{subsec:matrix-matrix} 212 | 213 | Armed with this knowledge, we can now look at four different (but, of 214 | course, equivalent) ways of 215 | viewing the matrix-matrix multiplication $C = AB$ as defined at the 216 | beginning of this section. 217 | 218 | First, we can view matrix-matrix 219 | multiplication as a set of vector-vector products. The most obvious 220 | viewpoint, which follows 221 | immediately from the definition, is that the $(i,j)$th entry of $C$ is 222 | equal to the inner product of the $i$th row of $A$ and the $j$th column 223 | of $B$. 
Symbolically, this looks like the following,
224 | \[C = AB = \left [ \begin{array}{ccc} \mbox{---} & a^T_1 &
225 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\
226 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] \left [
227 | \begin{array}{cccc} | & | & & | \\ b^1 & b^2 & \cdots & b^p \\ | &
228 | | & & | \end{array} \right ] = \left [ \begin{array}{cccc}a^T_1 b^1
229 | & a^T_1 b^2 & \cdots & a^T_1 b^p \\ a^T_2 b^1 & a^T_2 b^2 & \cdots &
230 | a^T_2 b^p \\ \vdots & \vdots & \ddots & \vdots \\ a^T_m b^1 &
231 | a^T_m b^2 & \cdots & a^T_m b^p \end{array} \right ]. \]
232 | Remember that since $A \in \mathbb{R}^{m \times n}$ and $B \in
233 | \mathbb{R}^{n \times p}$, $a_i \in \mathbb{R}^n$ and $b^j \in
234 | \mathbb{R}^n$, so these inner products all make sense. This is the
235 | most ``natural'' representation when we represent $A$ by
236 | rows and $B$ by columns.
237 | Alternatively, we can represent $A$ by
238 | columns, and $B$ by rows. This representation leads to a much trickier interpretation of $AB$ as
239 | a sum of outer products. Symbolically,
240 | \[C = AB = \left [ \begin{array}{cccc} | & | & & | \\ a^1 & a^2 &
241 | \cdots & a^n \\ | & | & & | \end{array} \right ] \left [
242 | \begin{array}{ccc} \mbox{---} & b^T_1 & \mbox{---} \\ \mbox{---} &
243 | b^T_2 & \mbox{---} \\ & \vdots & \\ \mbox{---} & b^T_n &
244 | \mbox{---} \end{array} \right ] = \sum_{i=1}^n a^i
245 | b_i^T \;\;.\]
246 | Put another way, $AB$ is equal to the sum, over all $i$, of the outer
247 | product of the $i$th
248 | column of $A$ and the $i$th row of $B$. Since, in this case, $a^i \in
249 | \mathbb{R}^m$ and $b_i \in \mathbb{R}^p$, the dimension of the
250 | outer product $a^i b^T_i$ is $m \times p$, which coincides with the
251 | dimension of $C$. Chances are, the last equality above may appear
252 | confusing to you. If so, take the time to check it for yourself!
253 |
254 | Second, we can also view matrix-matrix multiplication as a set of
255 | matrix-vector products. Specifically, if we represent $B$ by columns,
256 | we can view the columns of $C$ as matrix-vector products between $A$
257 | and the columns of $B$. Symbolically,
258 | \begin{align}
259 | C = AB = A \left [ \begin{array}{cccc} | & | & & | \\ b^1 & b^2 &
260 | \cdots & b^p \\ | & | & & | \end{array} \right ] = \left [
261 | \begin{array}{cccc} | & | & & | \\ A b^1 & A b^2 & \cdots & A b^p \\ |
262 | & | & & | \end{array} \right ]. \label{eqn:1}
263 | \end{align}
264 | Here the $i$th column of $C$ is given by the matrix-vector product
265 | with the vector on the right, $c^{i} = A b^{i}$. These matrix-vector
266 | products can in turn be interpreted using both viewpoints given in the
267 | previous subsection. Finally, we have the analogous viewpoint, where
268 | we represent $A$ by rows, and view the rows of $C$ as the
269 | matrix-vector product between the rows of $A$ and the matrix $B$. Symbolically,
270 | \[C = AB = \left [ \begin{array}{ccc} \mbox{---} & a^T_1 &
271 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\
272 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] B = \left [
273 | \begin{array}{ccc} \mbox{---} & a^T_1 B & \mbox{---} \\ \mbox{---}
274 | & a^T_2 B & \mbox{---} \\ & \vdots & \\
275 | \mbox{---} & a^T_m B & \mbox{---} \end{array} \right ]. \]
276 | Here the $i$th row of $C$ is given by the matrix-vector product with
277 | the vector on the left, $c_i^T = a_i^T B$.
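If it helps to see these viewpoints concretely, the following short NumPy snippet (purely illustrative; the dimensions and the random seed are arbitrary choices) checks that the inner-product, outer-product, and column viewpoints above all reproduce the same product $C = AB$:
\begin{verbatim}
import numpy as np

m, n, p = 3, 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

C = A @ B                                  # the standard definition

# (i,j) entry = inner product of the i-th row of A and the j-th column of B
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(p)] for i in range(m)])

# sum over k of the outer product of the k-th column of A and the k-th row of B
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(n))

# the j-th column of C is A times the j-th column of B
C_cols = np.column_stack([A @ B[:, j] for j in range(p)])

assert np.allclose(C, C_inner)
assert np.allclose(C, C_outer)
assert np.allclose(C, C_cols)
\end{verbatim}
Each assertion corresponds to one of the viewpoints above; writing them out this way is often the quickest way to convince yourself that the equalities hold.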
278 |
279 | It may seem like overkill to dissect matrix multiplication to such a
280 | large degree, especially when all these viewpoints follow immediately
281 | from the initial definition we gave (in about a line of math) at the
282 | beginning of this section. {\bf The direct advantage of these various viewpoints is that they allow you to operate on the level/unit of vectors instead of scalars.} To fully understand linear algebra without getting lost in complicated manipulations of indices, the key is to operate with the largest units possible, i.e., to work with whole vectors and matrices rather than with their individual scalar entries. \footnote{E.g., if you can write all your math derivations in terms of matrices or vectors, that is better than doing them with scalar elements.}
283 | \Tnote{I am not sure what's the best way to write this philosophical remark about doing math.. Any better idea?}
284 |
285 | Virtually all of linear algebra
286 | deals with matrix multiplications of some kind, and it is worthwhile
287 | to spend some time trying to develop an intuitive understanding of the
288 | viewpoints presented here.
289 |
290 | In addition to this, it is useful to know a few basic properties of
291 | matrix multiplication at a higher level:
292 | \begin{itemize}
293 | \item Matrix multiplication is associative: $(AB)C = A(BC)$.
294 | \item Matrix multiplication is distributive: $A(B + C) = AB + AC$.
295 | \item Matrix multiplication is, in general, \textit{not} commutative;
296 | that is, it can be the case that $AB \neq BA$. (For example,
297 | if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times q}$,
298 | the matrix product $BA$ does not even exist if $m$ and $q$ are not equal!)
299 | \end{itemize}
300 |
301 | If you are not familiar with these properties, take the time to verify
302 | them for yourself. For example, to check the associativity of matrix
303 | multiplication, suppose that $A \in \mathbb{R}^{m \times n}$,
304 | $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{p \times q}$.
305 | Note that $AB \in \mathbb{R}^{m \times p}$, so $(AB)C \in \mathbb{R}^{m \times q}$.
306 | Similarly, $BC \in \mathbb{R}^{n \times q}$, so $A(BC) \in \mathbb{R}^{m \times q}$.
307 | Thus, the dimensions of the resulting matrices agree. To show that
308 | matrix multiplication is associative, it suffices to check that the $(i,j)$th
309 | entry of $(AB)C$ is equal to the $(i,j)$th entry of $A(BC)$. We can verify
310 | this directly using the definition of matrix multiplication:
311 | \begin{eqnarray*}
312 | ((AB)C)_{ij} &=& \sum_{k=1}^p (AB)_{ik}C_{kj} = \sum_{k=1}^p \left(\sum_{l=1}^n A_{il} B_{lk}\right) C_{kj} \\
313 | &=& \sum_{k=1}^p \left(\sum_{l=1}^n A_{il} B_{lk} C_{kj} \right) = \sum_{l=1}^n \left(\sum_{k=1}^p A_{il} B_{lk} C_{kj} \right) \\
314 | &=& \sum_{l=1}^n A_{il} \left(\sum_{k=1}^p B_{lk} C_{kj}\right) = \sum_{l=1}^n A_{il} (BC)_{lj} = (A(BC))_{ij}.
315 | \end{eqnarray*}
316 | Here, the first two and last two equalities simply use the definition of matrix multiplication,
317 | the third and fifth equalities use the distributive property for \textit{scalar multiplication over addition}, and
318 | the fourth equality uses the \textit{commutativity and associativity of scalar addition}. This technique for proving
319 | matrix properties by reduction to simple scalar properties will come up often, so make sure you're familiar with it.
320 |
321 | \section{Operations and Properties}
322 |
323 | In this section we present several operations and properties of
324 | matrices and vectors.
Hopefully a great deal of this will be review 325 | for you, so the notes can just serve as a reference for these topics. 326 | 327 | \subsection{The Identity Matrix and Diagonal Matrices} 328 | The \textbf{\textit{identity matrix}}, denoted $I \in \mathbb{R}^{n 329 | \times n}$, is a square matrix with ones on the diagonal and zeros 330 | everywhere else. That is, 331 | \[ I_{ij} = \left \{ \begin{array}{ll}1 & i = j \\ 0 & i \neq j 332 | \end{array} \right . \] 333 | It has the property that for all $A \in \mathbb{R}^{m \times n}$, \[AI 334 | = A = IA.\] Note that in some sense, the notation for the identity matrix 335 | is ambiguous, since it does not specify the dimension of $I$. Generally, 336 | the dimensions of $I$ are inferred from context so as to make matrix multiplication 337 | possible. For example, in the 338 | equation above, the $I$ in $AI = A$ is an $n \times n$ matrix, whereas 339 | the $I$ in $A = IA$ is an $m \times m$ matrix. 340 | 341 | A \textbf{\textit{diagonal matrix}} is a matrix where all non-diagonal 342 | elements are 0. This is typically denoted $D = \mathrm{diag}(d_1, 343 | d_2, \ldots, d_n)$, with 344 | \[ D_{ij} = \left \{ \begin{array}{ll}d_i & i = j \\ 0 & i \neq 345 | j\end{array} \right . \] 346 | Clearly, $I = \mathrm{diag}(1,1,\ldots,1)$. 347 | 348 | \subsection{The Transpose} 349 | The \textbf{\textit{transpose}} of a matrix results from ``flipping'' 350 | the rows and columns. Given a matrix $A \in \mathbb{R}^{m \times n}$, 351 | its transpose, written $A^T \in \mathbb{R}^{n \times m}$, is the $n \times m$ matrix 352 | whose entries are given by 353 | \[ (A^T)_{ij} = A_{ji}.\] 354 | We have in fact already been using the transpose when describing row 355 | vectors, since the transpose of a column vector is naturally a row 356 | vector. 357 | 358 | The following properties of transposes are easily verified: 359 | \begin{itemize} 360 | \item $(A^T)^T = A$ 361 | \item $(AB)^T = B^T A^T$ 362 | \item $(A + B)^T = A^T + B^T$ 363 | \end{itemize} 364 | 365 | \subsection{Symmetric Matrices} 366 | A square matrix $A \in \mathbb{R}^{n \times n}$ is 367 | \textbf{\textit{symmetric}} if $A = A^T$. It is 368 | \textbf{\textit{anti-symmetric}} if $A = -A^T$. It is easy to show 369 | that for any matrix $A \in \mathbb{R}^{n \times n}$, the matrix 370 | $A + A^T$ is symmetric and the matrix $A - A^T$ is anti-symmetric. 371 | From this it follows that any square matrix $A \in \mathbb{R}^{n 372 | \times n}$ can be represented as a sum of a symmetric matrix and an 373 | anti-symmetric matrix, since 374 | \[A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)\] 375 | and the first matrix on the right is symmetric, while the second is 376 | anti-symmetric. It turns out that symmetric matrices occur a great 377 | deal in practice, and they have many nice properties which we will 378 | look at shortly. 
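As a quick numerical illustration of the decomposition above (a minimal NumPy sketch; the random matrix and the checks are purely illustrative):
\begin{verbatim}
import numpy as np

A = np.random.default_rng(1).standard_normal((4, 4))
S = 0.5 * (A + A.T)            # symmetric part
K = 0.5 * (A - A.T)            # anti-symmetric part

assert np.allclose(S, S.T)     # S is symmetric
assert np.allclose(K, -K.T)    # K is anti-symmetric
assert np.allclose(S + K, A)   # the two parts sum back to A
\end{verbatim}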
It is common to denote the set of all symmetric 379 | matrices of size $n$ as $\mathbb{S}^n$, so that $A \in \mathbb{S}^n$ 380 | means that $A$ is a symmetric $n \times n$ matrix; 381 | 382 | \subsection{The Trace} 383 | The \textbf{\textit{trace}} of a square matrix $A \in \mathbb{R}^{n 384 | \times n}$, denoted $\mathrm{tr}(A)$ (or just $\mathrm{tr}A$ if the 385 | parentheses are obviously implied), is the 386 | sum of diagonal elements in the matrix: 387 | \[\mathrm{tr}A = \sum_{i=1}^n A_{ii}.\] 388 | As described in the CS229 lecture notes, the trace has the following 389 | properties (included here for the sake of completeness): 390 | \begin{itemize} 391 | \item For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}A = 392 | \mathrm{tr}A^T$. 393 | \item For $A,B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = 394 | \mathrm{tr}A + \mathrm{tr}B$. 395 | \item For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, 396 | $\mathrm{tr}(tA) = t \; \mathrm{tr}A$. 397 | \item For $A,B$ such that $AB$ is square, $\mathrm{tr}AB = 398 | \mathrm{tr}BA$. 399 | \item For $A,B,C$ such that $ABC$ is square, $\mathrm{tr}ABC = 400 | \mathrm{tr}BCA = \mathrm{tr}CAB$, and so on for the product of 401 | more matrices. 402 | \end{itemize} 403 | 404 | As an example of how these properties can be proven, we'll consider the 405 | fourth property given above. Suppose that $A \in \mathbb{R}^{m \times n}$ 406 | and $B \in \mathbb{R}^{n \times m}$ (so that $AB \in \mathbb{R}^{m \times m}$ is a square matrix). 407 | Observe that $BA \in \mathbb{R}^{n \times n}$ is also a square matrix, so it makes sense 408 | to apply the trace operator to it. To verify that $\mathrm{tr} AB = \mathrm{tr} BA$, note that 409 | \begin{eqnarray*} 410 | \mathrm{tr} AB &=& \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \left(\sum_{j=1}^n A_{ij} B_{ji} \right) \\ 411 | &=& \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ji} = \sum_{j=1}^n \sum_{i=1}^m B_{ji} A_{ij} \\ 412 | &=& \sum_{j=1}^n \left(\sum_{i=1}^m B_{ji} A_{ij}\right) = \sum_{j=1}^n (BA)_{jj} = \mathrm{tr} BA. 413 | \end{eqnarray*} 414 | Here, the first and last two equalities use the definition of the trace operator and matrix multiplication. 415 | The fourth equality, where the main work occurs, uses the commutativity of scalar multiplication in order 416 | to reverse the order of the terms in each product, and the commutativity and associativity of scalar addition 417 | in order to rearrange the order of the summation. 418 | 419 | \commentout{ 420 | As an example of how these properties can be proven, we'll consider the last 421 | property given above. Suppose that $A \in \mathbb{R}^{m \times n}$, 422 | $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{p \times m}$ (so that $ABC \in \mathbb{R}^{m \times m}$ is a square matrix). 423 | In this case, it is easy to see that $BCA \in \mathbb{R}^{n \times n}$ is also 424 | a square matrix. To verify that $\mathrm{tr}ABC = \mathrm{tr}BCA$, observe that 425 | \begin{eqnarray*} 426 | \mathrm{tr} ABC &=& \sum_{i=1}^m (ABC)_{ii} = \sum_{i=1}^m \left(\sum_{j=1}^p (AB)_{ij} C_{ji} \right) = \sum_{i=1}^m \left(\sum_{j=1}^p \left( \sum_{k=1}^n A_{ik} B_{kj} \right) C_{ji} \right) \\ 427 | &=& \sum_{i=1}^m \sum_{j=1}^p \sum_{k=1}^n A_{ik} B_{kj} C_{ji} = \sum_{k=1}^n \sum_{i=1}^m \sum_{j=1}^p B_{kj} C_{ji} A_{ik} \\ 428 | &=& \sum_{k=1}^n \sum_{i=1}^m \left( \sum_{j=1}^p B_{kj} C_{ji} \right) A_{ik} = \sum_{k=1}^n \sum_{i=1}^m (BC)_{ki} A_{ik} \\ 429 | &=& \sum_{k=1}^n (BCA)_{kk} = \mathrm{tr} BCA. 
430 | \end{eqnarray*} 431 | } 432 | 433 | \subsection{Norms} 434 | 435 | A \textbf{\textit{norm}} of a vector $\|x\|$ is informally a measure of 436 | the ``length'' of the vector. For example, we have the 437 | commonly-used Euclidean or $\ell_2$ norm, 438 | \[\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.\] 439 | Note that $\|x\|_2^2 = x^Tx$. 440 | 441 | More formally, a norm is any function $f : \mathbb{R}^{n} \rightarrow 442 | \mathbb{R}$ that satisfies 4 properties: 443 | \begin{enumerate} 444 | \item For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity). 445 | \item $f(x) = 0$ if and only if $x = 0$ (definiteness). 446 | \item For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = 447 | |t|f(x)$ (homogeneity). 448 | \item For all $x,y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ 449 | (triangle inequality). 450 | \end{enumerate} 451 | Other examples of norms are the $\ell_1$ norm, 452 | \[\|x\|_1 = \sum_{i=1}^n |x_i| \] 453 | and the $\ell_\infty$ norm, 454 | \[\|x\|_\infty = \mathrm{max}_i\, |x_i|.\] 455 | In fact, all three norms presented so far are examples of the family 456 | of $\ell_p$ norms, which are parameterized by a real number $p \geq 457 | 1$, and defined as 458 | \[\|x\|_p = \left ( \sum_{i=1}^n |x_i|^p \right )^{1/p}.\] 459 | 460 | Norms can also be defined for matrices, such as the Frobenius norm, 461 | \[\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} = 462 | \sqrt{\mathrm{tr}(A^T A)}. \] Many other norms exist, but they are 463 | beyond the scope of this review. 464 | 465 | \subsection{Linear Independence and Rank} 466 | A set of vectors $\{x_1, x_2, \ldots x_n\} \subset \mathbb{R}^m\ $ is said to be 467 | \textbf{\textit{(linearly) independent}} if no vector can be represented 468 | as a linear combination of the remaining vectors. Conversely, if one 469 | vector belonging to the set \textit{can} be represented as a linear combination of 470 | the remaining vectors, then the vectors are said to be \textbf{\textit{(linearly) 471 | dependent}}. That is, if 472 | \[x_n = \sum_{i=1}^{n-1} \alpha_i x_i\] 473 | for some scalar values $\alpha_1, \ldots, \alpha_{n-1} \in \mathbb{R}$, then we say that 474 | the vectors $x_1, \ldots, x_n$ are linearly dependent; otherwise, the vectors are 475 | linearly independent. For example, the vectors 476 | \[ 477 | x_1 = \left[ \begin{array}{c} 1 \\ 2 \\ 3 \end{array} \right] \quad 478 | x_2 = \left[ \begin{array}{c} 4 \\ 1 \\ 5 \end{array} \right] \quad 479 | x_3 = \left[ \begin{array}{c} 2 \\ -3 \\ -1 \end{array} \right] 480 | \] 481 | are linearly dependent because $x_3 = -2x_1 + x_2$. 482 | 483 | The \textbf{\textit{column rank}} of a matrix $A \in \mathbb{R}^{m \times n}$ is the 484 | size of the largest subset of columns of $A$ that constitute a linearly independent 485 | set. With some abuse of terminology, this is often referred to simply as the number of linearly 486 | independent columns of $A$. In the same way, the 487 | \textbf{\textit{row rank}} is the largest number of rows of $A$ that 488 | constitute a linearly independent set. 489 | 490 | For any matrix $A \in \mathbb{R}^{m \times n}$, it turns out that 491 | the column rank of $A$ is equal to the row rank of $A$ (though we will not prove this), and so both quantities 492 | are referred to collectively as the \textbf{\textit{rank}} of $A$, denoted as 493 | $\mathrm{rank}(A)$. The following are some basic properties of the 494 | rank: 495 | \begin{itemize} 496 | \item For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \leq 497 | \mathrm{min}(m,n)$. 
If $\mathrm{rank}(A) = \mathrm{min}(m,n)$,
498 | then $A$ is said to be \textbf{\textit{full rank}}.
499 | \item For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) =
500 | \mathrm{rank}(A^T)$.
501 | \item For $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n
502 | \times p}$, $\mathrm{rank}(AB) \leq \mathrm{min}(\mathrm{rank}(A),
503 | \mathrm{rank}(B))$.
504 | \item For $A,B \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A + B)
505 | \leq \mathrm{rank}(A) + \mathrm{rank}(B)$.
506 | \end{itemize}
507 |
508 | \subsection{The Inverse of a Square Matrix}
509 |
510 | The \textbf{\textit{inverse}} of a square matrix $A \in \mathbb{R}^{n
511 | \times n}$ is denoted $A^{-1}$, and is the unique matrix such that
512 | \[A^{-1} A = I = A A^{-1}.\]
513 | Note that not all matrices have inverses. Non-square matrices, for example,
514 | do not have inverses by definition. However, for some square matrices
515 | $A$, it may still be the case that $A^{-1}$ does not exist. In particular,
516 | we say that $A$ is \textbf{\textit{invertible}} or
517 | \textbf{\textit{non-singular}} if $A^{-1}$ exists and
518 | \textbf{\textit{non-invertible}} or \textbf{\textit{singular}}
519 | otherwise.\footnote{It's easy to get confused and think that non-singular means non-invertible. But in fact, it means the opposite! Watch out!}
520 |
521 | In order for a square matrix $A$ to have an inverse $A^{-1}$,
522 | $A$ must be full rank. We will soon see that there are many
523 | alternative sufficient and necessary conditions, in addition to full
524 | rank, for invertibility.
525 |
526 | The following are properties of the inverse;
527 | all assume that $A, B \in \mathbb{R}^{n \times n}$ are non-singular:
528 | \begin{itemize}
529 | \item $(A^{-1})^{-1} = A$
530 | \item $(AB)^{-1} = B^{-1} A^{-1}$
531 | \item $(A^{-1})^T = (A^T)^{-1}$. For this reason this matrix is often
532 | denoted $A^{-T}$.
533 | \end{itemize}
534 | As an example of how the inverse is used, consider the linear system of equations,
535 | $Ax = b$ where $A \in \mathbb{R}^{n \times n}$, and $x, b \in \mathbb{R}^n$. If
536 | $A$ is nonsingular (i.e., invertible), then $x = A^{-1} b$.
537 |
538 | (What if $A \in \mathbb{R}^{m \times n}$
539 | is not a square matrix? Does this work?)
540 |
541 | \Tnote{Add more about pseudo-inverse}
542 |
543 | \subsection{Orthogonal Matrices}
544 | Two vectors $x,y \in \mathbb{R}^n$ are \textbf{\textit{orthogonal}} if
545 | $x^T y = 0$. A vector $x \in \mathbb{R}^n$ is
546 | \textbf{\textit{normalized}} if $\|x\|_2 = 1$. A square matrix $U \in
547 | \mathbb{R}^{n \times n}$ is \textbf{\textit{orthogonal}} (note the
548 | different meanings when talking about vectors versus matrices) if all
549 | its columns are orthogonal to each other and are normalized (the
550 | columns are then referred to as being \textbf{\textit{orthonormal}}).
551 |
552 | It follows immediately from the definition of orthogonality and
553 | normality that
554 | \[U^TU = I = UU^T.\]
555 | In other words, the inverse of an orthogonal matrix is its transpose.
556 | Note that if $U$ is not square --- i.e., $U \in \mathbb{R}^{m \times
557 | n},\;\; n < m$ --- but its columns are still orthonormal, then $U^TU
558 | = I$, but $U U^T \neq I$. We generally only use the term orthogonal
559 | to describe the previous case, where $U$ is square.
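To make the last two subsections concrete, here is a small NumPy sketch (illustrative only; in practice one prefers a linear solver over forming $A^{-1}$ explicitly) that solves $Ax = b$ both ways and checks $U^TU = UU^T = I$ for an orthogonal matrix obtained from a QR factorization:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Solving Ax = b with the explicit inverse and with a linear solver;
# the solver is numerically preferable, but both agree here.
x_inv = np.linalg.inv(A) @ b
x_solve = np.linalg.solve(A, b)
assert np.allclose(x_inv, x_solve)
assert np.allclose(A @ x_solve, b)

# An orthogonal matrix from a QR factorization: U^T U = U U^T = I.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(U @ U.T, np.eye(n))
\end{verbatim}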
560 |
561 | Another nice property of orthogonal matrices is that operating on a
562 | vector with an orthogonal matrix will not change its Euclidean norm,
563 | i.e.,
564 | \begin{align}
565 | \|Ux\|_2 = \|x\|_2 \label{eqn:preserve-norm}
566 | \end{align}
567 | for any $x \in \mathbb{R}^n$, $U \in \mathbb{R}^{n \times n}$
568 | orthogonal.
569 |
570 | \subsection{Range and Nullspace of a Matrix}
571 | The \textbf{\textit{span}} of a set of vectors $\{x_1, x_2, \ldots
572 | x_n\}$ is the set of all vectors that can be expressed as a linear
573 | combination of $\{x_1, \ldots, x_n\}$. That is,
574 | \[\mathrm{span}(\{x_1, \ldots x_n\}) = \left \{v : v = \sum_{i=1}^n \alpha_i
575 | x_i, \;\; \alpha_i \in
576 | \mathbb{R} \right \}. \]
577 | It can be shown that if $\{x_1, \ldots, x_n\}$ is a set of $n$
578 | linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then
579 | $\mathrm{span}(\{x_1, \ldots x_n\}) = \mathbb{R}^n$. In other words,
580 | \textit{any} vector $v \in \mathbb{R}^n$ can be written as a linear
581 | combination of $x_1$ through $x_n$. The
582 | \textbf{\textit{projection}} of a vector $y \in \mathbb{R}^m$ onto the
583 | span of $\{x_1, \ldots, x_n\}$ (here we assume
584 | $x_i \in \mathbb{R}^m$) is the
585 | vector $v \in \mathrm{span}(\{x_1, \ldots x_n\})$, such that $v$ is as
586 | close as possible to $y$,
587 | as measured by the Euclidean norm $\|v - y\|_2$. We denote the
588 | projection as $\mathrm{Proj}(y;\{x_1, \ldots, x_n\})$ and can define
589 | it formally as,
590 | \[\mathrm{Proj}(y;\{x_1, \ldots x_n\}) = \textrm{argmin}_{v \in
591 | \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2.\]
592 |
593 | The \textbf{\textit{range}} (sometimes also called the columnspace)
594 | of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$,
595 | is the span of the columns of $A$. In other words,
596 | \[\mathcal{R}(A) = \{v \in \mathbb{R}^m: v = Ax, x \in
597 | \mathbb{R}^{n}\}.\] Making a
598 | few technical assumptions (namely that $A$ is full rank and that $n <
599 | m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of
600 | $A$ is given by,
601 | \[\mathrm{Proj}(y;A) = \mathrm{argmin}_{v \in \mathcal{R}(A)} \|v -
602 | y\|_2 = A (A^T A)^{-1}A^T y\;\;.\] This last equation should look
603 | extremely familiar, since it is almost the same formula we derived in
604 | class (and which we will soon derive again)
605 | for the least squares estimation of parameters. Looking at the
606 | definition for the projection, it should not be too hard to convince
607 | yourself that this is in fact the same objective that we minimized in
608 | our least squares problem (except for a squaring of the norm, which
609 | doesn't affect the optimal point) and so these problems are naturally
610 | very connected. When $A$ contains only a single column, $a \in
611 | \mathbb{R}^m$, this gives the special case for a projection of a
612 | vector onto a line:
613 | \[\mathrm{Proj}(y;a) = \frac{a a^T}{a^T a}y\;\;.\]
614 |
615 | The \textbf{\textit{nullspace}} of a matrix $A \in \mathbb{R}^{m
616 | \times n}$, denoted $\mathcal{N}(A)$, is the set of all
617 | vectors that equal 0 when multiplied by $A$, i.e.,
618 | \[\mathcal{N}(A) = \{x \in \mathbb{R}^n : Ax = 0\}.\] Note that
619 | vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in
620 | $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$
621 | and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can
622 | say much more.
It turns out that 623 | \[\left \{ w : w = u + v, u \in \mathcal{R}(A^T), v \in \mathcal{N}(A) 624 | \right \} = \mathbb{R}^n \mbox{ and 625 | } \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{\mathbf{0}\}\;\;.\] 626 | In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint 627 | subsets that together span the entire space of $\mathbb{R}^n$. Sets 628 | of this type are called \textbf{\textit{orthogonal complements}}, and 629 | we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^\perp$. 630 | 631 | \subsection{The Determinant} 632 | The \textbf{\textit{determinant}} of a square matrix $A \in 633 | \mathbb{R}^{n \times n}$, is a function $\mathrm{det} : \mathbb{R}^{n 634 | \times n} \rightarrow \mathbb{R}$, and is denoted $|A|$ or $\mathrm{det}\,A$ 635 | (like the trace operator, we usually omit parentheses). 636 | Algebraically, 637 | one could write down an explicit formula for the determinant of $A$, but 638 | this unfortunately gives little intuition about its meaning. Instead, 639 | we'll start out by providing a geometric interpretation of the determinant 640 | and then visit some of its specific algebraic properties afterwards. 641 | 642 | Given a matrix 643 | \[ \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & 644 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 645 | \mbox{---} & a^T_n & \mbox{---} \end{array} \right ], \] 646 | consider the set of points $S \subset \mathbb{R}^n$ formed by taking 647 | all possible linear combinations of the row vectors $a_1, \ldots, a_n \in \mathbb{R}^n$ of $A$, 648 | where the coefficients of the linear combination are all between 0 and 1; 649 | that is, the set $S$ is the restriction of $\textrm{span}(\{a_1,\ldots,a_n\})$ 650 | to only those linear combinations whose coefficients $\alpha_1, \ldots, \alpha_n$ 651 | satisfy $0 \le \alpha_i \le 1$, $i=1,\ldots,n$. Formally, 652 | \[ S = \{ v \in \mathbb{R}^n : v = \sum_{i=1}^n \alpha_i a_i \mbox{ where } 0 \le \alpha_i \le 1, i=1,\ldots,n \}. \] 653 | The absolute value of the determinant of $A$, it turns out, is a measure of the ``volume'' of the set $S$.\footnote{ 654 | Admittedly, we have not actually defined what we mean by ``volume'' here, but hopefully the intuition should be clear enough. 655 | When $n=2$, our notion of ``volume'' corresponds to the area of $S$ in the Cartesian plane. When $n=3$, ``volume'' corresponds 656 | with our usual notion of volume for a three-dimensional object. 657 | } 658 | 659 | For example, consider the $2 \times 2$ matrix, 660 | \begin{equation} 661 | A = \left[ \begin{array}{cc} 1 & 3 \\ 3 & 2 \end{array} \right]. \label{eq:det-example} 662 | \end{equation} 663 | Here, the rows of the matrix are 664 | \[ a_1 = \left[ \begin{array}{c} 1 \\ 3 \end{array} \right] \quad a_2 = \left[ \begin{array}{c} 3 \\ 2 \end{array} \right]. \] 665 | The set $S$ corresponding to these rows is shown in Figure~\ref{fig:determinant}. For two-dimensional matrices, 666 | $S$ generally has the shape of a \emph{parallelogram}. In our example, the value of the determinant is $|A| = -7$ (as can be computed 667 | using the formulas shown later in this section), so the area of the parallelogram is $7$. (Verify this for yourself!) 668 | 669 | In three dimensions, the set $S$ corresponds to an object known as a \emph{parallelepiped} (a three-dimensional box with skewed sides, such that 670 | every face has the shape of a parallelogram). 
The absolute value of the determinant of the $3 \times 3$ matrix whose rows define $S$ gives
671 | the three-dimensional volume of the parallelepiped. In even higher dimensions, the set $S$ is an object known as an $n$-dimensional \emph{parallelotope}.
672 | %
673 |
674 | %\begin{figure}
675 | % \vskip 5.0cm
676 | % \psline[linewidth=0.05]{->}(6,0)(11,0)
677 | % \psline[linewidth=0.05]{->}(6,0)(6,5)
678 | % \pspolygon[linewidth=0.05,fillstyle=solid,fillcolor=lightgray](6,0)(7,3)(10,5)(9,2)
679 | % \rput(6.3,1.9){$a_1$}
680 | % \rput(8,0.8){$a_2$}
681 | % \psline[linewidth=0.1]{->}(6,0)(7,3)
682 | % \psline[linewidth=0.1]{->}(6,0)(9,2)
683 | % \rput(6.5,3.2){$(1,3)$}
684 | % \rput(9.5,1.9){$(3,2)$}
685 | % \rput(10.2,5.4){$(4,5)$}
686 | % \rput(5.4,0){$(0,0)$}
687 | % \caption{
688 | % Illustration of the determinant for the $2 \times 2$ matrix $A$ given in \eqref{eq:det-example}. Here, $a_1$ and $a_2$ are vectors
689 | % corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The
690 | % absolute value of the determinant, $|\textrm{det} A| = 7$, is the area of the parallelogram.
691 | % }
692 | % \label{fig:determinant}
693 | %\end{figure}
694 |
695 | \begin{figure}[t]
696 | \begin{center}
697 | \includegraphics[width=0.6\textwidth]{figures/figure}
698 | \caption{
699 | Illustration of the determinant for the $2 \times 2$ matrix $A$ given in \eqref{eq:det-example}. Here, $a_1$ and $a_2$ are vectors
700 | corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The
701 | absolute value of the determinant, $|\textrm{det} A| = 7$, is the area of the parallelogram.
702 | } \label{fig:determinant}
703 | \end{center}
704 |
705 | \end{figure}
706 |
707 | Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the
708 | general formula):
709 | \begin{enumerate}
710 | \item The determinant of the identity is 1, $|I| = 1$. (Geometrically, the volume of a unit hypercube is 1).
711 | \item Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a
712 | single row in $A$ by a scalar $t \in \mathbb{R}$, then the
713 | determinant of the new matrix is $t |A|$,
714 | \[\left | \left [ \begin{array}{ccc} \mbox{---} & t \; a^T_1 &
715 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\
716 | \mbox{---} & a^T_n & \mbox{---} \end{array} \right ] \right |= t
717 | |A|. \]
718 | (Geometrically, multiplying one of the sides of the set $S$ by a factor $t$
719 | causes the volume to increase by a factor $t$.)
720 | \item If we exchange any two rows $a_i^T$ and $a_j^T$ of $A$, then the
721 | determinant of the new matrix is $-|A|$, for example
722 | \[\left | \left [ \begin{array}{ccc} \mbox{---} & a^T_2 &
723 | \mbox{---} \\ \mbox{---} & a^T_1 & \mbox{---} \\ & \vdots & \\
724 | \mbox{---} & a^T_n & \mbox{---} \end{array} \right ] \right |=
725 | -|A|. \]
726 | \end{enumerate}
727 | In case you are wondering, it is not immediately obvious that a function satisfying
728 | the above three properties exists. In fact, though, such a function does exist,
729 | and is unique (which we will not prove here).
730 |
731 | Several properties that follow from the three properties above include:
732 | \begin{itemize}
733 | \item For $A \in \mathbb{R}^{n \times n}$, $|A| = |A^T|$.
734 | \item For $A, B \in \mathbb{R}^{n \times n}$, $|AB| = |A||B|$.
735 | \item For $A \in \mathbb{R}^{n \times n}$, $|A| = 0$ if and only if 736 | $A$ is singular (i.e., non-invertible). 737 | (If $A$ is singular then it does not have full rank, and hence 738 | its columns are linearly dependent. In this case, the set $S$ corresponds to a 739 | ``flat sheet'' within the $n$-dimensional space and hence has zero volume.) 740 | \item For $A \in \mathbb{R}^{n \times n}$ and $A$ non-singular, 741 | $|A^{-1}| = 1/|A|$. 742 | \end{itemize} 743 | 744 | Before giving the general definition for the determinant, we define, 745 | for $A \in \mathbb{R}^{n \times n}$, $A_{\setminus i,\setminus j} \in 746 | \mathbb{R}^{(n-1) \times (n-1)}$ to be the \textit{matrix} that 747 | results from deleting the $i$th row and $j$th column from $A$. The 748 | general (recursive) formula for the determinant is 749 | \begin{eqnarray*} 750 | |A| & = & \sum_{i=1}^n (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus 751 | j}| \;\;\;\;\;\mbox{(for any $j \in 1,\ldots, n$)} \\ 752 | & = & \sum_{j=1}^n (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}| 753 | \;\;\;\;\;\mbox{(for any $i \in 1,\ldots, n$)} 754 | \end{eqnarray*} 755 | with the initial case that $|A| = a_{11}$ for $ A \in 756 | \mathbb{R}^{1 \times 1}$. If we were to expand this formula 757 | completely for $A \in \mathbb{R}^{n \times n}$, there would be a total 758 | of $n!$ ($n$ factorial) different terms. For this reason, we hardly 759 | ever explicitly write the complete equation of the determinant for 760 | matrices bigger than $3 \times 3$. However, the equations for 761 | determinants of matrices up to size $3 \times 3$ are fairly common, 762 | and it is good to know them: 763 | \begin{eqnarray*} 764 | \left | [a_{11}] \right | & = & a_{11} \\ 765 | \left | \left [ \begin{array}{cc} a_{11} & a_{12} \\ a_{21} & a_{22} 766 | \end{array} \right ] \right | & = & a_{11} a_{22} - a_{12} a_{21} \\ 767 | \left | \left [ \begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} 768 | & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{array} \right ] 769 | \right | & = &\begin{array}{l} a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + 770 | a_{13} a_{21} a_{32} \\ \;\;\;\;\; - a_{11} a_{23} a_{32} - a_{12} 771 | a_{21} a_{33} - a_{13} a_{22} a_{31} \end{array} 772 | \end{eqnarray*} 773 | 774 | The \textbf{\textit{classical adjoint}} (often just called the 775 | adjoint) of a matrix $A \in \mathbb{R}^{n \times n}$, is denoted 776 | $\mathrm{adj}(A)$, and defined as 777 | \[\mathrm{adj}(A) \in \mathbb{R}^{n \times n}, \;\;\; 778 | (\mathrm{adj}(A))_{ij} = (-1)^{i+j} |A_{\setminus j, \setminus 779 | i}|\;\;\] 780 | (note the switch in the indices $A_{\setminus j, \setminus i}$). It 781 | can be shown that for any nonsingular $A \in \mathbb{R}^{n \times 782 | n}$, 783 | \[A^{-1} = \frac{1}{|A|}\mathrm{adj}(A)\;\;.\] 784 | While this is a nice ``explicit'' formula for the inverse of matrix, 785 | we should note that, numerically, there are in fact much more 786 | efficient ways of computing the inverse. 787 | 788 | \subsection{Quadratic Forms and Positive Semidefinite Matrices} 789 | Given a square matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x 790 | \in \mathbb{R}^n$, the scalar value $x^T A x$ is called a 791 | \textbf{\textit{quadratic form}}. 
Written explicitly, we see that
792 | \[x^T A x = \sum_{i=1}^n x_i (Ax)_i = \sum_{i=1}^n x_i \left(\sum_{j=1}^n A_{ij} x_j\right) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j\;\;.\]
793 | Note that,
794 | \[x^T A x = (x^T A x)^T = x^T A^T x = x^T\left(\frac{1}{2} A +
795 | \frac{1}{2}A^T\right)x,\]
796 | where the first equality follows from the fact that the transpose
797 | of a scalar is equal to itself, and the third equality follows
798 | from the fact that we are averaging two quantities which are themselves equal.
799 | From this, we can conclude that
800 | only the symmetric part of $A$ contributes to the quadratic
801 | form. For this reason, we often implicitly assume that the matrices
802 | appearing in a quadratic form are symmetric.
803 |
804 | We give the following definitions:
805 | \begin{itemize}
806 | \item A symmetric matrix $A \in \mathbb{S}^n$ is
807 | \textbf{\textit{positive definite}} (PD) if for all non-zero
808 | vectors $x \in \mathbb{R}^n$, $x^T A x > 0$. This is usually
809 | denoted $A \succ 0$ (or just $A > 0$), and oftentimes the set of
810 | all positive definite matrices is denoted $\mathbb{S}^n_{++}$.
811 |
812 |
813 | \item A symmetric matrix $A \in \mathbb{S}^n$ is
814 | \textbf{\textit{positive semidefinite}} (PSD) if for all vectors
815 | $x \in \mathbb{R}^n$, $x^T A x \geq 0$. This is written $A \succeq 0$ (or just $A \geq
816 | 0$), and the set of all positive semidefinite matrices is often
817 | denoted $\mathbb{S}^n_+$.
818 |
819 | \item Likewise, a symmetric matrix $A \in
820 | \mathbb{S}^n$ is \textbf{\textit{negative definite}} (ND), denoted $A
821 | \prec 0$ (or just $A < 0$) if for all non-zero $x \in
822 | \mathbb{R}^n$, $x^T A x < 0$.
823 |
824 | \item Similarly, a symmetric matrix $A \in \mathbb{S}^n$ is
825 | \textbf{\textit{negative semidefinite}} (NSD), denoted $A \preceq
826 | 0$ (or just $A \leq 0$) if for all $x \in \mathbb{R}^n$, $x^T A x
827 | \leq 0$.
828 |
829 | \item Finally, a symmetric matrix $A \in
830 | \mathbb{S}^n$ is \textbf{\textit{indefinite}}, if it is neither
831 | positive semidefinite nor negative semidefinite --- i.e., if there
832 | exist $x_1, x_2 \in \mathbb{R}^n$ such that $x_1^T A x_1 > 0$ and
833 | $x_2^T A x_2 < 0$.
834 |
835 | \end{itemize}
836 |
837 | It should be obvious that if $A$ is positive definite, then $-A$ is
838 | negative definite and vice versa. Likewise, if $A$ is positive
839 | semidefinite then $-A$ is negative semidefinite and vice versa. If
840 | $A$ is indefinite, then so is $-A$.
841 |
842 | One important property of positive definite and negative definite matrices
843 | is that they are always full rank, and hence invertible. To see why
844 | this is the case, suppose that some matrix $A \in \mathbb{R}^{n \times n}$
845 | is not full rank. Then, suppose that the $j$th column of $A$ is expressible
846 | as a linear combination of the other $n-1$ columns:
847 | \[ a_j = \sum_{i \neq j} x_i a_i, \]
848 | for some $x_1,\ldots,x_{j-1}, x_{j+1}, \ldots,x_{n} \in \mathbb{R}$. Setting $x_j = -1$, we have
849 | \[ Ax = \sum_{i=1}^n x_i a_i = 0. \]
850 | But this implies $x^T Ax = 0$ for some non-zero vector $x$, so $A$ must be
851 | neither positive definite nor negative definite. Therefore, if $A$ is
852 | either positive definite or negative definite, it must be full rank.
853 |
854 | Finally, there is one type of positive definite matrix that comes up
855 | frequently, and so deserves some special mention.
Given any matrix $A
856 | \in \mathbb{R}^{m \times n}$ (not necessarily symmetric or even
857 | square), the matrix $G = A^T A$ (sometimes called a
858 | \textbf{\textit{Gram matrix}}) is always positive semidefinite.
859 | Further, if $m \geq n$ (and we assume for convenience that $A$ is full
860 | rank), then $G = A^T A$ is positive definite.
861 |
862 | \subsection{Eigenvalues and Eigenvectors}
863 |
864 | Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that
865 | $\lambda \in \mathbb{C}$ is an \textbf{\textit{eigenvalue}} of $A$ and
866 | $x \in \mathbb{C}^n$ is the corresponding
867 | \textbf{\textit{eigenvector}}\footnote{Note that $\lambda$ and the
868 | entries of $x$ are actually in
869 | $\mathbb{C}$, the set of complex numbers, not just the reals; we
870 | will see shortly why this is necessary. Don't worry about this
871 | technicality for now; you can think of complex vectors in the same way
872 | as real vectors.} if
873 | \[Ax = \lambda x, \;\;\; x \neq 0. \]
874 | Intuitively, this definition
875 | means that multiplying $A$ by the vector $x$ results in a new vector
876 | that points in the same direction as $x$, but scaled by a factor
877 | $\lambda$. Also note that for any eigenvector $x \in \mathbb{C}^n$,
878 | and scalar $c \in \mathbb{C}$,
879 | $A(cx) = cAx = c \lambda x = \lambda(cx)$, so $cx$ is also an
880 | eigenvector. For this reason when we talk about ``the'' eigenvector
881 | associated with $\lambda$, we usually assume that the eigenvector is
882 | normalized to have length 1 (this still creates some ambiguity, since
883 | $x$ and $-x$ will both be eigenvectors, but we will have to live with
884 | this).
885 |
886 | We can rewrite the equation above to state that $(\lambda, x)$ is an
887 | eigenvalue-eigenvector pair of $A$ if,
888 | \[(\lambda I - A)x = 0, \;\;\; x \neq 0.\]
889 | But $(\lambda I - A)x = 0$ has a non-zero solution for $x$ if and only
890 | if $(\lambda I - A)$ has a non-trivial nullspace, which is only the case
891 | if $(\lambda I - A)$ is singular, i.e.,
892 | \[|(\lambda I - A)| = 0.\]
893 |
894 |
895 | We can now use the previous definition of the determinant to expand
896 | this expression $|(\lambda I - A)|$ into a (very large) polynomial in $\lambda$
897 | of degree $n$. This polynomial is often called the characteristic polynomial of the matrix $A$.
898 |
899 | We then find the $n$
900 | (possibly complex) roots of this characteristic polynomial and denote them by $\lambda_1, \ldots, \lambda_n$. These are all the eigenvalues of the matrix $A$, but we note that they may not be distinct.
901 | To find the eigenvector
902 | corresponding to the eigenvalue $\lambda_i$, we simply solve the
903 | linear equation $(\lambda_i I - A)x = 0$, which is guaranteed to have a non-zero solution because $\lambda_i I-A$ is singular (though there may be multiple or infinitely many solutions).
904 |
905 |
906 |
907 |
908 | It should be noted that
909 | this is not the method which is actually used in practice to
910 | numerically compute the eigenvalues and eigenvectors (remember that
911 | the complete expansion of the determinant has $n!$ terms); it is
912 | rather a mathematical argument.
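In practice, then, one simply calls a numerical eigensolver. The sketch below (illustrative only; \texttt{numpy.linalg.eig} stands in for whatever routine your environment provides) recovers eigenvalue/eigenvector pairs of a random matrix, checks $Ax = \lambda x$, and also verifies that a Gram matrix $A^TA$ has non-negative eigenvalues, as claimed in the previous subsection:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

# A numerical eigensolver returns (possibly complex) eigenvalue/eigenvector pairs.
lam, X = np.linalg.eig(A)
for i in range(4):
    assert np.allclose(A @ X[:, i], lam[i] * X[:, i])   # A x = lambda x

# The Gram matrix G = A^T A from the previous subsection is positive
# semidefinite: its eigenvalues are real and non-negative.
G = A.T @ A
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)
\end{verbatim}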
913 |
914 | The following are properties of
915 | eigenvalues and eigenvectors (in all cases assume $A \in \mathbb{R}^{n
916 | \times n}$ has eigenvalues $\lambda_1, \ldots, \lambda_n$):
917 | \begin{itemize}
918 | \item The trace of $A$ is equal to the sum of its eigenvalues,
919 | \[\mathrm{tr}A = \sum_{i=1}^n \lambda_i.\]
920 | \item The determinant of $A$ is equal to the product of its
921 | eigenvalues,
922 | \[|A| = \prod_{i=1}^n \lambda_i.\]
923 | \item The rank of $A$ is equal to the number of non-zero eigenvalues
924 | of $A$ (counted with multiplicity), provided that $A$ is diagonalizable; in particular, this holds for the symmetric matrices discussed below.
925 | \item Suppose $A$ is non-singular with eigenvalue $\lambda$ and an associated eigenvector $x$. Then $1/\lambda$ is an eigenvalue of
926 | $A^{-1}$ with an associated eigenvector $x$, i.e., $A^{-1}x =
927 | (1/\lambda)x$. (To prove this, take the eigenvector equation
928 | $A x = \lambda x$ and left-multiply each side by $A^{-1}$.)
929 | \item The eigenvalues of a diagonal matrix $D = \mathrm{diag}(d_1,
930 | \ldots, d_n)$ are just the diagonal entries $d_1, \ldots, d_n$.
931 | \end{itemize}
932 | %We can write all the eigenvector equations simultaneously as
933 | %\[AX = X\Lambda\]
934 | %where the columns of $X \in \mathbb{R}^{n \times n}$ are the
935 | %eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose entries
936 | %are the eigenvalues of $A$, i.e.,
937 | %\[X \in \mathbb{R}^{n \times n} = \left [\begin{array}{cccc} | & | &
938 | % & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{array}
939 | % \right ], \;\; \Lambda =
940 | % \mathrm{diag}(\lambda_1, \ldots, \lambda_n).\]
941 | %If the eigenvectors of $A$ are linearly independent, then the matrix
942 | %$X$ will be invertible, so $A = X\Lambda X^{-1}$. A matrix that can
943 | %be written in this form is called \textbf{\textit{diagonalizable}}.
944 |
945 |
946 | \subsection{Eigenvalues and Eigenvectors of Symmetric Matrices}
947 |
948 |
949 | The structure of the eigenvalues and eigenvectors of a general square matrix can be subtle to characterize. Fortunately, in most cases in machine learning, it suffices to deal with symmetric real matrices, whose eigenvalues and eigenvectors have remarkable properties.
950 |
951 |
952 | %Two remarkable properties come about when we look at the eigenvalues
953 | %and eigenvectors of a symmetric matrix $A \in \mathbb{S}^n$.
954 |
955 |
956 | Throughout this section, let's assume that $A$ is a symmetric real matrix. We have the following properties:
957 |
958 | \begin{itemize}
959 | \item[1.] All eigenvalues of $A$ are real numbers. We denote them by $\lambda_1,\dots, \lambda_n$.
960 | \item[2.] There exists a set of eigenvectors $u_1,\dots, u_n$ such that a) for all $i$, $u_i$ is an eigenvector with eigenvalue $\lambda_i$ and b) $u_1,\dots, u_n$ are unit vectors and orthogonal to each other.\footnote{Mathematically, we have $\forall i, Au_i = \lambda_i u_i$, $\|u_i\|_2=1$, and $\forall j\neq i, u_i^Tu_j=0$.
Moreover, we remark that it is not true that every choice of eigenvectors $u_1,\dots, u_n$ of a matrix $A$ satisfying a) is orthogonal: eigenvalues can be repeated, and the eigenvectors associated with a repeated eigenvalue need not be orthogonal to each other.}
961 | \end{itemize}
962 |
963 | Let $U$ be the orthonormal matrix that contains the $u_i$'s as columns:\footnote{Here for notational simplicity, we deviate from the notational convention for columns of matrices in the previous sections.}
964 |
965 | \begin{align}
966 | U = \left [
967 | \begin{array}{cccc} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | &
968 | | & & | \end{array} \right ]
969 | \end{align}
970 |
971 | Let $\Lambda =
972 | \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ be the diagonal matrix that contains $\lambda_1,\dots, \lambda_n$ as entries on the diagonal. Using the view of matrix-matrix multiplication in equation~\eqref{eqn:1} of Section~\ref{subsec:matrix-matrix}, we can verify that
973 | \begin{align}
974 | AU & = \left [
975 | \begin{array}{cccc} | & | & & | \\A u_1 & Au_2 & \cdots & Au_n \\ | &
976 | | & & | \end{array} \right ] = \left [
977 | \begin{array}{cccc} | & | & & | \\ \lambda_1 u_1 & \lambda_2 u_2 & \cdots & \lambda_n u_n \\ | &
978 | | & & | \end{array} \right ] = U\mathrm{diag}(\lambda_1, \ldots, \lambda_n) = U\Lambda \nonumber
979 | \end{align}
980 |
981 | Recalling that the orthonormal matrix $U$ satisfies $UU^T = I$ and using the equation above, we have
982 | \begin{align}
983 | A = AUU^T = U\Lambda U^T \label{eqn:3}
984 | \end{align}
985 |
986 | This new representation of $A$ as $U\Lambda U^T$ is often called the diagonalization of the matrix $A$. The term diagonalization comes from the fact that with such a representation, we can often effectively treat a symmetric matrix $A$ as a diagonal matrix --- which is much easier to understand --- w.r.t the basis defined by the eigenvectors $U$. We will elaborate on this below with several examples.
987 |
988 | \newcommand{\R}{\mathbb{R}}
989 | \paragraph{Background: representing a vector w.r.t. another basis.} Any orthonormal matrix $U = \left [
990 | \begin{array}{cccc} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | &
991 | | & & | \end{array} \right ] $ defines a new basis (coordinate system) of $\mathbb{R}^n$ in the following sense: any vector $x\in \mathbb{R}^n$ can be represented as a linear combination of $u_1,\dots, u_n$ with coefficients $\hat{x}_1,\dots, \hat{x}_n$:
992 | \begin{align}
993 | x = \hat{x}_1 u_1 + \dots + \hat{x}_n u_n = U\hat{x} \nonumber
994 | \end{align}
995 | where in the second equality we use the view of equation~\eqref{eqn:2}. Indeed, such an $\hat{x}$ exists and is unique:
996 | \begin{align}
997 | x = U\hat{x} \Leftrightarrow U^T x = \hat{x} \nonumber
998 | \end{align}
999 | In other words, the vector $\hat{x}= U^T x$ can serve as another representation of the vector $x$ w.r.t the basis defined by $U$.
1000 |
1001 |
1002 | \paragraph{``Diagonalizing'' matrix-vector multiplication.} With the setup above, we will see that left-multiplying by the matrix $A$ can be viewed as left-multiplying by a diagonal matrix w.r.t.\ the basis of the eigenvectors. Suppose $x$ is a vector and $\hat{x}$ is its representation w.r.t.\ the basis of $U$. Let $z = Ax$ be the matrix-vector product.
Now let's compute the representation of $z$ w.r.t.\ the basis of $U$.
1003 |
1004 | Then, again using the fact that $UU^T = U^TU = I$ and equation~\eqref{eqn:3}, we have that
1005 | \begin{align}
1006 | \hat{z} = U^T z = U^T Ax = U^T U\Lambda U^T x = \Lambda \hat{x} = \left [ \begin{array}{c} \lambda_1 \hat{x}_1 \\ \lambda_2 \hat{x}_2 \\ \vdots \\ \lambda_n \hat{x}_n \end{array} \right ]\nonumber
1007 | \end{align}
1008 | We see that left-multiplying by the matrix $A$ in the original space is equivalent to left-multiplying by the diagonal matrix $\Lambda$ w.r.t.\ the new basis, which is merely scaling each coordinate by the corresponding eigenvalue. %One of the key features here is that under the new basis, all the coordinates are scaled independently.
1009 |
1010 | Under the new basis, multiplying by a matrix multiple times becomes much simpler as well. For example, suppose $q = AAAx$. Deriving the analytical form of $q$ in terms of the entries of $A$ may be a nightmare under the original basis, but it is much easier under the new one:
1011 | \begin{align}
1012 | \hat{q} = U^T q = U^TAAAx = U^TU\Lambda U^TU\Lambda U^TU\Lambda U^Tx = \Lambda^3 \hat{x} = \left [ \begin{array}{c} \lambda_1^3 \hat{x}_1 \\ \lambda_2^3 \hat{x}_2 \\ \vdots \\ \lambda_n^3 \hat{x}_n \end{array} \right ]
1013 | \end{align}
1014 |
1015 |
1016 |
1017 |
1018 |
1019 |
1020 |
1021 |
1022 | %First, it can be shown that all the eigenvalues of $A$ are real.
1023 |
1024 |
1025 | %Secondly,
1026 | %the eigenvectors of $A$ are orthonormal, i.e., the matrix $X$ defined
1027 | %$above is an orthogonal matrix (for this reason, we denote the matrix
1028 | %f eigenvectors as $U$ in this case).
1029 |
1030 | %
1031 | %We can therefore represent $A$
1032 | %as $A = U\Lambda U^T$, remembering from above that the inverse of an
1033 | %orthogonal matrix is just its transpose.
1034 |
1035 | \paragraph{``Diagonalizing'' quadratic form.} As a direct corollary, the quadratic form $x^TAx$ can also be simplified under the new basis:
1036 | \begin{align}
1037 | x^TAx = x^TU\Lambda U^T x = \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \label{eqn:diag-quadratic}
1038 | \end{align}
1039 | (Recall that with the old representation, $x^TAx = \sum_{i,j=1}^{n} x_ix_jA_{ij}$ involves a sum of $n^2$ terms instead of $n$ terms in the equation above.) With this viewpoint, we can also show that the definiteness of the matrix $A$ depends
1040 | entirely on the sign of its eigenvalues:
1041 | \begin{itemize}
1042 | \item[1.] If all $\lambda_i > 0$, then the matrix $A$ is positive definite because $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 > 0$ for any $\hat{x}\neq 0$.\footnote{Note that $\hat{x}\neq 0\Leftrightarrow x\neq 0$.}
1043 | \item[2.] If all $\lambda_i \geq 0$, it is positive semidefinite because $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \ge 0$ for all $\hat{x}$.
1044 | \item[3.] Likewise, if all $\lambda_i < 0$ or $\lambda_i \leq 0$, then $A$ is
1045 | negative definite or negative semidefinite respectively.
1046 | \item[4.] Finally, if
1047 | $A$ has both positive and negative eigenvalues, say $\lambda_i > 0$ and $\lambda_j < 0$, then it is indefinite. This is because if we let $\hat{x}$ satisfy $\hat{x}_i = 1 $ and $\hat{x}_k =0, \forall k\neq i$, then $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 > 0$. Similarly, we can let $\hat{x}$ satisfy $\hat{x}_j = 1 $ and $\hat{x}_k =0, \forall k\neq j$, then $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 < 0$.
\footnote{Note that $x = U\hat{x}$ and therefore constructing $\hat{x}$ gives an implicit construction of $x$. } 1048 | \end{itemize} 1049 | %Suppose $A \in \mathbb{S}^n 1050 | %= U \Lambda U^T$. 1051 | % 1052 | %Then 1053 | %\[x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^n 1054 | %%\lambda_i y_i^2\] 1055 | %where $y = U^T x$ (and since $U$ is full rank, any vector $y \in 1056 | %\mathbb{R}^n$ can be represented in this form). Because $y_i^2$ is 1057 | %always positive, the sign of this expression depends entirely on the 1058 | %$\lambda_i$'s. If all $\lambda_i > 0$, then the matrix is positive 1059 | %definite; if all $\lambda_i \geq 0$, it is positive semidefinite. 1060 | %Likewise, if all $\lambda_i < 0$ or $\lambda_i \leq 0$, then $A$ is 1061 | %negative definite or negative semidefinite respectively. Finally, if 1062 | %$A$ has both positive and negative eigenvalues, it is indefinite. 1063 | 1064 | An application where eigenvalues and eigenvectors come up frequently 1065 | is in maximizing some function of a matrix. In particular, for a 1066 | matrix $A \in \mathbb{S}^n$, consider the following maximization 1067 | problem, 1068 | \begin{align}\mathrm{max}_{x \in \mathbb{R}^n} \;\; x^T A x = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \;\;\;\;\; 1069 | \mbox{subject to } \|x\|_2^2 = 1\label{eqn:4} 1070 | \end{align} 1071 | i.e., we want to find the vector (of norm 1) which maximizes the 1072 | quadratic form. Assuming the eigenvalues are ordered as $\lambda_1 1073 | \geq \lambda_2 \geq \ldots \geq \lambda_n$, the optimal value of this optimization problem is $\lambda_1$ and any eigenvector $u_1$ corresponding to $\lambda_1$ is one of the maximizers. (If $\lambda_1 > \lambda_2$, then there is a unique eigenvector corresponding to eigenvalue $\lambda_1$, which is the unique maximizer of the optimization problem~\eqref{eqn:4}.) 1074 | 1075 | We can show this by using the diagonalization technique: Note that $\|x\|_2 = \|\hat{x}\|_2$ by equation~\eqref{eqn:preserve-norm}, and using equation~\eqref{eqn:diag-quadratic}, we can rewrite the optimization~\eqref{eqn:4} as 1076 | \begin{align} 1077 | \mathrm{max}_{\hat{x} \in \mathbb{R}^n} \;\; \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \;\;\;\;\; 1078 | \mbox{subject to } \|\hat{x}\|_2^2 = 1\label{eqn:5} 1079 | \end{align} 1080 | 1081 | %the optimal $x$ for this 1082 | %optimization problem is $x_1$, the eigenvector corresponding to 1083 | %$\lambda_1$. In this case the maximal value of the quadratic form is 1084 | %$\lambda_1$. Similarly, the optimal solution to the minimization 1085 | %%problem, 1086 | %\[\mathrm{min}_{x \in \mathbb{R}^n} \;\; x^T A x \;\;\;\;\; 1087 | %\mbox{subject to } \|x\|_2^2 = 1\] 1088 | 1089 | Then, we have that the objective is upper bounded by $\lambda_1$: 1090 | \begin{align} 1091 | \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \le \sum_{i=1}^n \lambda_1 \hat{x}_i^2 = \lambda_1 1092 | \end{align} 1093 | Moreover, setting $\hat{x} = \left [ \begin{array}{c} 1 \\ 0\\ \vdots \\ 0 \end{array} \right ]$ achieves the equality in the equation above, and this corresponds to setting $x = u_1$. 1094 | 1095 | % 1096 | %is $x_n$, the eigenvector corresponding to $\lambda_n$, and the 1097 | %minimal value is $\lambda_n$. This can be proved by appealing to the 1098 | %eigenvector-eigenvalue form of $A$ and the properties of orthogonal 1099 | %matrices. However, in the next section we will see a way of showing 1100 | %it directly using matrix calculus. 
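Before moving on, here is a compact NumPy sketch (illustrative only; \texttt{numpy.linalg.eigh} plays the role of the decomposition $A = U\Lambda U^T$ for symmetric matrices) that checks the diagonalization and the fact that the top eigenvector attains the maximum of the constrained quadratic form in~\eqref{eqn:4}:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = 0.5 * (B + B.T)                 # a random symmetric matrix

lam, U = np.linalg.eigh(A)          # real eigenvalues (ascending), orthonormal columns
assert np.allclose(U @ np.diag(lam) @ U.T, A)   # A = U Lambda U^T
assert np.allclose(U.T @ U, np.eye(5))          # U^T U = I

# x^T A x <= lambda_max for any unit vector x, with equality at the top eigenvector.
x = rng.standard_normal(5)
x = x / np.linalg.norm(x)
u_top = U[:, -1]
assert x @ A @ x <= lam[-1] + 1e-10
assert np.isclose(u_top @ A @ u_top, lam[-1])
\end{verbatim}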
1101 | 1102 | 1103 | \section{Matrix Calculus} 1104 | While the topics in the previous sections are typically covered in a 1105 | standard course on linear algebra, one topic that does not seem to be 1106 | covered very often (and which we will use extensively) is the 1107 | extension of calculus to the vector setting. Despite the fact that 1108 | all the actual calculus we use is relatively trivial, the notation can 1109 | often make things look much more difficult than they are. In this 1110 | section we present some basic definitions of matrix calculus and 1111 | provide a few examples. 1112 | 1113 | \subsection{The Gradient} 1114 | Suppose that $f:\mathbb{R}^{m \times n} \rightarrow \mathbb{R}$ is a 1115 | function that takes as input a matrix $A$ of size $m \times n$ and 1116 | returns a real value. Then the \textbf{\textit{gradient}} of $f$ 1117 | (with respect to $A \in \mathbb{R}^{m \times n}$) is the matrix of 1118 | partial derivatives, defined as: 1119 | \[\nabla_A f(A) \in \mathbb{R}^{m \times n} = \left [ 1120 | \begin{array}{cccc} \frac{\partial f(A)}{\partial A_{11}} & 1121 | \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial 1122 | f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & 1123 | \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial 1124 | f(A)}{\partial A_{2n}} \\ 1125 | \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial 1126 | A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & 1127 | \frac{\partial f(A)}{\partial A_{mn}} \end{array} \right ] \] 1128 | i.e., an $m \times n$ matrix with \[(\nabla_A f(A))_{ij} = 1129 | \frac{\partial f(A)}{\partial A_{ij}}.\] 1130 | Note that the size of $\nabla_A f(A)$ is always the 1131 | same as the size of $A$. So if, in particular, $A$ is just a vector 1132 | $x \in \mathbb{R}^n$, 1133 | \[\nabla_x f(x) = \left [ \begin{array}{c} \frac{\partial 1134 | f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots 1135 | \\ \frac{\partial f(x)}{\partial x_n}\end{array} \right ].\] 1136 | It is very important to remember that the gradient of a function is 1137 | \textit{only} defined if the function is real-valued, that is, if it 1138 | returns a scalar value. We can not, for example, take the gradient of 1139 | $Ax, A \in \mathbb{R}^{n \times n}$ with respect to $x$, since this 1140 | quantity is vector-valued. 1141 | 1142 | It follows directly from the equivalent properties of partial 1143 | derivatives that: 1144 | \begin{itemize} 1145 | \item $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$. 1146 | \item For $t \in \mathbb{R}$, $\nabla_x (t\;f(x)) = t \nabla_x f(x)$. 1147 | \end{itemize} 1148 | 1149 | In principle, gradients are a natural extension of partial derivatives 1150 | to functions of multiple variables. In practice, however, working with 1151 | gradients can sometimes be tricky for notational reasons. 1152 | For example, 1153 | suppose that $A \in \mathbb{R}^{m \times n}$ is a matrix of fixed coefficients and 1154 | suppose that $b \in \mathbb{R}^m$ is a vector of fixed coefficients. 1155 | Let $f : \mathbb{R}^{m} \rightarrow \mathbb{R}$ be the function defined by 1156 | $f(z) = z^T z$, such that $\nabla_z f(z) = 2z$. 1157 | But now, consider the expression, 1158 | \[ \nabla f(Ax). \] 1159 | How should this expression be interpreted? There are at least two possibilities: 1160 | \begin{enumerate} 1161 | \item 1162 | In the first interpretation, recall that $\nabla_z f(z) = 2z$. 
Here, we 1163 | interpret $\nabla f(Ax)$ as evaluating the gradient at the point $Ax$, 1164 | hence, 1165 | \[ \nabla f(Ax) = 2(Ax) = 2Ax \in \mathbb{R}^m. \] 1166 | \item 1167 | In the second interpretation, we consider the quantity $f(Ax)$ as a function of the 1168 | input variables $x$. More formally, let $g(x) = f(Ax)$. Then in this interpretation, 1169 | \[ \nabla f(Ax) = \nabla_x g(x) \in \mathbb{R}^n. \] 1170 | \end{enumerate} 1171 | Here, we can see that these two interpretations are indeed different. One interpretation 1172 | yields an $m$-dimensional vector as a result, while the other interpretation yields an 1173 | $n$-dimensional vector as a result! How can we resolve this? 1174 | 1175 | Here, the key is to make explicit the variables which we are differentiating with respect to. 1176 | In the first case, we are differentiating the function $f$ with respect to its arguments $z$ 1177 | and then substituting the argument $Ax$. In the second case, we are differentiating the 1178 | composite function $g(x) = f(Ax)$ with respect to $x$ directly. We denote the first 1179 | case as $\nabla_z f(Ax)$ and the second case as $\nabla_x f(Ax)$.\footnote{A drawback to this notation 1180 | that we will have to live with is the fact that in the first case, $\nabla_z f(Ax)$ it appears 1181 | that we are differentiating with respect to a variable that does not even appear in the expression 1182 | being differentiated! For this reason, the first case is often written as $\nabla f(Ax)$, and the fact 1183 | that we are differentiating with respect to the arguments of $f$ is understood. However, 1184 | the second case is \emph{always} written as $\nabla_x f(Ax)$.} 1185 | Keeping the notation 1186 | clear is extremely important (as you'll find out in your homework, in fact!). 1187 | 1188 | \subsection{The Hessian} 1189 | Suppose that $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is a function 1190 | that takes a vector in $\mathbb{R}^n$ and returns a real number. Then 1191 | the \textbf{\textit{Hessian}} matrix with respect to $x$, written 1192 | $\nabla_x^2 f(x)$ or simply as $H$ is the $n \times n$ matrix of 1193 | partial derivatives, 1194 | \[\nabla^2_x f(x) \in \mathbb{R}^{n \times n} = \left [ 1195 | \begin{array}{cccc} \frac{\partial^2 f(x)}{\partial x_1^2} & 1196 | \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & 1197 | \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ 1198 | \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & 1199 | \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 1200 | f(x)}{\partial x_2 \partial x_n} \\ 1201 | \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial 1202 | x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial 1203 | x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} 1204 | \end{array} \right ].\] 1205 | In other words, $\nabla_x^2 f(x) \in \mathbb{R}^{n \times n}$, with 1206 | \[(\nabla^2_x f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i 1207 | \partial x_j}.\] 1208 | Note that the Hessian is always symmetric, since 1209 | \[\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 1210 | f(x)}{\partial x_j \partial x_i}.\] 1211 | Similar to the gradient, the Hessian is defined only when $f(x)$ is 1212 | real-valued. 1213 | 1214 | It is natural to think of the gradient as the analogue 1215 | of the first derivative for functions of vectors, and the Hessian as 1216 | the analogue of the second derivative (and the symbols we use also 1217 | suggest this relation). 
This intuition is generally
1218 | correct, but there are a few caveats to keep in mind.
1219 | 
1220 | 
1221 | First, for real-valued functions of one variable $f:\mathbb{R}
1222 | \rightarrow \mathbb{R}$, it is a basic definition that the second
1223 | derivative is the derivative of the first derivative, i.e.,
1224 | \[\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x}
1225 | \frac{\partial}{\partial x} f(x).\]
1226 | However, for functions of a vector, the gradient of the function is a
1227 | vector, and we cannot take the gradient of a vector --- i.e.,
1228 | \[\nabla_x \nabla_x f(x) = \nabla_x \left [ \begin{array}{c} \frac{\partial
1229 | f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots
1230 | \\ \frac{\partial f(x)}{\partial x_n}\end{array} \right ]\]
1231 | and this expression is not defined. Therefore, it is \textit{not} the
1232 | case that the Hessian is the gradient of the gradient. However,
1233 | this is \textit{almost} true, in the following sense: If we look at
1234 | the $i$th entry of the gradient $(\nabla_x f(x))_i = \partial f(x) /
1235 | \partial x_i$, and take the gradient with respect to $x$, we get
1236 | \[\nabla_x \frac{\partial f(x)}{\partial x_i} = \left [
1237 | \begin{array}{c} \frac{\partial^2 f(x)}{\partial x_i \partial x_1}
1238 | \\ \frac{\partial^2 f(x)}{\partial x_i \partial x_2} \\ \vdots \\
1239 | \frac{\partial^2 f(x)}{\partial x_i \partial x_n}\end{array} \right ] \]
1240 | which is the $i$th column (or row) of the Hessian. Therefore,
1241 | \[\nabla_x^2 f(x) = \left [ \begin{array}{cccc} \nabla_x (\nabla_x
1242 | f(x))_1 & \nabla_x (\nabla_x f(x))_2 &
1243 | \cdots & \nabla_x (\nabla_x f(x))_n \end{array} \right ].\]
1244 | If we don't mind being a little bit sloppy, we can say that
1245 | (essentially) $\nabla_x^2 f(x) = \nabla_x (\nabla_x f(x))^T$, so long
1246 | as we understand that this really means taking the gradient of each
1247 | entry of $(\nabla_x f(x))^T$, not the gradient of the whole vector.
1248 | 
1249 | Finally, note that while we can take the gradient with respect to a
1250 | matrix $A \in \mathbb{R}^{n \times n}$, for the purposes of this class we will
1251 | only consider taking the Hessian with respect to a vector $x \in
1252 | \mathbb{R}^n$. This is simply a matter of convenience (and the fact
1253 | that none of the calculations we do require us to find the Hessian
1254 | with respect to a matrix), since the Hessian with respect to a matrix
1255 | would have to represent all the partial derivatives $\partial^2 f(A) /
1256 | (\partial A_{ij} \partial A_{k\ell})$, and it is rather cumbersome to
1257 | represent this as a matrix.
1258 | 
1259 | \subsection{Gradients and Hessians of Quadratic and Linear Functions}
1260 | 
1261 | Now let's try to determine the gradient and Hessian matrices for a few
1262 | simple functions. It should be noted that all the gradients given
1263 | here are special cases of the gradients given in the CS229 lecture
1264 | notes.
1265 | 
1266 | For $x \in \mathbb{R}^n$, let $f(x) = b^T x$ for some known
1267 | vector $b \in \mathbb{R}^n$. Then
1268 | \[f(x) = \sum_{i = 1}^n b_i x_i\]
1269 | so
1270 | \[\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k}
1271 | \sum_{i = 1}^n b_i x_i = b_k.\]
1272 | From this we can easily see that $\nabla_x b^T x = b$. This should be
1273 | compared to the analogous situation in single-variable calculus, where
1274 | $\partial/(\partial x) \; ax = a$.
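Before moving on to quadratic functions, here is a small numerical check of the identity $\nabla_x b^T x = b$ using central differences. This is only an illustrative sketch (it assumes NumPy; the vector $b$ and the evaluation point below are arbitrary).
\begin{verbatim}
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function f at x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

b = np.array([1.0, -2.0, 3.0])
x0 = np.array([0.5, 0.1, -0.7])

print(numerical_gradient(lambda x: b @ x, x0))  # approximately b
print(b)
\end{verbatim}
The same helper can be reused to check the quadratic-form gradient derived next.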
1275 | 
1276 | Now consider the quadratic function $f(x) = x^T A x$ for $A \in
1277 | \mathbb{S}^n$. Remember that
1278 | \[f(x) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j. \]
1279 | To take the partial derivative, we'll consider the terms including $x_k$
1280 | and $x_k^2$ factors separately:
1281 | \begin{eqnarray*}
1282 | \frac{\partial f(x)}{\partial x_k}
1283 | &=& \frac{\partial}{\partial x_k} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j \\
1284 | &=& \frac{\partial}{\partial x_k} \left[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \right] \\
1285 | &=& \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k \\
1286 | &=& \sum_{i=1}^n A_{ik} x_i + \sum_{j=1}^n A_{kj} x_j = 2 \sum_{i=1}^n A_{ki} x_i,
1287 | \end{eqnarray*}
1288 | where the last equality follows since $A$ is symmetric (which we can
1289 | safely assume, since it is appearing in a quadratic form).
1290 | Note that the $k$th entry of $\nabla_x f(x)$ is just the inner
1291 | product of the $k$th row of $A$ and $x$. Therefore, $\nabla_x x^T A
1292 | x = 2Ax$. Again, this should remind you of the analogous fact in
1293 | single-variable calculus, that $\partial/(\partial x)\; ax^2 = 2ax$.
1294 | 
1295 | Finally, let's look at the Hessian of the quadratic function $f(x) =
1296 | x^T A x$ (it should be obvious that the Hessian of a linear function
1297 | $b^T x$ is zero). In this case,
1298 | \[\frac{\partial^2 f(x)}{\partial x_k \partial x_\ell} =
1299 | \frac{\partial}{\partial x_k} \left[ \frac{\partial f(x)}{\partial x_\ell} \right] =
1300 | \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^n A_{\ell i} x_i \right] = 2 A_{\ell k} = 2 A_{k \ell}.\]
1301 | Therefore, it should be clear that $\nabla_x^2 x^T A x = 2 A$, which
1302 | should be entirely expected (and again analogous to the
1303 | single-variable fact that $\partial^2/(\partial x^2)\;ax^2 = 2a$).
1304 | 
1305 | To recap,
1306 | \begin{itemize}
1307 | \item $\nabla_x b^T x = b$
1308 | \item $\nabla_x x^T A x = 2Ax$ (if $A$ symmetric)
1309 | \item $\nabla_x^2 x^T A x = 2A$ (if $A$ symmetric)
1310 | \end{itemize}
1311 | 
1312 | \subsection{Least Squares}
1313 | 
1314 | Let's apply the equations we obtained in the last section to
1315 | derive the least squares equations. Suppose we are given a matrix $A
1316 | \in \mathbb{R}^{m \times n}$ (for simplicity we assume $A$ is full
1317 | rank) and a vector $b \in \mathbb{R}^m$ such that $b \not \in
1318 | \mathcal{R}(A)$. In this situation we will not be able to find a
1319 | vector $x \in \mathbb{R}^n$ such that $Ax = b$, so instead we want to
1320 | find a vector $x$ such that $Ax$ is as close as possible to $b$, as
1321 | measured by the square of the Euclidean norm $\|Ax - b\|_2^2$.
1322 | 
1323 | Using the fact that $\|x\|_2^2 = x^T x$, we have
1324 | \begin{eqnarray*}
1325 | \|Ax - b\|_2^2 & = & (Ax - b)^T(Ax - b) \\
1326 | & = & x^T A^T A x - 2b^T Ax + b^T b.
1327 | \end{eqnarray*}
1328 | Taking the gradient with respect to $x$, and using the
1329 | properties we derived in the previous section, we have
1330 | \begin{eqnarray*}
1331 | \nabla_x (x^T A^T A x - 2b^T Ax + b^T b) & = & \nabla_x x^T A^T A x -
1332 | \nabla_x 2b^T Ax + \nabla_x b^T b \\
1333 | & = & 2 A^T A x - 2 A^T b.
1334 | \end{eqnarray*}
1335 | Setting this last expression equal to zero and solving for $x$ gives
1336 | the normal equations
1337 | \[x = (A^T A)^{-1}A^T b,\]
1338 | which is the same as what we derived in class.
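As a brief numerical illustration (a sketch assuming NumPy; the random data below is purely for demonstration), the normal-equations solution agrees with a library least-squares solver:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))          # tall, (generically) full-rank matrix
b = rng.standard_normal(20)               # generically not in the range of A

# Solve A^T A x = A^T b rather than forming the inverse explicitly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solution for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))     # True
\end{verbatim}
In practice, numerically stable routines based on a QR or SVD factorization (as used by \texttt{np.linalg.lstsq}) are generally preferred over forming $A^T A$ when the problem is ill-conditioned.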
1339 | 
1340 | \subsection{Gradients of the Determinant}
1341 | Now let's consider a situation where we find the gradient of a function
1342 | with respect to a matrix, namely for $A \in \mathbb{R}^{n \times n}$,
1343 | we want to find $\nabla_A |A|$. Recall from our discussion of
1344 | determinants that
1345 | \[ |A| = \sum_{i=1}^n (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus
1346 | j}| \;\;\;\;\;\mbox{(for any $j \in \{1,\ldots, n\}$)} \]
1347 | so
1348 | \[\frac{\partial}{\partial A_{k \ell}}|A| = \frac{\partial}{\partial
1349 | A_{k \ell}}\sum_{i=1}^n (-1)^{i+j} A_{ij}
1350 | |A_{\setminus i, \setminus j}| = (-1)^{k+\ell} |A_{\setminus k, \setminus
1351 | \ell}| = (\mathrm{adj}(A))_{\ell k}.\]
1352 | From this it immediately follows from the properties of the adjoint
1353 | that
1354 | \[\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T}.\]
1355 | 
1356 | Now let's consider the function $f:\mathbb{S}^n_{++} \rightarrow
1357 | \mathbb{R}$, $f(A) = \log |A|$. Note that we have to restrict the
1358 | domain of $f$ to be the positive definite matrices, since this ensures
1359 | that $|A| > 0$, so that the log of $|A|$ is a real number. In this
1360 | case we can use the chain rule (nothing fancy, just the ordinary chain
1361 | rule from single-variable calculus) to see that
1362 | \[\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log
1363 | |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} =
1364 | \frac{1}{|A|}\frac{\partial |A|}{\partial A_{ij}}.\]
1365 | From this it should be obvious that
1366 | \[\nabla_A \log |A| = \frac{1}{|A|}\nabla_A |A| = A^{-1},\]
1367 | where we can drop the transpose in the last expression because $A$ is
1368 | symmetric. Note the similarity to the single-variable case, where
1369 | $\partial/(\partial x)\; \log x = 1/x$.
1370 | 
1371 | \subsection{Eigenvalues as Optimization}
1372 | 
1373 | Finally, we use matrix calculus to solve an optimization problem in a
1374 | way that leads directly to eigenvalue/eigenvector analysis. Consider
1375 | the following equality-constrained optimization problem:
1376 | \[\mathrm{max}_{x \in \mathbb{R}^n} \;\; x^T A x \;\;\;\;\;
1377 | \mbox{subject to } \|x\|_2^2 = 1\]
1378 | for a symmetric matrix $A \in \mathbb{S}^{n}$.
1379 | A standard way of solving optimization problems with equality
1380 | constraints is by forming the \textbf{\textit{Lagrangian}}, an
1381 | objective function that includes the equality
1382 | constraints.\footnote{Don't worry if you haven't seen Lagrangians
1383 | before, as we will cover them in greater detail later in CS229.}
1384 | The Lagrangian in this case can be given by
1385 | \[\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x\]
1386 | where $\lambda$ is called the Lagrange multiplier associated with the
1387 | equality constraint. It can be established that for $x^*$ to be an
1388 | optimal point of the problem, the gradient of the Lagrangian has to be
1389 | zero at $x^*$ (this is not the only condition, but it is required).
1390 | That is,
1391 | \[\nabla_x \mathcal{L}(x,\lambda) = \nabla_x(x^T A x - \lambda x^T x)
1392 | = 2 A^T x - 2 \lambda x = 0.\]
1393 | Notice that this is just the linear equation $Ax = \lambda x$. This
1394 | shows that the only points which can possibly maximize (or minimize)
1395 | $x^T A x$ assuming $x^T x = 1$ are the eigenvectors of $A$.
1396 | 
1397 | \end{document}
1398 | 
1399 | 
--------------------------------------------------------------------------------