├── Anaconda_Setup.pdf
├── Conda Setup
│   ├── acp.png
│   ├── p2.png
│   ├── conda-update.png
│   ├── latexmkrc
│   ├── py2tex.py
│   ├── macros-Fall2018.tex
│   └── main.tex
├── Probability Review.pdf
├── Linear Algebra Review.pdf
├── Linux and Git Guide.pdf
├── Probability Review
│   ├── fig1.png
│   ├── fig2.png
│   ├── latexmkrc
│   ├── prob_slides.tex
│   ├── nips07submit_e.sty
│   └── Report.tex
├── Linear Algebra Review
│   ├── figures
│   │   └── figure.png
│   ├── Makefile
│   ├── latexmkrc
│   ├── linalg2.toc
│   └── linalg2.tex
├── Linux Git Guide
│   ├── Linux Git Guide
│   │   ├── Linux and Git Guide.pdf
│   │   ├── latexmkrc
│   │   └── main.tex
│   ├── latexmkrc
│   └── main.tex
├── Useful Links
│   ├── XCS234
│   │   └── README.md
│   └── README.md
├── README.md
└── .gitignore

/Anaconda_Setup.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Anaconda_Setup.pdf
--------------------------------------------------------------------------------
/Conda Setup/acp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Conda Setup/acp.png
--------------------------------------------------------------------------------
/Conda Setup/p2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Conda Setup/p2.png
--------------------------------------------------------------------------------
/Probability Review.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Probability Review.pdf
--------------------------------------------------------------------------------
/Linear Algebra Review.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linear Algebra Review.pdf
--------------------------------------------------------------------------------
/Linux and Git Guide.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linux and Git Guide.pdf
--------------------------------------------------------------------------------
/Probability Review/fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Probability Review/fig1.png
--------------------------------------------------------------------------------
/Probability Review/fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Probability Review/fig2.png
--------------------------------------------------------------------------------
/Conda Setup/conda-update.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Conda Setup/conda-update.png
--------------------------------------------------------------------------------
/Linear Algebra Review/figures/figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linear Algebra Review/figures/figure.png
--------------------------------------------------------------------------------
/Linear Algebra Review/Makefile:
-------------------------------------------------------------------------------- 1 | default: 2 | latex linalg2.tex 3 | dvips -o linalg2.ps -t letter -Ppdf -G0 linalg2.dvi 4 | ps2pdf linalg2.ps 5 | #evince linalg.pdf 6 | -------------------------------------------------------------------------------- /Linux Git Guide/Linux Git Guide/Linux and Git Guide.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scpd-proed/General_Handouts/HEAD/Linux Git Guide/Linux Git Guide/Linux and Git Guide.pdf -------------------------------------------------------------------------------- /Useful Links/XCS234/README.md: -------------------------------------------------------------------------------- 1 | # Research Papers 2 | | Label(s) | Link | Description| 3 | | ----------- | ----------- | ----------- | 4 | | `robotics` | [QT-OPT](https://arxiv.org/abs/1806.10293) | Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation | 5 | 6 | # Code 7 | | Label(s) | Link | Description| 8 | | ----------- | ----------- | ----------- | 9 | 10 | 11 | # Blog Posts 12 | | Label(s) | Link | Description| 13 | | ----------- | ----------- | ----------- | 14 | 15 | 16 | # Talks/Presentations 17 | | Label(s) | Link | Description| 18 | | ----------- | ----------- | ----------- | 19 | 20 | 21 | # Podcasts 22 | | Label(s) | Link | Description| 23 | | ----------- | ----------- | ----------- | 24 | -------------------------------------------------------------------------------- /Conda Setup/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("main.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Anaconda_Setup"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Linux Git Guide/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("main.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Linux and Git Guide"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 
10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Probability Review/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("Report.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Probability Review"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Linear Algebra Review/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("linalg2.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Linear Algebra Review"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Linux Git Guide/Linux Git Guide/latexmkrc: -------------------------------------------------------------------------------- 1 | @default_files = ("main.tex"); # Set the root tex file for the output document 2 | $pdf_mode = 1; # tex -> PDF 3 | $auto_rc_use = 1; # Do not read further latexmkrc files 4 | $out_dir = ".."; # Create output files in the assignment directory (assignment/) 5 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code. 6 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S"; 7 | # Forces latexmk to stop and quit if it encounters an error 8 | $jobname = "Linux and Git Guide"; # This is the name of the output PDF file 9 | $silent = 1; # For quieter output on the terminal. 
10 | 11 | add_cus_dep('pytex','tex',0,'py2tex'); 12 | sub py2tex { 13 | system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\""); 14 | } -------------------------------------------------------------------------------- /Conda Setup/py2tex.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse, sys, io, re 3 | 4 | PYTEX_PATTERN = r'(?s)🐍(.*?)🐍' 5 | 6 | def collect_stdout_from_executable(exec_string, local_scope={}, global_scope={}): 7 | old_stdout = sys.stdout 8 | redirected_output = sys.stdout = io.StringIO() 9 | try: 10 | exec(exec_string, global_scope, local_scope) 11 | except: raise 12 | finally: 13 | sys.stdout = old_stdout 14 | return redirected_output.getvalue() 15 | 16 | def pytex_to_tex(pytex): 17 | local_scope, global_scope = {}, {} 18 | tex = re.sub(PYTEX_PATTERN, 19 | lambda match: collect_stdout_from_executable(match.group(1), 20 | local_scope, 21 | global_scope), 22 | pytex) 23 | return tex 24 | 25 | if __name__ == '__main__': 26 | parser = argparse.ArgumentParser(description='Converts .pytex files into .tex files..') 27 | parser.add_argument('infile', help="The source pytex file.") 28 | parser.add_argument('outfile', help="The destination tex file.") 29 | 30 | infile = parser.parse_args().infile 31 | outfile = parser.parse_args().outfile 32 | 33 | # Read the infile 34 | with open(infile, 'r') as in_f: 35 | # Write to the outfile 36 | with open(outfile, 'w') as out_f: 37 | out_f.write(pytex_to_tex(in_f.read())) 38 | 39 | """ 40 | Highlight with the following .sublime-syntax file: 41 | 42 | %YAML 1.2 43 | --- 44 | file_extensions: [pytex] 45 | scope: pytex 46 | 47 | contexts: 48 | main: 49 | - include: Packages/LaTeX/LaTeX.sublime-syntax 50 | - match: 🐍 51 | embed: Packages/Python/Python.sublime-syntax 52 | escape: 🐍 53 | """ -------------------------------------------------------------------------------- /Useful Links/README.md: -------------------------------------------------------------------------------- 1 | # Useful Links 2 | Within this directory you will find child directories for various XCS classes, within each of these XCS class directories there is a README.md file that contains links to online resources which relate to the class materials. We categorise resources as follows: 3 | 4 | * Research Papers ([example](https://arxiv.org/abs/2109.04617)) 5 | * Code ([example](https://github.com/suraj-nair-1/lorel)) 6 | * Blog Posts ([example](https://ai.stanford.edu/blog/meta-exploration/)) 7 | * Talks/Presentations ([example](https://www.youtube.com/watch?v=733m6qBH-jI)) 8 | * Podcasts ([example](https://www.eye-on.ai/podcast-044)) 9 | 10 | The purpose of including these links is to help students navigate some of the available online materials that relate to the class content and to help instigate additional discussion of the class materials. We hope you find value in exploring the available links and we look forward to discussing their content with you. 11 | 12 | # Contribution Guide 13 | If you have a resource you believe would be of value to add to the current lists please open a pull request with changes that include a link to the resource you wish to add. There are a few conditions that your pull request needs to follow in order to be reviewed: 14 | 15 | 1. You must ensure that your resource link is included under the header that corresponds to its resource type 16 | 2. You must ensure that a link to the same resource doesn't already exist in the list 17 | 3. 
You must ensure that the link is openly accessible and not behind a paywall
18 | 4. Please add your changes at the top of the current list, as we wish to maintain chronological order
19 | 5. Please also include label(s) that relate to the resource you are including (e.g. [`AI` | `robotics`])
20 |
21 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Course Handouts
2 | This directory contains the source code for compiling course handouts. The documents in
3 | this directory **are not** the assignment handouts.
4 |
5 | Each subdirectory within this folder has, at a minimum, a file titled
6 | `latexmkrc`. This is the settings file for latexmk, which will handle
7 | juggling the various latex engines preferred by the course staff. A basic
8 | `latexmkrc` file (e.g., for `pdflatex`) might have the following contents
9 | ([`latexmk` documentation](https://mirror.las.iastate.edu/tex-archive/support/latexmk/latexmk.pdf)):
10 | ```
11 | @default_files = ("main.tex"); # Set the root tex file for the output document
12 | $pdf_mode = 1; # tex -> PDF
13 | $auto_rc_use = 1; # Do not read further latexmkrc files
14 | $warnings_as_errors = 1; # Elevates warnings to errors. Enforces cleaner code.
15 | $pdflatex = "pdflatex -halt-on-error -interaction=batchmode %O %S";
16 | # Forces latexmk to stop and quit if it encounters an error
17 | $jobname = "output_name"; # This is the name of the output PDF file
18 | $silent = 1; # For quieter output on the terminal.
19 | ```
20 | Feel free to customize this as you desire, including adding more files,
21 | directories, and media. There is only one requirement:
22 |
23 | **IT MUST BE POSSIBLE TO COMPILE EACH DOCUMENT USING ONLY THE FOLLOWING COMMAND:**
24 | ```
25 | $ latexmk
26 | ```
27 | A properly set up `latexmkrc` file can handle any special compilation options you
28 | may require. Put those options in the `latexmkrc` file so that other course
29 | staff can compile your document with the command above.
30 |
31 | Other commands that might be helpful include:
32 | - `$ latexmk -pvc`: (preview continuously) This will run `latexmk`
33 | continuously, allowing you to immediately view changes to your output document
34 | as you save source files.
35 | - `$ latexmk -c`: This will remove all auxiliary files other than the final
36 | output PDF.
37 | - `$ latexmk -C`: This will remove all output files (including the final output
38 | PDF).
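
As a concrete example of such customization, the handout directories in this repository extend the basic file above with an output directory, a document-specific `$jobname`, and a custom dependency that converts `.pytex` files to `.tex` with `py2tex.py`. A sketch adapted from `Conda Setup/latexmkrc`:
```
$out_dir = "..";              # Create output files in the parent directory
$jobname = "Anaconda_Setup";  # This is the name of the output PDF file

# Convert .pytex files to .tex using the repository's py2tex.py script
add_cus_dep('pytex','tex',0,'py2tex');
sub py2tex {
    system("./py2tex.py \"$_[0].pytex\" \"$_[0].tex\"");
}
```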
-------------------------------------------------------------------------------- /Linear Algebra Review/linalg2.toc: -------------------------------------------------------------------------------- 1 | \contentsline {section}{\numberline {1}Basic Concepts and Notation}{2} 2 | \contentsline {subsection}{\numberline {1.1}Basic Notation}{2} 3 | \contentsline {section}{\numberline {2}Matrix Multiplication}{3} 4 | \contentsline {subsection}{\numberline {2.1}Vector-Vector Products}{3} 5 | \contentsline {subsection}{\numberline {2.2}Matrix-Vector Products}{4} 6 | \contentsline {subsection}{\numberline {2.3}Matrix-Matrix Products}{5} 7 | \contentsline {section}{\numberline {3}Operations and Properties}{7} 8 | \contentsline {subsection}{\numberline {3.1}The Identity Matrix and Diagonal Matrices}{8} 9 | \contentsline {subsection}{\numberline {3.2}The Transpose}{8} 10 | \contentsline {subsection}{\numberline {3.3}Symmetric Matrices}{8} 11 | \contentsline {subsection}{\numberline {3.4}The Trace}{9} 12 | \contentsline {subsection}{\numberline {3.5}Norms}{10} 13 | \contentsline {subsection}{\numberline {3.6}Linear Independence and Rank}{11} 14 | \contentsline {subsection}{\numberline {3.7}The Inverse of a Square Matrix}{11} 15 | \contentsline {subsection}{\numberline {3.8}Orthogonal Matrices}{12} 16 | \contentsline {subsection}{\numberline {3.9}Range and Nullspace of a Matrix}{13} 17 | \contentsline {subsection}{\numberline {3.10}The Determinant}{14} 18 | \contentsline {subsection}{\numberline {3.11}Quadratic Forms and Positive Semidefinite Matrices}{17} 19 | \contentsline {subsection}{\numberline {3.12}Eigenvalues and Eigenvectors}{18} 20 | \contentsline {subsection}{\numberline {3.13}Eigenvalues and Eigenvectors of Symmetric Matrices}{19} 21 | \contentsline {paragraph}{Background: representing vector w.r.t. another basis.}{20} 22 | \contentsline {paragraph}{``Diagonalizing'' matrix-vector multiplication.}{21} 23 | \contentsline {paragraph}{``Diagonalizing'' quadratic form.}{21} 24 | \contentsline {section}{\numberline {4}Matrix Calculus}{23} 25 | \contentsline {subsection}{\numberline {4.1}The Gradient}{23} 26 | \contentsline {subsection}{\numberline {4.2}The Hessian}{24} 27 | \contentsline {subsection}{\numberline {4.3}Gradients and Hessians of Quadratic and Linear Functions}{26} 28 | \contentsline {subsection}{\numberline {4.4}Least Squares}{27} 29 | \contentsline {subsection}{\numberline {4.5}Gradients of the Determinant}{28} 30 | \contentsline {subsection}{\numberline {4.6}Eigenvalues as Optimization}{28} 31 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # celery beat schedule file 95 | celerybeat-schedule 96 | 97 | # SageMath parsed files 98 | *.sage.py 99 | 100 | # Environments 101 | .env 102 | .venv 103 | env/ 104 | venv/ 105 | ENV/ 106 | env.bak/ 107 | venv.bak/ 108 | 109 | # Spyder project settings 110 | .spyderproject 111 | .spyproject 112 | 113 | # Rope project settings 114 | .ropeproject 115 | 116 | # mkdocs documentation 117 | /site 118 | 119 | # mypy 120 | .mypy_cache/ 121 | .dmypy.json 122 | dmypy.json 123 | 124 | # Pyre type checker 125 | .pyre/ 126 | 127 | # For .DS_Stores and main generated files 128 | .DS_Store 129 | *.aux 130 | *.log 131 | *.fdb_latexmk 132 | *.fls 133 | *.out 134 | *.synctex.gz 135 | main.pdf -------------------------------------------------------------------------------- /Probability Review/prob_slides.tex: -------------------------------------------------------------------------------- 1 | \documentclass{beamer} 2 | \usepackage[utf8]{inputenc} 3 | \usepackage{amsmath} 4 | \usepackage{amsfonts} 5 | \usepackage{amssymb} 6 | %\usepackage[margin=0.5in]{geometry} 7 | \usepackage[normalem]{ulem} 8 | \usepackage{graphicx} 9 | %\usetheme{Darmstadt} 10 | \title{CS229} 11 | \author{Probability Theory Review} 12 | \date{} 13 | \begin{document} 14 | \begin{frame} 15 | \titlepage 16 | \end{frame} 17 | \begin{frame} 18 | \frametitle{Elements of Probability} 19 | \pause 20 | \begin{enumerate} 21 | \itemsep1em 22 | \item[] \textbf{Sample Space} $\Omega$ 23 | \pause 24 | \item[] \indent $\qquad \{HH, HT, TH, TT\}$ 25 | \pause 26 | \item[] \textbf{Event} $A\subseteq \Omega$ 27 | \pause 28 | \item[] \indent $\qquad \{HH, HT\}$, \pause $\Omega$ 29 | \pause 30 | \item[] \textbf{Event Space} $\mathcal{F}$ 31 | \pause 32 | \item[] \textbf{Probability Measure} $P:\mathcal{F}\rightarrow\mathbb{R}$ 33 | \begin{enumerate} 34 | \itemsep1em 35 | \pause 36 | \item[] $P(A)\geq 0\quad \forall A\in\mathcal{F}$ 37 | \pause 38 | \item[] $P(\Omega)=1$ 39 | \pause 40 | \item[] If $A_1,A_2,...$ disjoint set of events ($A_i\cap A_j=\emptyset$ when $i\neq j$), then 41 | $$P\left(\bigcup_i A_i\right)=\sum_iP(A_i)$$ 42 | \end{enumerate} 43 | \end{enumerate} 44 | \end{frame} 45 | \begin{frame} 46 | \frametitle{Conditional Probability and Independence} 47 | \pause 48 | \begin{enumerate} 49 | \itemsep3em 50 | \item[] Let $B$ be any event such that $P(B)\neq 0$. 
51 | \pause 52 | \item[] $P(A|B):=\frac{P(A\cap B)}{P(B)}$ 53 | \pause 54 | \item[] $A\perp B$ if and only if $P(A\cap B)=P(A)P(B)$ 55 | \pause 56 | \item[] $A\perp B$ if and only if $P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(A)P(B)}{P(B)}=P(A)$ 57 | \end{enumerate} 58 | \end{frame} 59 | \begin{frame} 60 | \frametitle{Random Variables (RV)} 61 | \pause 62 | \begin{center}$\omega_0=HHHTHTTHTT$\end{center} 63 | \pause 64 | \begin{enumerate} 65 | \itemsep2em 66 | \item[] A \textbf{RV} is $X:\Omega\rightarrow\mathbb{R}$ 67 | \pause 68 | \item[] \indent $\qquad$ \# of heads: $X(\omega_0)=5$ 69 | \pause 70 | \item[] \indent $\qquad$ \# of tosses until tails: $X(\omega_0)=4$ 71 | \pause 72 | \item[] $Val(X):=X(\Omega)$ 73 | \pause 74 | \item[] \indent $\qquad Val(X)=\{0,1,...,10\}$ 75 | \end{enumerate} 76 | \end{frame} 77 | \begin{frame} 78 | \frametitle{Cumulative Distribution Function (CDF)} 79 | \pause 80 | \begin{enumerate} 81 | \itemsep3em 82 | \item[] $F_X:\mathbb{R}\rightarrow[0, 1]$ 83 | \pause 84 | \item[] \indent $\qquad F_X(x)=P(X\leq x)$\pause $:=P(\{\omega|X(\omega)\leq x\})$ 85 | \pause 86 | \item[] \includegraphics[width=2in]{cdf.png} 87 | \end{enumerate} 88 | \end{frame} 89 | \begin{frame} 90 | \frametitle{Discrete vs. Continuous RV} 91 | \pause 92 | \begin{enumerate} 93 | \itemsep0.5em 94 | \item[] \textbf{Discrete RV}: $Val(X)$ is countable 95 | \pause 96 | \item[] $\qquad P(X=k):=P(\{\omega|X(\omega)=k\})$ 97 | \pause 98 | \item[] $\qquad$Probability Mass Function (PMF): $p_X:Val(X)\rightarrow [0,1]$ 99 | \pause 100 | \item[] $\qquad\qquad p_X(x):=P(X=x)$ 101 | \pause 102 | \item[] $\qquad\qquad \underset{x\in Val(X)}{\sum}p_X(x)=1$ 103 | \pause 104 | \item[] \textbf{Continuous RV}: $Val(X)$ is uncountable 105 | \pause 106 | \item[] \indent $\qquad P(a\leq X\leq b):=P(\{\omega|a\leq X(\omega)\leq b\})$ 107 | \pause 108 | \item[] $\qquad$Probability Density Function (PDF): $f_X:\mathbb{R}\rightarrow\mathbb{R}$ 109 | \pause 110 | \item[] $\qquad\qquad f_X(x):=\frac{d}{dx}F_X(x)$ 111 | \pause 112 | \item[] $\qquad\qquad f_X(x)\neq P(X=x)$ 113 | \pause 114 | \item[] $\qquad\qquad \int_{-\infty}^{\infty}\underbrace{f_X(x)dx\pause}_{P(x\leq X\leq x+dx)}=1$ 115 | \end{enumerate} 116 | \end{frame} 117 | 118 | \begin{frame} 119 | \frametitle{Expected Value and Variance} 120 | \pause 121 | \begin{enumerate} 122 | \itemsep1em 123 | \item[] $g:\mathbb{R}\rightarrow\mathbb{R}$ 124 | \pause 125 | \item[] \textbf{Expected Value} 126 | \pause 127 | \item[] $\qquad$ Let $X$ be a discrete RV with PMF $p_X$. 128 | \pause 129 | \item[] $\qquad\qquad \mathbb{E}[g(X)]:=\underset{x\in Val(X)}{\sum}g(x)p_X(x)$ 130 | \pause 131 | \item[] $\qquad$ Let $X$ be a continuous RV with PDF $f_X$. 132 | \pause 133 | \item[] $\qquad\qquad \mathbb{E}[g(X)]:=\int_{-\infty}^{\infty}g(x)f_X(x)dx$ 134 | \pause 135 | \item[] \textbf{Variance} 136 | \pause 137 | \item[] $\qquad Var(X):=\mathbb{E}[(X-\mathbb{E}[X])^2]$\pause $ =\mathbb{E}[X^2]-\mathbb{E}[X]^2$ 138 | \end{enumerate} 139 | \end{frame} 140 | 141 | \begin{frame} 142 | \frametitle{Example Distributions} 143 | \begin{center} 144 | \small 145 | $$\begin{array}{|l|l|c|c|} 146 | \hline 147 | \text{Distribution} & \text{PDF or PMF} & \text{Mean} & \text{Variance} \\ 148 | \hline 149 | Bernoulli(p) & \left\{\begin{array}{ll}p,&\text{if }x=1\\1-p,&\text{if }x=0.\end{array}\right. & p & p(1-p) \\ 150 | \hline 151 | Binomial(n,p) & \binom{n}{k}p^k(1-p)^{n-k}\text{ for }k=0,1,...,n & np & np(1-p) \\ 152 | \hline 153 | Geometric(p) & p(1-p)^{k-1}\text{ for }k=1,2,... 
& \frac1p & \frac{1-p}{p^2} \\ 154 | \hline 155 | Poisson(\lambda) & \frac{e^{-\lambda}\lambda^k}{k!}\text{ for }k=0,1,... & \lambda & \lambda \\ 156 | \hline 157 | Uniform(a,b) & \frac{1}{b-a}\text{ for all } x\in(a,b) & \frac{a+b}{2} & \frac{(b-a)^2}{12} \\ 158 | \hline 159 | Gaussian(\mu,\sigma^2) & \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\text{ for all } x\in(-\infty, \infty) & \mu & \sigma^2 \\ 160 | \hline 161 | Exponential(\lambda) & \lambda e^{-\lambda x}\text{ for all } x\geq 0, \lambda\geq 0 & \frac{1}{\lambda} & \frac{1}{\lambda^2}\\ 162 | \hline 163 | \end{array} $$ 164 | \end{center} 165 | \end{frame} 166 | \begin{frame} 167 | \frametitle{Two Random Variables} 168 | \begin{enumerate} 169 | \itemsep2em 170 | \item $F_{XY}(x,y)=P(X\leq x,Y\leq y)$ 171 | \item $p_{XY}(x,y)=P(X=x,Y=y)$ 172 | \item $p_X(x)=\sum_yp_{XY}(x,y)$ 173 | \item $f_{XY}(x,y)=\frac{\partial^2 F_{XY}(x,y)}{\partial x\partial y}$ 174 | \item $f_X(x)=\int_{-\infty}^{\infty}f_{XY}(x,y)dy$ 175 | \end{enumerate} 176 | \end{frame} 177 | 178 | \end{document} -------------------------------------------------------------------------------- /Linux Git Guide/main.tex: -------------------------------------------------------------------------------- 1 | \documentclass[10pt,landscape]{article} 2 | \usepackage{multicol} 3 | \usepackage{calc} 4 | \usepackage{ifthen} 5 | \usepackage[landscape]{geometry} 6 | \usepackage{amsmath,amsthm,amsfonts,amssymb} 7 | \usepackage{color,graphicx,overpic} 8 | \usepackage{hyperref} 9 | \usepackage{listings} 10 | \lstset{ 11 | basicstyle=\ttfamily, 12 | columns=fullflexible, 13 | frame=single, 14 | breaklines=true, 15 | postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space}, 16 | } 17 | 18 | \pdfinfo{ 19 | /Title (Linux and Git Guide.pdf) 20 | /Creator (TeX) 21 | /Producer (pdfTeX 1.40.0) 22 | /Author (Armando Banuelos) 23 | /Subject (Linux and Git Guide) 24 | /Keywords (pdflatex, latex,pdftex,tex)} 25 | 26 | % This sets page margins to .5 inch if using letter paper, and to 1cm 27 | % if using A4 paper. (This probably isn't strictly necessary.) 28 | % If using another size paper, use default 1cm margins. 
29 | \ifthenelse{\lengthtest { \paperwidth = 11in}}
30 | { \geometry{top=.5in,left=.25in,right=.25in,bottom=.5in} }
31 | {\ifthenelse{ \lengthtest{ \paperwidth = 297mm}}
32 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
33 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
34 | }
35 |
36 | % Turn off header and footer
37 | \pagestyle{empty}
38 |
39 | % Redefine section commands to use less space
40 | \makeatletter
41 | \renewcommand{\section}{\@startsection{section}{1}{0mm}%
42 | {-1ex plus -.5ex minus -.2ex}%
43 | {0.5ex plus .2ex}%x
44 | {\normalfont\large\bfseries}}
45 | \renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}%
46 | {-1explus -.5ex minus -.2ex}%
47 | {0.5ex plus .2ex}%
48 | {\normalfont\normalsize\bfseries}}
49 | \renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}%
50 | {-1ex plus -.5ex minus -.2ex}%
51 | {1ex plus .2ex}%
52 | {\normalfont\small\bfseries}}
53 | \makeatother
54 |
55 | % Define BibTeX command
56 | \def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
57 | T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
58 |
59 | % Don't print section numbers
60 | \setcounter{secnumdepth}{0}
61 |
62 |
63 | \setlength{\parindent}{0pt}
64 | \setlength{\parskip}{0pt plus 0.5ex}
65 |
66 | %My Environments
67 | \newtheorem{example}[section]{Example}
68 | % -----------------------------------------------------------------------
69 |
70 | \begin{document}
71 | % \raggedright
72 | % \footnotesize
73 | \begin{multicols}{2}
74 |
75 |
76 | % multicol parameters
77 | % These lengths are set only within the two main columns
78 | %\setlength{\columnseprule}{0.25pt}
79 | \setlength{\premulticols}{1pt}
80 | \setlength{\postmulticols}{1pt}
81 | \setlength{\multicolsep}{1pt}
82 | \setlength{\columnsep}{2pt}
83 |
84 | \begin{center}
85 | \Large{Linux and Git Guide} \\
86 | \end{center}
87 |
88 | \section{Git - Keeping Your Code Updated}
89 |
90 | Git is software used for tracking changes in files and for collaborative code development.
91 | GitHub is a provider of Internet hosting for software development and version control using Git. \\
92 |
93 | Below is the recommended Git flow you should follow for assignment development. \\
94 |
95 | This guide assumes you have a GitHub account with token-based or SSH-based authentication set up. If not, please follow this \href{https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent}{tutorial} to create an SSH key and link it to your GitHub account so you can perform Git actions. \\
96 |
97 | 1. First start off by cloning a git repo to your local machine.
98 | \begin{lstlisting}[language=SQL]
99 | git clone git@github.com:scpd-proed/<repo_name>.git
100 | \end{lstlisting}
101 |
102 | 2. Create a branch to do your own development
103 | \begin{lstlisting}[language=SQL]
104 | git checkout -b <branch_name>
105 | \end{lstlisting}
106 |
107 | 3. If CFs make updates to the assignment, commit your unsaved changes
108 | \begin{lstlisting}[language=SQL]
109 | git commit -am "<commit message>"
110 | \end{lstlisting}
111 |
112 | 4. Merge updates from the assignment to your local branch
113 | \begin{lstlisting}[language=SQL]
114 | git pull origin master
115 | \end{lstlisting}
116 |
117 | After performing step 4, you may experience conflicts. To see which files are in conflict, please run
118 | \begin{lstlisting}[language=SQL]
119 | git status
120 | \end{lstlisting}
121 |
122 | The files in conflict are those listed as ``Not staged for commit''.
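Inside each conflicted file, Git marks the competing versions with conflict markers. A conflicted region will look roughly like the following sketch (the label after the last marker depends on the branch or commit you pulled):
\begin{lstlisting}
<<<<<<< HEAD
your local version of these lines
=======
the incoming version from the update
>>>>>>> master
\end{lstlisting}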
Open these files in the editor of your choice and look for the lines marked \texttt{<<<<<<<}, \texttt{=======}, and \texttt{>>>>>>>}; keep the version you want and delete the marker lines.\\
123 |
124 | Once you have resolved the conflicts, run
125 | \begin{lstlisting}[language=SQL]
126 | git add -A .
127 | \end{lstlisting}
128 |
129 | And lastly, commit your updates to your branch
130 | \begin{lstlisting}[language=SQL]
131 | git commit -am "<commit message>"
132 | \end{lstlisting}
133 |
134 | Please DO NOT push your local branch changes to GitHub or create pull requests. This will expose your solution code from your development branch and is in violation of the honor code.
135 |
136 | \section{Linux - Commands You Should Know}
137 | Linux is an open-source Unix-like operating system. All our assignments assume you have some understanding of how to run commands on Linux machines. \\
138 |
139 | Below we will go through popular Linux commands you should know for assignment development.\\
140 |
141 | 1. Listing directory (ls) - see files in your current directory
142 | \begin{lstlisting}[language=SQL]
143 | ls
144 | \end{lstlisting}
145 |
146 | 2. Changing directory (cd) - move to another directory
147 | \begin{lstlisting}[language=SQL]
148 | cd /path/to/directory
149 | \end{lstlisting}
150 |
151 | 3. Print Working Directory (pwd) - display the pathname of the current working directory
152 | \begin{lstlisting}[language=SQL]
153 | pwd
154 | \end{lstlisting}
155 |
156 | 4. Display file contents (cat)
157 | \begin{lstlisting}[language=SQL]
158 | cat file.txt
159 | \end{lstlisting}
160 |
161 | 5. Get a clear console window (clear)
162 | \begin{lstlisting}[language=SQL]
163 | clear
164 | \end{lstlisting}
165 |
166 | 6. Copying files (cp)
167 | \begin{lstlisting}[language=SQL]
168 | cp source destination
169 | \end{lstlisting}
170 |
171 | 7. Remove a file (rm)
172 | \begin{lstlisting}[language=SQL]
173 | rm file.txt
174 | \end{lstlisting}
175 |
176 | 8. Zipping up files (zip)
177 | \begin{lstlisting}[language=SQL]
178 | zip myfile.zip filename.txt
179 | \end{lstlisting}
180 |
181 | 9. Remotely log into another machine (ssh) (will be useful for using Azure GPUs)
182 | \begin{lstlisting}[language=SQL]
183 | ssh user@machine
184 | \end{lstlisting}
185 |
186 | 10. Rename or move a file (mv)
187 | \begin{lstlisting}[language=SQL]
188 | mv source destination
189 | \end{lstlisting}
190 |
191 |
192 | \end{multicols}
193 | \end{document}
--------------------------------------------------------------------------------
/Linux Git Guide/Linux Git Guide/main.tex:
--------------------------------------------------------------------------------
1 | \documentclass[10pt,landscape]{article}
2 | \usepackage{multicol}
3 | \usepackage{calc}
4 | \usepackage{ifthen}
5 | \usepackage[landscape]{geometry}
6 | \usepackage{amsmath,amsthm,amsfonts,amssymb}
7 | \usepackage{color,graphicx,overpic}
8 | \usepackage{hyperref}
9 | \usepackage{listings}
10 | \lstset{
11 | basicstyle=\ttfamily,
12 | columns=fullflexible,
13 | frame=single,
14 | breaklines=true,
15 | postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space},
16 | }
17 |
18 | \pdfinfo{
19 | /Title (Linux and Git Guide.pdf)
20 | /Creator (TeX)
21 | /Producer (pdfTeX 1.40.0)
22 | /Author (Armando Banuelos)
23 | /Subject (Linux and Git Guide)
24 | /Keywords (pdflatex, latex,pdftex,tex)}
25 |
26 | % This sets page margins to .5 inch if using letter paper, and to 1cm
27 | % if using A4 paper. (This probably isn't strictly necessary.)
28 | % If using another size paper, use default 1cm margins.
29 | \ifthenelse{\lengthtest { \paperwidth = 11in}} 30 | { \geometry{top=.5in,left=.25in,right=.25in,bottom=.5in} } 31 | {\ifthenelse{ \lengthtest{ \paperwidth = 297mm}} 32 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} } 33 | {\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} } 34 | } 35 | 36 | % Turn off header and footer 37 | \pagestyle{empty} 38 | 39 | % Redefine section commands to use less space 40 | \makeatletter 41 | \renewcommand{\section}{\@startsection{section}{1}{0mm}% 42 | {-1ex plus -.5ex minus -.2ex}% 43 | {0.5ex plus .2ex}%x 44 | {\normalfont\large\bfseries}} 45 | \renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}% 46 | {-1explus -.5ex minus -.2ex}% 47 | {0.5ex plus .2ex}% 48 | {\normalfont\normalsize\bfseries}} 49 | \renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}% 50 | {-1ex plus -.5ex minus -.2ex}% 51 | {1ex plus .2ex}% 52 | {\normalfont\small\bfseries}} 53 | \makeatother 54 | 55 | % Define BibTeX command 56 | \def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em 57 | T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}} 58 | 59 | % Don't print section numbers 60 | \setcounter{secnumdepth}{0} 61 | 62 | 63 | \setlength{\parindent}{0pt} 64 | \setlength{\parskip}{0pt plus 0.5ex} 65 | 66 | %My Environments 67 | \newtheorem{example}[section]{Example} 68 | % ----------------------------------------------------------------------- 69 | 70 | \begin{document} 71 | % \raggedright 72 | % \footnotesize 73 | \begin{multicols}{2} 74 | 75 | 76 | % multicol parameters 77 | % These lengths are set only within the two main columns 78 | %\setlength{\columnseprule}{0.25pt} 79 | \setlength{\premulticols}{1pt} 80 | \setlength{\postmulticols}{1pt} 81 | \setlength{\multicolsep}{1pt} 82 | \setlength{\columnsep}{2pt} 83 | 84 | \begin{center} 85 | \Large{Linux and Git Guide} \\ 86 | \end{center} 87 | 88 | \section{Git - Keeping Your Code Updated} 89 | 90 | Git is a software used for tracking changes in files and collaborative code development. 91 | GitHub is a provider of Internet hosting for software development and version control using Git. \\ 92 | 93 | Below is the recommended Git flow you should follow for assignment development. \\ 94 | 95 | This guide assumes you have a GitHub account with token based authentication. If not, please follow this \href{https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent}{tutorial} to create an ssh key and link it to your GitHub account to perform Git actions. \\ 96 | 97 | 1. First start off by cloning a git repo to your local machine. 98 | \begin{lstlisting}[language=SQL] 99 | git clone git@github.com:scpd-proed/.git 100 | \end{lstlisting} 101 | 102 | 2. Create a branch to do your own development 103 | \begin{lstlisting}[language=SQL] 104 | git checkout -b 105 | \end{lstlisting} 106 | 107 | 3. If CFs make updates to the assignment commit your unsaved changes 108 | \begin{lstlisting}[language=SQL] 109 | git commit -am "" 110 | \end{lstlisting} 111 | 112 | 4. Merge updates from the assignment to your local branch 113 | \begin{lstlisting}[language=SQL] 114 | git pull origin master 115 | \end{lstlisting} 116 | 117 | After performing step 4, you may experience conflicts. To see which files are in conflict, please run 118 | \begin{lstlisting}[language=SQL] 119 | git status 120 | \end{lstlisting} 121 | 122 | The files in conflict are those "Not staged for commit". 
Open these files in the editor of your choice and look for lines \texttt{<<<<} and \texttt{>>>>}.\\ 123 | 124 | Once you have resolved the conflicts, run 125 | \begin{lstlisting}[language=SQL] 126 | git add -A . 127 | \end{lstlisting} 128 | 129 | And lastly commit your updates to your branch 130 | \begin{lstlisting}[language=SQL] 131 | git commit -am "" 132 | \end{lstlisting} 133 | 134 | Please DO NOT push your local branch changes to GitHub or create pull requests. This will expose your solution code from your development branch and is in violation of the honor code. 135 | 136 | \section{Linux - Commands You Should Know} 137 | Linux is an open-source Unix-like operating system. All our assignments assume you have some understanding on how to run commands on Linux machines. \\ 138 | 139 | Below we will go through popular Linux Commands you should know for assignment development.\\ 140 | 141 | 1. Listing directory (ls) - see files in your current directory 142 | \begin{lstlisting}[language=SQL] 143 | ls 144 | \end{lstlisting} 145 | 146 | 2. Changing directory (cd) - move to another directory 147 | \begin{lstlisting}[language=SQL] 148 | cd /path/to/directory 149 | \end{lstlisting} 150 | 151 | 3. Path Working Directory (pwd) - display pathname of current working directory 152 | \begin{lstlisting}[language=SQL] 153 | pwd 154 | \end{lstlisting} 155 | 156 | 4. Display file contents (cat) 157 | \begin{lstlisting}[language=SQL] 158 | cat file.txt 159 | \end{lstlisting} 160 | 161 | 5. Get a clear console window (clear) 162 | \begin{lstlisting}[language=SQL] 163 | clear 164 | \end{lstlisting} 165 | 166 | 6. Copying files (cp) 167 | \begin{lstlisting}[language=SQL] 168 | cp source destination 169 | \end{lstlisting} 170 | 171 | 7. Remove a file (rm) 172 | \begin{lstlisting}[language=SQL] 173 | rm file.txt 174 | \end{lstlisting} 175 | 176 | 8. Zipping up files (zip) 177 | \begin{lstlisting}[language=SQL] 178 | zip myfile.zip filename.txt 179 | \end{lstlisting} 180 | 181 | 9. Remote log into another machine (ssh) (will be useful for using Azure GPUs) 182 | \begin{lstlisting}[language=SQL] 183 | ssh user@machine 184 | \end{lstlisting} 185 | 186 | 10. Rename or move a file (mv) 187 | \begin{lstlisting}[language=SQL] 188 | mv source destination 189 | \end{lstlisting} 190 | 191 | 192 | \end{multicols} 193 | \end{document} -------------------------------------------------------------------------------- /Probability Review/nips07submit_e.sty: -------------------------------------------------------------------------------- 1 | %%%% NIPS Macros (LaTex) 2 | %%%% Style File 3 | %%%% Dec 12, 1990 Rev Aug 14, 1991; Sept, 1995; April, 1997; April, 1999 4 | 5 | % This file can be used with Latex2e whether running in main mode, or 6 | % 2.09 compatibility mode. 7 | % 8 | % If using main mode, you need to include the commands 9 | % \documentclass{article} 10 | % \usepackage{nips06submit_e,times} 11 | % as the first lines in your document. Or, if you do not have Times 12 | % Roman font available, you can just use 13 | % \documentclass{article} 14 | % \usepackage{nips06submit_e} 15 | % instead. 16 | % 17 | % If using 2.09 compatibility mode, you need to include the command 18 | % \documentstyle[nips06submit_09,times]{article} 19 | % as the first line in your document. Or, if you do not have Times 20 | % Roman font available, you can include the command 21 | % \documentstyle[nips06submit_09]{article} 22 | % instead. 23 | 24 | 25 | % Change the overall width of the page. 
If these parameters are 26 | % changed, they will require corresponding changes in the 27 | % maketitle section. 28 | % 29 | \renewcommand{\topfraction}{0.95} % let figure take up nearly whole page 30 | \renewcommand{\textfraction}{0.05} % let figure take up nearly whole page 31 | 32 | % Specify the dimensions of each page 33 | 34 | \setlength{\paperheight}{11in} 35 | \setlength{\paperwidth}{8.5in} 36 | 37 | \oddsidemargin .5in % Note \oddsidemargin = \evensidemargin 38 | \evensidemargin .5in 39 | \marginparwidth 0.07 true in 40 | %\marginparwidth 0.75 true in 41 | %\topmargin 0 true pt % Nominal distance from top of page to top of 42 | %\topmargin 0.125in 43 | \topmargin -0.625in 44 | \addtolength{\headsep}{0.25in} 45 | \textheight 9.0 true in % Height of text (including footnotes & figures) 46 | \textwidth 5.5 true in % Width of text line. 47 | \widowpenalty=10000 48 | \clubpenalty=10000 49 | 50 | % \thispagestyle{empty} \pagestyle{empty} 51 | \flushbottom \sloppy 52 | 53 | % We're never going to need a table of contents, so just flush it to 54 | % save space --- suggested by drstrip@sandia-2 55 | \def\addcontentsline#1#2#3{} 56 | 57 | % Title stuff, taken from deproc. 58 | \def\maketitle{\par 59 | \begingroup 60 | \def\thefootnote{\fnsymbol{footnote}} 61 | \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author 62 | % name centering 63 | % The footnote-mark was overlapping the footnote-text, 64 | % added the following to fix this problem (MK) 65 | \long\def\@makefntext##1{\parindent 1em\noindent 66 | \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} 67 | \@maketitle \@thanks 68 | \endgroup 69 | \setcounter{footnote}{0} 70 | \let\maketitle\relax \let\@maketitle\relax 71 | \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax} 72 | 73 | % The toptitlebar has been raised to top-justify the first page 74 | % anonymized version: 75 | 76 | \def\makeanontitle{\par 77 | \begingroup 78 | \def\thefootnote{\fnsymbol{footnote}} 79 | \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author 80 | % name centering 81 | % The footnote-mark was overlapping the footnote-text, 82 | % added the following to fix this problem (MK) 83 | \long\def\@makefntext##1{\parindent 1em\noindent 84 | \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} 85 | \@makeanontitle \@thanks 86 | \endgroup 87 | \setcounter{footnote}{0} 88 | \let\makeanontitle\relax \let\@makeanontitle\relax 89 | \gdef\@thanks{}\gdef\@title{}\let\thanks\relax} 90 | 91 | 92 | 93 | 94 | % The non-anonmyous version 95 | \def\@maketitle{\vbox{\hsize\textwidth 96 | \linewidth\hsize \vskip 0.1in \toptitlebar \centering 97 | {\LARGE\bf \@title\par} \bottomtitlebar % \vskip 0.1in % minus 98 | \def\And{\end{tabular}\hfil\linebreak[0]\hfil 99 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 100 | \def\AND{\end{tabular}\hfil\linebreak[4]\hfil 101 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\ignorespaces}% 102 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt}\@author\end{tabular}% 103 | \vskip 0.3in minus 0.1in}} 104 | 105 | % The anonymized version, doesn't matter what people list for authors 106 | \def\@makeanontitle{\vbox{\hsize\textwidth 107 | \linewidth\hsize \vskip 0.1in \toptitlebar \centering 108 | {\LARGE\bf \@title\par} \bottomtitlebar % \vskip 0.1in % minus 109 | \begin{tabular}[t]{c}\bf\rule{\z@}{24pt} 110 | Arian Maleki and Tom Do \\ 111 | Stanford University\\ 112 | \end{tabular}% 113 | \vskip 0.3in minus 0.1in}} 114 | % end anonymized version 115 | 116 | 
\renewenvironment{abstract}{\vskip.075in\centerline{\large\bf 117 | Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex} 118 | 119 | % sections with less space 120 | \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus 121 | -0.5ex minus -.2ex}{1.5ex plus 0.3ex 122 | minus0.2ex}{\large\bf\raggedright}} 123 | 124 | \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus 125 | -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}} 126 | \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex 127 | plus -0.5ex minus -.2ex}{0.5ex plus 128 | .2ex}{\normalsize\bf\raggedright}} 129 | \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus 130 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 131 | \def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus 132 | 0.5ex minus .2ex}{-1em}{\normalsize\bf}} 133 | \def\subsubsubsection{\vskip 134 | 5pt{\noindent\normalsize\rm\raggedright}} 135 | 136 | 137 | % Footnotes 138 | \footnotesep 6.65pt % 139 | \skip\footins 9pt plus 4pt minus 2pt 140 | \def\footnoterule{\kern-3pt \hrule width 12pc \kern 2.6pt } 141 | \setcounter{footnote}{0} 142 | 143 | % Lists and paragraphs 144 | \parindent 0pt 145 | \topsep 4pt plus 1pt minus 2pt 146 | \partopsep 1pt plus 0.5pt minus 0.5pt 147 | \itemsep 2pt plus 1pt minus 0.5pt 148 | \parsep 2pt plus 1pt minus 0.5pt 149 | \parskip .5pc 150 | 151 | 152 | %\leftmargin2em 153 | \leftmargin3pc 154 | \leftmargini\leftmargin \leftmarginii 2em 155 | \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em 156 | 157 | %\labelsep \labelsep 5pt 158 | 159 | \def\@listi{\leftmargin\leftmargini} 160 | \def\@listii{\leftmargin\leftmarginii 161 | \labelwidth\leftmarginii\advance\labelwidth-\labelsep 162 | \topsep 2pt plus 1pt minus 0.5pt 163 | \parsep 1pt plus 0.5pt minus 0.5pt 164 | \itemsep \parsep} 165 | \def\@listiii{\leftmargin\leftmarginiii 166 | \labelwidth\leftmarginiii\advance\labelwidth-\labelsep 167 | \topsep 1pt plus 0.5pt minus 0.5pt 168 | \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt 169 | \itemsep \topsep} 170 | \def\@listiv{\leftmargin\leftmarginiv 171 | \labelwidth\leftmarginiv\advance\labelwidth-\labelsep} 172 | \def\@listv{\leftmargin\leftmarginv 173 | \labelwidth\leftmarginv\advance\labelwidth-\labelsep} 174 | \def\@listvi{\leftmargin\leftmarginvi 175 | \labelwidth\leftmarginvi\advance\labelwidth-\labelsep} 176 | 177 | \abovedisplayskip 7pt plus2pt minus5pt% 178 | \belowdisplayskip \abovedisplayskip 179 | \abovedisplayshortskip 0pt plus3pt% 180 | \belowdisplayshortskip 4pt plus3pt minus3pt% 181 | 182 | % Less leading in most fonts (due to the narrow columns) 183 | % The choices were between 1-pt and 1.5-pt leading 184 | %\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} % got rid of @ (MK) 185 | \def\normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} 186 | \def\small{\@setsize\small{10pt}\ixpt\@ixpt} 187 | \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt} 188 | \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt} 189 | \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt} 190 | \def\large{\@setsize\large{14pt}\xiipt\@xiipt} 191 | \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt} 192 | \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt} 193 | \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt} 194 | \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt} 195 | 196 | \def\toptitlebar{\hrule height4pt\vskip .25in\vskip-\parskip} 197 | 198 | \def\bottomtitlebar{\vskip .29in\vskip-\parskip\hrule height1pt\vskip 199 | .09in} % 200 | %Reduced second vskip to compensate for adding the strut in 
\@author 201 | -------------------------------------------------------------------------------- /Conda Setup/macros-Fall2018.tex: -------------------------------------------------------------------------------- 1 | \newcommand{\newsec}{\section} 2 | \newcommand{\denselist}{\itemsep 0pt\partopsep 0pt} 3 | \newcommand{\bitem}{\begin{itemize}\denselist} 4 | \newcommand{\eitem}{\end{itemize}} 5 | \newcommand{\benum}{\begin{enumerate}\denselist} 6 | \newcommand{\eenum}{\end{enumerate}} 7 | 8 | \newcommand{\fig}[1]{\private{\begin{center} 9 | {\Large\bf ({#1})} 10 | \end{center}}} 11 | 12 | \newcommand{\cpsf}[1]{{\centerline{\psfig{#1}}}} 13 | \newcommand{\mytitle}[1]{\centerline{\LARGE\bf #1}} 14 | 15 | \newcommand{\myw}{{\bf w}} 16 | 17 | \newcommand{\mypar}[1]{\vspace{1ex}\noindent{\bf {#1}}} 18 | 19 | \def\thmcolon{\hspace{-.85em} {\bf :} } 20 | 21 | \newtheorem{THEOREM}{Theorem}[section] 22 | \newenvironment{theorem}{\begin{THEOREM} \thmcolon }% 23 | {\end{THEOREM}} 24 | \newtheorem{LEMMA}[THEOREM]{Lemma} 25 | \newenvironment{lemma}{\begin{LEMMA} \thmcolon }% 26 | {\end{LEMMA}} 27 | \newtheorem{COROLLARY}[THEOREM]{Corollary} 28 | \newenvironment{corollary}{\begin{COROLLARY} \thmcolon }% 29 | {\end{COROLLARY}} 30 | \newtheorem{PROPOSITION}[THEOREM]{Proposition} 31 | \newenvironment{proposition}{\begin{PROPOSITION} \thmcolon }% 32 | {\end{PROPOSITION}} 33 | \newtheorem{DEFINITION}[THEOREM]{Definition} 34 | \newenvironment{definition}{\begin{DEFINITION} \thmcolon \rm}% 35 | {\end{DEFINITION}} 36 | \newtheorem{CLAIM}[THEOREM]{Claim} 37 | \newenvironment{claim}{\begin{CLAIM} \thmcolon \rm}% 38 | {\end{CLAIM}} 39 | \newtheorem{EXAMPLE}[THEOREM]{Example} 40 | \newenvironment{example}{\begin{EXAMPLE} \thmcolon \rm}% 41 | {\end{EXAMPLE}} 42 | \newtheorem{REMARK}[THEOREM]{Remark} 43 | \newenvironment{remark}{\begin{REMARK} \thmcolon \rm}% 44 | {\end{REMARK}} 45 | %\newenvironment{proof}{\noindent {\bf Proof:} \hspace{.677em}}% 46 | % {} 47 | 48 | %theorem 49 | \newcommand{\thm}{\begin{theorem}} 50 | %lemma 51 | \newcommand{\lem}{\begin{lemma}} 52 | %proposition 53 | \newcommand{\pro}{\begin{proposition}} 54 | %definition 55 | \newcommand{\dfn}{\begin{definition}} 56 | %remark 57 | \newcommand{\rem}{\begin{remark}} 58 | %example 59 | \newcommand{\xam}{\begin{example}} 60 | %corollary 61 | \newcommand{\cor}{\begin{corollary}} 62 | %proof 63 | \newcommand{\prf}{\noindent{\bf Proof:} } 64 | %end theorem 65 | \newcommand{\ethm}{\end{theorem}} 66 | %end lemma 67 | \newcommand{\elem}{\end{lemma}} 68 | %end proposition 69 | \newcommand{\epro}{\end{proposition}} 70 | %end definition 71 | \newcommand{\edfn}{\bbox\end{definition}} 72 | %end remark 73 | \newcommand{\erem}{\bbox\end{remark}} 74 | %end example 75 | \newcommand{\exam}{\bbox\end{example}} 76 | %end corollary 77 | \newcommand{\ecor}{\end{corollary}} 78 | %end proof 79 | \newcommand{\eprf}{\bbox\vspace{0.1in}} 80 | %begin equation 81 | \newcommand{\beqn}{\begin{equation}} 82 | %end equation 83 | \newcommand{\eeqn}{\end{equation}} 84 | 85 | %\newcommand{\eqref}[1]{Eq.~\ref{#1}} 86 | 87 | \newcommand{\KB}{\mbox{\it KB\/}} 88 | \newcommand{\infers}{\vdash} 89 | \newcommand{\sat}{\models} 90 | \newcommand{\bbox}{\vrule height7pt width4pt depth1pt} 91 | 92 | \newcommand{\act}[1]{\stackrel{{#1}}{\rightarrow}} 93 | \newcommand{\at}[1]{^{(#1)}} 94 | 95 | \newcommand{\argmax}{{\rm argmax}} 96 | \newcommand{\V}{{\cal V}} 97 | \newcommand{\C}{{\cal C}} 98 | \newcommand{\calL}{{\cal L}} 99 | 100 | \newcommand{\rimp}{\Rightarrow} 101 | 
\newcommand{\dimp}{\Leftrightarrow} 102 | 103 | \newcommand{\nf}{\bar{f}} 104 | \newcommand{\ns}{\bar{s}} 105 | \newcommand{\na}{\bar{a}} 106 | \newcommand{\nh}{\bar{h}} 107 | \newcommand{\nr}{\bar{r}} 108 | 109 | 110 | \newcommand{\bX}{\mbox{\boldmath $X$}} 111 | \newcommand{\bY}{\mbox{\boldmath $Y$}} 112 | \newcommand{\bZ}{\mbox{\boldmath $Z$}} 113 | \newcommand{\bU}{\mbox{\boldmath $U$}} 114 | \newcommand{\bE}{\mbox{\boldmath $E$}} 115 | \newcommand{\bx}{\mbox{\boldmath $x$}} 116 | \newcommand{\be}{\mbox{\boldmath $e$}} 117 | \newcommand{\by}{\mbox{\boldmath $y$}} 118 | \newcommand{\bz}{\mbox{\boldmath $z$}} 119 | \newcommand{\bu}{\mbox{\boldmath $u$}} 120 | \newcommand{\bd}{\mbox{\boldmath $d$}} 121 | \newcommand{\smbx}{\mbox{\boldmath $\scriptstyle x$}} 122 | \newcommand{\smbd}{\mbox{\boldmath $\scriptstyle d$}} 123 | \newcommand{\smby}{\mbox{\boldmath $\scriptstyle y$}} 124 | \newcommand{\smbe}{\mbox{\boldmath $\scriptstyle e$}} 125 | 126 | \newcommand{\Parents}{\mbox{\it Parents\/}} 127 | \newcommand{\B}{{\cal B}} 128 | 129 | \newcommand{\word}[1]{\mbox{\it #1\/}} 130 | \newcommand{\Action}{\word{Action}} 131 | \newcommand{\Proposition}{\word{Proposition}} 132 | \newcommand{\true}{\word{true}} 133 | \newcommand{\false}{\word{false}} 134 | \newcommand{\Pre}{\word{Pre}} 135 | \newcommand{\Add}{\word{Add}} 136 | \newcommand{\Del}{\word{Del}} 137 | \newcommand{\Result}{\word{Result}} 138 | \newcommand{\Regress}{\word{Regress}} 139 | \newcommand{\Maintain}{\word{Maintain}} 140 | 141 | \newcommand{\bor}{\bigvee} 142 | \newcommand{\invert}[1]{{#1}^{-1}} 143 | 144 | \newcommand{\commentout}[1]{} 145 | 146 | \newcommand{\bmu}{\mbox{\boldmath $\mu$}} 147 | \newcommand{\btheta}{\mbox{\boldmath $\theta$}} 148 | \newcommand{\IR}{\mbox{$I\!\!R$}} 149 | 150 | \newcommand{\tval}[1]{{#1}^{1}} 151 | \newcommand{\fval}[1]{{#1}^{0}} 152 | 153 | \newcommand{\tr}{{\rm tr}} 154 | \newcommand{\vecy}{{\vec{y}}} 155 | \renewcommand{\Re}{{\mathbb R}} 156 | 157 | \def\twofigbox#1#2{% 158 | \noindent\begin{minipage}{\textwidth}% 159 | \epsfxsize=0.35\maxfigwidth 160 | \noindent \epsffile{#1}\hfill 161 | \epsfxsize=0.35\maxfigwidth 162 | \epsffile{#2}\\ 163 | \makebox[0.35\textwidth]{(a)}\hfill\makebox[0.35\textwidth]{(b)}% 164 | \end{minipage}} 165 | 166 | \def\twofigboxcd#1#2{% 167 | \noindent\begin{minipage}{\textwidth}% 168 | \epsfxsize=0.35\maxfigwidth 169 | \noindent \epsffile{#1}\hfill 170 | \epsfxsize=0.35\maxfigwidth 171 | \epsffile{#2}\\ 172 | \makebox[0.35\textwidth]{(c)}\hfill\makebox[0.35\textwidth]{(d)}% 173 | \end{minipage}} 174 | 175 | \def\twofigboxnolabel#1#2{% 176 | \begin{figure}[h] 177 | \centering 178 | \begin{minipage}{.5\textwidth} 179 | \centering 180 | \includegraphics[width=0.75\linewidth]{#1} 181 | \end{minipage}% 182 | \begin{minipage}{.5\textwidth} 183 | \centering 184 | \includegraphics[width=0.75\linewidth]{#2} 185 | \end{minipage} 186 | \end{figure} 187 | } 188 | 189 | \def\threefigbox#1#2#3{% 190 | \noindent\begin{minipage}{\textwidth}% 191 | \epsfxsize=0.33\maxfigwidth 192 | \noindent \epsffile{#1}\hfill 193 | \epsfxsize=0.33\maxfigwidth 194 | \noindent \epsffile{#2}\hfill 195 | \epsfxsize=0.33\maxfigwidth 196 | \epsffile{#3}\\ 197 | \makebox[0.31\textwidth]{{\scriptsize (a)}}\hfill% 198 | \makebox[0.31\textwidth]{{\scriptsize (b)}}\hfill 199 | \makebox[0.31\textwidth]{{\scriptsize (c)}}% 200 | \smallskip 201 | \end{minipage}} 202 | 203 | 204 | \def\threefigbox#1#2#3{% 205 | \begin{figure}[H] 206 | \centering 207 | \begin{subfigure}{.33\textwidth} 208 | \centering 209 | 
\includegraphics[width=0.75\linewidth]{#1} 210 | \caption{} 211 | \end{subfigure}% 212 | \begin{subfigure}{.33\textwidth} 213 | \centering 214 | \includegraphics[width=0.75\linewidth]{#2} 215 | \caption{} 216 | \end{subfigure}% 217 | \begin{subfigure}{.33\textwidth} 218 | \centering 219 | \includegraphics[width=0.75\linewidth]{#3} 220 | \caption{} 221 | \end{subfigure} 222 | \end{figure} 223 | } 224 | 225 | 226 | \def\threefigboxnolabel#1#2#3{% 227 | \centering 228 | \begin{subfigure}{.33\textwidth} 229 | \centering 230 | \includegraphics[width=0.75\linewidth]{#1} 231 | \end{subfigure}% 232 | \begin{subfigure}{.33\textwidth} 233 | \centering 234 | \includegraphics[width=0.75\linewidth]{#2} 235 | \end{subfigure}% 236 | \begin{subfigure}{.33\textwidth} 237 | \centering 238 | \includegraphics[width=0.75\linewidth]{#3} 239 | \end{subfigure} 240 | \end{figure} 241 | } 242 | 243 | \newlength{\maxfigwidth} 244 | \setlength{\maxfigwidth}{\textwidth} 245 | %\def\captionsize {\footnotesize} 246 | \def\captionsize {} 247 | 248 | \newcommand{\xsi}{{x^{(i)}}} 249 | \newcommand{\ysi}{{y^{(i)}}} 250 | \newcommand{\wsi}{{w^{(i)}}} 251 | \newcommand{\esi}{{\epsilon^{(i)}}} 252 | \newcommand{\calN}{{\cal N}} 253 | \newcommand{\calX}{{\cal X}} 254 | \newcommand{\calY}{{\cal Y}} 255 | \newcommand{\ytil}{{\tilde{y}}} 256 | 257 | \newcommand{\beas}{\begin{eqnarray*}} 258 | \newcommand{\eeas}{\end{eqnarray*}} 259 | 260 | \newcommand{\Ber}{{\rm Bernoulli}} 261 | \newcommand{\Bernoulli}{{\rm Bernoulli}} 262 | \newcommand{\E}{{\rm E}} 263 | 264 | \setlength{\parindent}{0pt} 265 | \setlength{\parskip}{4pt} 266 | -------------------------------------------------------------------------------- /Conda Setup/main.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | %\documentstyle[11pt,handout,psfig]{article} 3 | 4 | \usepackage{fullpage,amssymb,amsmath,tikz,forest,float,subcaption,braket} 5 | \usetikzlibrary{arrows.meta} 6 | \usepackage{graphicx} 7 | \usepackage{hyperref} 8 | \hypersetup{ 9 | colorlinks=true, 10 | linkcolor=blue, 11 | filecolor=magenta, 12 | urlcolor=cyan, 13 | } 14 | 15 | \forestset{ 16 | default preamble={ 17 | for tree={ 18 | font=\tiny, 19 | base=bottom, 20 | child anchor=north, 21 | align=center, 22 | s sep+=1.3cm, 23 | straight edge/.style={ 24 | edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 25 | (!u.parent anchor) -- (.child anchor);} 26 | }, 27 | if n children={0} 28 | {tier=word, draw, thick, rectangle} 29 | { 30 | if n children={1} 31 | {draw, thick, rectangle, parent anchor=south} 32 | {draw, diamond, thick, aspect=2} 33 | }, 34 | if n=1{% 35 | if n'=1 36 | {edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 37 | (!u.parent anchor) -| (.child anchor);}} 38 | {edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 39 | (!u.parent anchor) -| (.child anchor) node[pos=.2, above] {Y};}} 40 | }{ 41 | edge path={\noexpand\path[\forestoption{edge},thick,-{Latex}] 42 | (!u.parent anchor) -| (.child anchor) node[pos=.2, above] {N};} 43 | } 44 | } 45 | } 46 | } 47 | \usepackage{xcolor} 48 | \usepackage{listings} 49 | \lstset{basicstyle=\ttfamily, 50 | showstringspaces=false, 51 | commentstyle=\color{red}, 52 | keywordstyle=\color{blue} 53 | } 54 | 55 | \usepackage[12pt]{extsizes} 56 | \usepackage{gensymb} 57 | \usepackage{graphicx} 58 | 59 | 60 | 61 | 62 | %These give really tight margins: 63 | %\setlength{\topmargin}{-0.3in} 64 | %\setlength{\textheight}{8.10in} 65 | %\setlength{\textwidth}{5.8in} 66 | 
%\setlength{\baselineskip}{0.1875in} 67 | %\addtolength{\leftmargin}{-2.775in} 68 | %\setlength{\footskip}{0.45in} 69 | %\setlength{\oddsidemargin}{0.5in} 70 | %\setlength{\evensidemargin}{0.5in} 71 | %%\setlength{\headsep}{0pt} 72 | %%\setlength{\headheight}{0pt} 73 | 74 | %\setlength{\topmargin}{-0.5in} 75 | \setlength{\textheight}{8in} 76 | %\setlength{\textwidth}{5.0in} 77 | %\setlength{\baselineskip}{0.1875in} 78 | %\addtolength{\leftmargin}{-2.775in} 79 | %\setlength{\footskip}{0.45in} 80 | %\setlength{\oddsidemargin}{0.5in} 81 | %\setlength{\evensidemargin}{0.5in} 82 | %%\setlength{\headsep}{0pt} 83 | %%\setlength{\headheight}{0pt} 84 | 85 | 86 | \markright{} 87 | \pagestyle{myheadings} 88 | 89 | \input{macros-Fall2018} 90 | \begin{document} 91 | \title{Conda Setup for XCS Courses} 92 | %\author{} 93 | \date{} 94 | \maketitle 95 | 96 | 97 | \part*{What is Conda?} 98 | 99 | Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies. Conda easily creates, saves, loads, and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language. 100 | 101 | Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment. 102 | 103 | \section{Installation} 104 | 105 | The fastest way to obtain conda is to install Miniconda, a mini version of Anaconda that includes only conda and its dependencies. If you want to take advantage of a less common CPU architecture (aarch64, including Apple Silicon, or ppc64le), the best option is to install Miniforge. If you prefer to have conda plus over 7,500 open-source packages, install Anaconda. 106 | 107 | Here are the system requirements for installation: 108 | \begin{itemize} 109 | \item 32- or 64-bit computer 110 | \item For Miniconda/Miniforge - 400 MB of disk space 111 | \item For Anaconda - minimum 3 GB of disk space to download and install 112 | \item Windows, macOS, or Linux 113 | \end{itemize} 114 | 115 | For platform-specific download instructions, \href{https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html}{please refer to this link}. 116 | 117 | 118 | \textbf{For Windows}: After installing Miniconda, open the Windows search menu and locate the Anaconda Command Prompt to run the commands below. 119 | \begin{center} 120 | \includegraphics[scale=0.5]{acp.png} 121 | \end{center} 122 | 123 | 124 | To verify the Conda installation, please run: 125 | \begin{lstlisting}[language=bash] 126 | conda info 127 | \end{lstlisting} 128 | If you have previously installed Miniconda, be sure to update it before moving on to Section 2: 129 | \begin{lstlisting}[language=bash] 130 | # Update to the latest version of Conda 131 | conda update -n base conda 132 | # Update all packages to the latest version of Anaconda 133 | conda update anaconda 134 | \end{lstlisting} 135 | 136 | \section{Creating and activating environment from YAML file} 137 | 138 | After you have installed Conda, we have provided a file called \textbf{environment.yml}. Note that the \textbf{environment.yml} will be provided on the first day of the course, with the release of the first assignment. This YAML file creates a virtual environment with a specified name (used for activation) and installs the dependencies needed for the coding assignments. Here is what an example environment.yml file looks like. 139 | \begin{center} 140 | \includegraphics[scale=0.75]{conda-update.png} 141 | \end{center} 142 | \textbf{Note: }Although there are multiple ways of completing the homework assignments, the libraries we provide in the YAML file should be the only ones you use in the coding assignments. 143 | \textbf{For the remainder of this setup we will be using \textit{your-env-name} to specify the name of your environment (e.g., XCS229, XCS221, etc.). You can find the name of your virtual environment by looking for the name parameter within the environment.yml file.} 144 | 145 | To create your environment from the YAML file, please run: 146 | \begin{lstlisting}[language=bash] 147 | conda env create --file environment.yml 148 | \end{lstlisting} 149 | Now, to verify that the environment was installed correctly, please run the command below; you should see an environment name that matches the name parameter in the environment.yml file: 150 | \begin{lstlisting}[language=bash] 151 | conda env list 152 | \end{lstlisting} 153 | 154 | Next, you will activate the created environment. To do so, you will need to call the following command: 155 | \begin{lstlisting}[language=bash] 156 | conda activate your-env-name 157 | \end{lstlisting} 158 | 159 | Now in this environment, you will have all the relevant dependencies needed to complete the assignments. If you want to deactivate your environment, please run: 160 | \begin{lstlisting}[language=bash] 161 | conda deactivate 162 | \end{lstlisting} 163 | 164 | \section{Converting Environment to Jupyter Notebook} 165 | When completing the assignments, it may be helpful to use a Jupyter Notebook as a scratchpad to help you debug or formulate potential solutions. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. The first step to creating your Jupyter Notebook for the assignments is to activate your environment as follows: 166 | \begin{lstlisting}[language=bash] 167 | conda activate your-env-name 168 | \end{lstlisting} 169 | 170 | Next, in the active environment, type: 171 | \begin{lstlisting}[language=bash] 172 | # Install ipykernel 173 | conda install -c anaconda ipykernel 174 | # Register the environment as a new kernel 175 | ipython kernel install --user --name=your-env-name 176 | \end{lstlisting} 177 | 178 | Last but not least, be sure to run the following command to launch Jupyter Notebook from the active environment: 179 | \begin{lstlisting}[language=bash] 180 | jupyter notebook 181 | \end{lstlisting} 182 | 183 | That command will take you to your Jupyter Notebook directory. When you hover over the \textbf{New} button in the top right-hand corner, you should see your environment listed as an option. 184 | \begin{center} 185 | \includegraphics[]{p2.png} 186 | \end{center} 187 | Select it, and it will create an empty Python notebook where you can copy and paste assignment code or use it to formulate your thoughts.
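
As an optional sanity check (a small sketch, not part of the required setup), you can confirm from a cell of the new notebook that the kernel is running the interpreter from your Conda environment and that a package from your environment.yml imports correctly. The package name below (numpy) is only an illustration and may differ from your course's dependencies:
\begin{lstlisting}[language=Python]
# Optional sanity check -- run this in a cell of the new notebook.
# "numpy" is only an example; substitute a package listed in your
# course's environment.yml.
import sys
print(sys.executable)  # path should point inside your conda environment

import importlib.util
spec = importlib.util.find_spec("numpy")
print("numpy is", "available" if spec is not None else "NOT available")
\end{lstlisting}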
188 | 189 | In the event you would like to remove your environment from Jupyter, simply run the following command: 190 | \begin{lstlisting}[language=bash] 191 | jupyter kernelspec uninstall your-env-name 192 | \end{lstlisting} 193 | 194 | You are now equipped with all the tools to complete the coding assignments. For a list of the most useful Conda functions, \href{https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf}{please refer to this document}. 195 | 196 | \section{Updating Existing Conda Environment} 197 | If you plan on reusing an existing Conda environment from a previous SCPD-AI course, we strongly recommend that you either update it when the course starts or replace it with the new Conda environment as described above. Below are the steps to update your existing Conda environment in preparation for this course. 198 | 199 | \begin{lstlisting}[language=bash] 200 | conda activate your-env-name 201 | conda env update --file environment.yml 202 | \end{lstlisting} 203 | 204 | \end{document} 205 | -------------------------------------------------------------------------------- /Probability Review/Report.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | \usepackage{nips07submit_e,times} 3 | \usepackage{graphicx} 4 | %\documentstyle[nips07submit_09,times]{article} 5 | 6 | \title{Review of Probability Theory} 7 | 8 | 9 | \author{ 10 | Arian Maleki \\ 11 | Department of Electrical Engineering\\ 12 | Stanford University\\ 13 | Stanford, CA 94305 \\ 14 | \texttt{arianm@stanford.edu} \\ 15 | } 16 | 17 | % The \author macro works with any number of authors. There are two commands 18 | % used to separate the names and addresses of multiple authors: \And and \AND. 19 | % 20 | % Using \And between authors leaves it to \LaTeX{} to determine where to break 21 | % the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{} 22 | % puts 3 of 4 authors names on the first line, and the last on the second 23 | % line, try using \AND instead of \And before the third author name. 24 | \usepackage{amssymb} 25 | \usepackage{graphicx} 26 | \usepackage{amsmath} 27 | \usepackage{color} 28 | \usepackage{dsfont} 29 | \usepackage{amsthm} 30 | 31 | \DeclareMathOperator*{\argmax}{arg\,max} 32 | 33 | \newtheorem{thm}{Theorem}[section] 34 | \newtheorem*{definition}{Definition} 35 | \newtheorem{cor}[thm]{Corollary} 36 | \newtheorem{lem}[thm]{Lemma} 37 | \begin{document} 38 | 39 | \makeanontitle 40 | 41 | Probability theory is the study of uncertainty. Through this class, we will be relying on concepts from probability 42 | theory for deriving machine learning algorithms. These notes attempt to cover the basics of probability theory at a 43 | level appropriate for CS 229. The mathematical theory of probability is very sophisticated, and 44 | delves into a branch of analysis known as \textbf{measure theory}. In 45 | these notes, we provide a basic treatment of probability that does 46 | not address these finer details. 47 | 48 | \section{Elements of probability} 49 | 50 | In order to define a probability on a set we need a few basic elements, 51 | \begin{itemize} 52 | \item \textbf{Sample space} $\Omega$: The set of all the outcomes of a random 53 | experiment. Here, each outcome $\omega \in \Omega$ can be thought 54 | of as a complete description of the state of the real world at the 55 | end of the experiment.
56 | \item \textbf{Set of events} (or \textbf{event space}) $\mathcal{F}$: 57 | A set whose 58 | elements $A \in \mathcal{F}$ (called \textbf{events}) are subsets of $\Omega$ 59 | (i.e., $A \subseteq \Omega$ is a collection of possible outcomes of an 60 | experiment).\footnote{ 61 | $\mathcal{F}$ should satisfy three properties: 62 | (1) $\emptyset \in \mathcal{F}$; 63 | (2) $A \in \mathcal{F} \Longrightarrow \Omega \setminus A \in \mathcal{F}$; and 64 | (3) $A_1, A_2 , \ldots \in \mathcal{F} \Longrightarrow \cup_i A_i \in \mathcal{F}$. 65 | } 66 | \item \textbf{Probability measure}: A function $P : \mathcal{F} \rightarrow \mathbb{R}$ that satisfies the following properties, 67 | \begin{description} 68 | \item[-] $P(A) \geq 0$, for all $A \in \mathcal{F}$ 69 | \item[-] $P(\Omega)=1$ 70 | \item[-] If $A_1,A_2, \ldots$ are disjoint events (i.e., $A_i \cap A_j= \emptyset$ whenever $i \neq j$), then 71 | \begin{equation*} 72 | P(\cup_{i} A_i)= \sum_i P(A_i) 73 | \end{equation*} 74 | \end{description} 75 | \end{itemize} 76 | These three properties are called the \textbf{Axioms of Probability}. 77 | 78 | \textbf{Example}: Consider the experiment of tossing a six-sided die. The 79 | sample space is $\Omega= \{1,2,3,4,5,6\}$. We can define different 80 | event spaces on this sample space. For example, the simplest event 81 | space is the trivial event space $\mathcal{F}=\{\emptyset, 82 | \Omega\}$. Another event space is the set of all subsets of 83 | $\Omega$. For the first event space, the unique probability measure 84 | satisfying the requirements above is 85 | given by $P(\emptyset)=0, P(\Omega)=1$. For the second event space, 86 | one valid probability measure is to assign the probability of each set in the event space to be $\frac{i}{6}$ 87 | where $i$ is the number of elements of that set; for example, 88 | $P(\{1,2,3,4\})= \frac{4}{6}$ and $P(\{1,2,3\})= \frac{3}{6}$. \\ 89 | 90 | \textbf{Properties}: 91 | \begin{description} 92 | \item[-] {If $ {A \subseteq B \Longrightarrow P(A) \leq P(B)}$.} 93 | \item[-] ${P(A \cap B) \leq \min(P(A), P(B))}$. 94 | \item[-] (Union Bound) ${P(A \cup B) \leq P(A)+P(B)}$. 95 | \item[-] ${P(\Omega \setminus A)=1-P(A)}$. 96 | \item[-] {(Law of Total Probability) If $A_1,\ldots,A_k$ are a set of disjoint events such that $\cup_{i=1}^k A_i = \Omega$, then $\sum_{i=1}^k P(A_i) = 1$.} 97 | \end{description} 98 | 99 | \subsection{Conditional probability and independence} 100 | 101 | Let $B$ be an event with non-zero probability. The conditional probability of any event $A$ given $B$ is defined as, 102 | \begin{equation*} 103 | P(A | B)\triangleq \frac{P(A \cap B)}{P(B)} 104 | \end{equation*} 105 | In other words, $P(A| B)$ is the probability measure of the event $A$ 106 | after observing the occurrence of event $B$. Two events are called 107 | independent if and only if $P(A \cap B)= P(A)P(B)$ (or equivalently, $P(A|B)=P(A)$). Therefore, 108 | independence is equivalent to saying that observing $B$ does not have 109 | any effect on the probability of $A$. 110 | 111 | \section{Random variables} 112 | 113 | Consider an experiment in which we flip 10 coins, and we want to know the 114 | number of coins that come up heads. Here, the elements of the sample 115 | space $\Omega$ are 10-length sequences of heads and tails. For example, 116 | we might have $\omega_0 = \langle H, H, T, H, T, H, H, T, T, T \rangle \in 117 | \Omega$. However, in practice, we usually do not care about the 118 | probability of obtaining any particular sequence of heads and tails. 119 | Instead we usually care about real-valued functions of outcomes, such as the 120 | number of heads that appear among our 10 tosses, or the 121 | length of the longest run of tails. These functions, under some 122 | technical conditions, are known as \textbf{random variables}. 123 | 124 | More formally, a random variable $X$ is a function $X:\Omega 125 | \longrightarrow \mathbb{R}$.\footnote{ Technically speaking, not every 126 | function is acceptable as a random variable. From a 127 | measure-theoretic perspective, random variables must be 128 | Borel-measurable functions. Intuitively, this restriction ensures 129 | that given a random variable and its underlying outcome space, one 130 | can implicitly define each of the events of the event space as 131 | being sets of outcomes $\omega \in \Omega$ for which $X(\omega)$ 132 | satisfies some property (e.g., the event $\{\omega : X(\omega) \ge 3 \}$). 133 | } Typically, we will denote random 134 | variables using upper case letters $X(\omega)$ or more simply $X$ 135 | (where the dependence on the random outcome $\omega$ is implied). We 136 | will denote the value that a random variable may take on using lower 137 | case letters $x$. 138 | 139 | \textbf{Example}: In our experiment above, suppose that $X(\omega)$ is 140 | the number of heads which occur in the sequence of tosses $\omega$. 141 | Given that only 10 coins are tossed, $X(\omega)$ can take only a 142 | finite number of values, so it is known as a \textbf{discrete random 143 | variable}. Here, the probability of the set associated with a random 144 | variable $X$ taking on some specific value $k$ is 145 | \begin{eqnarray*} 146 | P(X = k) := P(\{\omega : X(\omega) = k\}). 147 | \end{eqnarray*} 148 | 149 | \textbf{Example}: Suppose that $X(\omega)$ is a random variable 150 | indicating the amount of time it takes for a radioactive particle to 151 | decay. In this case, $X(\omega)$ takes on an infinite number of 152 | possible values, so it is called a \textbf{continuous random 153 | variable}. We denote the probability that $X$ takes on a value between 154 | two real constants $a$ and $b$ (where $a < b$) as 155 | \begin{eqnarray*} 156 | P(a \le X \le b) := P(\{\omega : a \le X(\omega) \le b\}). 157 | \end{eqnarray*} 158 | 159 | \subsection{Cumulative distribution functions} 160 | 161 | In order to specify the probability measures used when dealing with 162 | random variables, it is often convenient to specify alternative 163 | functions (CDFs, PDFs, and PMFs) from which the probability measure 164 | governing an experiment immediately follows. In this section and the 165 | next two sections, we describe each of these types of functions in 166 | turn. 167 | 168 | A \textbf{cumulative distribution function (CDF)} is a function $F_X:\mathbb{R} \rightarrow [0,1]$ which specifies a probability measure as, 169 | \begin{equation} 170 | F_X(x) \triangleq P(X \leq x). 171 | \end{equation} 172 | By using this function one can calculate the probability of any event in $\mathcal{F}$.\footnote{This is a remarkable fact and is actually a theorem that is proved in more advanced courses.} %How? First of all they define a $\pi$-system. $\mathcal{P}$ is a $\pi$-system, if $I,J \in \mathcal{P} \Longrightarrow I \cap J \in \mathcal{P}$. Then there is a theorem that says that if two probability measures are equal on a $\pi$-system then they are equal to each other on the $\sigma$-algebra generated by that $\pi$-system.
Now the set of [-\infty, a] generates the Borel sigma algebra and is a $\pi$-system and therefore defines the probability uniquely!!! 173 | Figure \ref{fig1} shows a sample CDF function. 174 | 175 | \textbf{Properties}: 176 | \begin{description} 177 | \item[-] $0 \leq F_X(x) \leq 1$. 178 | \item[-] $\lim_{x \rightarrow -\infty} F_X(x)=0$. 179 | \item[-] $\lim_{x \rightarrow \infty} F_X(x)=1$. 180 | \item[-] $x \leq y \Longrightarrow F_X(x)\leq F_X(y)$. 181 | \end{description} 182 | 183 | \begin{figure} 184 | \begin{center} 185 | \includegraphics[width=7cm]{fig1}\\ 186 | \caption{A cumulative distribution function (CDF).}\label{fig1} 187 | \end{center} 188 | \end{figure} 189 | 190 | \subsection{Probability mass functions} 191 | 192 | When a random variable $X$ takes on a finite set of possible values (i.e., $X$ is a discrete random variable), a simpler way to represent the probability measure associated with 193 | a random variable is to directly specify the probability of each value that the random variable can assume. In particular, a \emph{probability 194 | mass function (PMF)} is a function $p_X : \Omega \rightarrow \mathbb{R}$ such that 195 | \begin{align*} 196 | p_X(x)\triangleq P(X=x). 197 | \end{align*} 198 | In the case of discrete random variable, we use the notation $Val(X)$ for the set of possible values that the random variable $X$ may assume. For example, if $X(\omega)$ is a random 199 | variable indicating the number of heads out of ten tosses of coin, then $Val(X) = \{0,1,2,\ldots,10\}$. 200 | 201 | \textbf{Properties}: 202 | \begin{description} 203 | \item [-] $0 \leq p_X(x) \leq 1$. 204 | \item [-] $\sum_{x \in Val(X)} p_X(x)=1$. 205 | \item [-] $\sum_{x \in A} p_X(x) = P(X \in A)$. 206 | \end{description} 207 | 208 | \subsection{Probability density functions} 209 | 210 | For some continuous random variables, the cumulative distribution function $F_X(x)$ is differentiable everywhere. In these cases, we define the \textbf{Probability Density Function} or \textbf{PDF} 211 | as the derivative of the CDF, i.e., 212 | \begin{equation} 213 | f_X(x)\triangleq \frac{dF_X(x)}{dx}. 214 | \end{equation} 215 | Note here, that the PDF for a continuous random variable may not always exist (i.e., if $F_X(x)$ is not differentiable everywhere). 216 | 217 | According to the properties of differentiation, for very small $\Delta x$, 218 | \begin{equation} 219 | P(x \leq X \leq x+ \Delta x) \approx f_X(x)\Delta x. 220 | \end{equation} 221 | Both CDFs and PDFs (when they exist!) can be used for 222 | calculating the probabilities of different events. But it should be 223 | emphasized that the value of PDF at any given point $x$ is not the probability of 224 | that event, i.e., $f_X(x) \neq P(X = x)$. For example, $f_X(x)$ can take on 225 | values larger than one (but the integral of $f_X(x)$ over any subset of $\mathbb{R}$ will 226 | be at most one). 227 | 228 | \textbf{Properties}: 229 | \begin{description} 230 | \item [-] $\small{f_X(x) \geq 0}$ . 231 | \item [-] $\small {\int_{-\infty}^\infty f_X(x) =1}$. 232 | \item [-] $\int_{x \in A} f_X(x) dx = P(X \in A)$. 233 | \end{description} 234 | 235 | \subsection{Expectation} 236 | Suppose that $X$ is a discrete random variable with PMF $p_X(x)$ and $g: \mathbb{R} \longrightarrow \mathbb{R}$ is an arbitrary function. In this case, 237 | $g(X)$ can be considered a random variable, and we define the \textbf{expectation} or \textbf{expected value} of $g(X)$ as 238 | \begin{align*} 239 | E[g(X)] \triangleq \sum_{x \in Val(X)} g(x) p_X(x). 
240 | \end{align*} 241 | If $X$ is a continuous random variable with PDF $f_X(x)$, then the expected value of $g(X)$ is defined as, 242 | \begin{align*} 243 | E[g(X)]\triangleq \int_{-\infty}^{\infty} g(x) f_X(x) dx. 244 | \end{align*} 245 | Intuitively, the expectation of $g(X)$ can be thought of as a ``weighted average'' of 246 | the values that $g(x)$ can take on for different values of $x$, where the weights are 247 | given by $p_X(x)$ or $f_X(x)$. 248 | As a special case of the above, note that the expectation, $E[X]$, of a random variable itself 249 | is found by letting $g(x) = x$; this is also known as the \textbf{mean} of the random variable $X$. 250 | 251 | \textbf{Properties}: 252 | \begin{description} 253 | \item[-] $E[a] = a$ for any constant $a \in \mathbb{R}$. 254 | \item[-] $E[af(X)] = aE[f(X)]$ for any constant $a \in \mathbb{R}$. 255 | \item[-] (Linearity of Expectation) $E[f(X) + g(X)] = E[f(X)] + E[g(X)]$. 256 | \item[-] For a discrete random variable $X$, $E[1\{X = k \}] = P(X = k)$. 257 | \end{description} 258 | 259 | \subsection{Variance} 260 | 261 | The \textbf{variance} of a random variable $X$ is a measure of how concentrated the distribution of a random variable $X$ 262 | is around its mean. Formally, the variance of a random variable $X$ is defined as 263 | \begin{align*} 264 | Var[X] \triangleq E[(X-E[X])^2] 265 | \end{align*} 266 | Using the properties in the previous section, we can derive an alternate expression for the variance: 267 | \begin{eqnarray*} 268 | E[(X - E[X])^2] &=& E[X^2 - 2E[X] X + E[X]^2] \\ &=& E[X^2] - 2 E[X] E[X] + E[X]^2 \\ &=& E[X^2] - E[X]^2, 269 | \end{eqnarray*} 270 | where the second equality follows from linearity of expectations and the fact that $E[X]$ is actually a constant with 271 | respect to the outer expectation. 272 | 273 | \textbf{Properties}: 274 | \begin{description} 275 | \item[-] $Var[a] = 0$ for any constant $a \in \mathbb{R}$. 276 | \item[-] $Var[a f(X)] = a^2 Var[f(X)]$ for any constant $a \in \mathbb{R}$. 277 | \end{description} 278 | 279 | \textbf{Example}: Calculate the mean and the variance of the uniform random variable $X$ with PDF $f_X(x)=1, \ \ \forall x \in [0,1]$, 0 elsewhere. 280 | \begin{align*} 281 | E[X]= \int_{-\infty}^{\infty} xf_X(x) dx= \int_{0}^{1}xdx=\frac{1}{2}. \nonumber 282 | \end{align*} 283 | \begin{align*} 284 | E[X^2]=\int_{-\infty}^{\infty} x^2f_X(x) dx= \int_{0}^{1}x^2dx=\frac{1}{3}. \nonumber 285 | \end{align*} 286 | \begin{align*} 287 | Var[X]=E[X^2]-E[X]^2= \frac{1}{3}- \frac{1}{4}=\frac{1}{12}. \nonumber 288 | \end{align*} 289 | 290 | \textbf{Example:} Suppose that $g(x)= 1\{x \in A\}$ for some subset $A \subseteq \Omega$. 291 | What is $E[g(X)]$? 292 | 293 | Discrete case: 294 | \begin{align*} 295 | E[g(X)]= \sum_{x \in Val(X)} 1\{x \in A\} p_X(x)= \sum_{x \in A} p_X(x)= P(X \in A). \nonumber 296 | \end{align*} 297 | Continuous case: 298 | \begin{align*} 299 | E[g(X)]= \int_{-\infty}^{\infty} 1\{x \in A\} f_X(x)dx= \int_{x \in A} f_X(x)dx= P(X \in A). \nonumber 300 | \end{align*} 301 | 302 | \subsection{Some common random variables} 303 | 304 | \textbf{Discrete random variables} 305 | \begin{itemize} 306 | \item $X \sim Bernoulli(p)$ (where $0 \le p \le 1$): one if a coin with heads probability $p$ comes up heads, zero otherwise. 307 | \begin{eqnarray*} 308 | p(x) = \begin{cases} 309 | p & \text{if $x = 1$} \\ 310 | 1-p & \text{if $x = 0$} 311 | \end{cases} 312 | \end{eqnarray*} 313 | \item $X \sim Binomial(n, p)$ (where $0 \le p \le 1$): the number of heads in $n$ independent flips of a coin with heads probability $p$. 314 | \begin{eqnarray*} 315 | p(x) = {n \choose x} p^x(1-p)^{n-x} 316 | \end{eqnarray*} 317 | \item $X \sim Geometric(p)$ (where $p > 0$): the number of flips of a coin with heads probability $p$ until the first heads. 318 | \begin{eqnarray*} 319 | p(x) = p(1 - p)^{x-1} 320 | \end{eqnarray*} 321 | \item $X \sim Poisson(\lambda)$ (where $\lambda > 0$): a probability distribution over the nonnegative integers used for modeling the 322 | frequency of rare events. 323 | \begin{eqnarray*} 324 | p(x) = e^{-\lambda} \frac{\lambda^x}{x!} 325 | \end{eqnarray*} 326 | \end{itemize} 327 | 328 | \textbf{Continuous random variables} 329 | \begin{itemize} 330 | \item $X \sim Uniform(a,b)$ (where $a < b$): equal probability density to every value between $a$ and $b$ on the real line. 331 | \begin{eqnarray*} 332 | f(x) = \begin{cases} 333 | \frac{1}{b - a} & \text{if $a \le x \le b$} \\ 334 | 0 & \text{otherwise} 335 | \end{cases} 336 | \end{eqnarray*} 337 | \item $X \sim Exponential(\lambda)$ (where $\lambda > 0$): decaying probability density over the nonnegative reals. 338 | \begin{eqnarray*} 339 | f(x) = \begin{cases} 340 | \lambda e^{-\lambda x} & \text{if $x \ge 0$} \\ 341 | 0 & \text{otherwise} 342 | \end{cases} 343 | \end{eqnarray*} 344 | \item $X \sim Normal(\mu, \sigma^2)$: also known as the Gaussian distribution 345 | \begin{eqnarray*} 346 | f(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{1}{2\sigma^2} (x - \mu)^2} 347 | \end{eqnarray*} 348 | \end{itemize} 349 | The shapes of the PDFs and CDFs of some of these random variables are shown in Figure \ref{fig2}. 350 | 351 | The following table summarizes some of the properties of these distributions. 352 | \begin{center} 353 | \begin{tabular}{|l|l|l|l|} 354 | \hline 355 | % after \\: \hline or \cline{col1-col2} \cline{col3-col4} ... 356 | Distribution & PDF or PMF & Mean & Variance \\ 357 | \hline 358 | $Bernoulli(p)$ & $\left\{ \begin{array}{ll} 359 | p, & \text{if $x=1$} \\ 360 | 1-p, & \text{if $x=0$.} 361 | \end{array} 362 | \right.$ & $p$ & $p(1-p)$\\ 363 | \hline 364 | $Binomial(n,p)$ & ${n \choose k} p^k(1-p)^{n-k}$ for $ 0 \leq k \leq n$ & $np$ & $np(1-p)$ \\ 365 | \hline 366 | $Geometric(p)$ & $p(1-p)^{k-1}$ \ \ for $k=1,2, \ldots$ & $\frac{1}{p}$ & $\frac{1-p}{p^2}$ \\ 367 | \hline 368 | $Poisson(\lambda)$ & $e^{-\lambda} \lambda^x / x!$ \ \ for $x=0,1,2,\ldots$ & $\lambda$ & $\lambda$ \\ 369 | \hline 370 | $Uniform(a,b)$ & $\frac{1}{b-a} \ \ \forall x \in (a,b)$ & $\frac{a+b}{2}$ & $\frac{(b-a)^2}{12}$ \\ 371 | \hline 372 | $Gaussian(\mu,\sigma^2)$ & $\frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ & $\mu$ & $\sigma^2$ \\ 373 | \hline 374 | $Exponential(\lambda)$ & $\lambda e^{-\lambda x} \ \ x \geq 0, \lambda >0$ & $\frac{1}{\lambda}$ & $ \frac{1}{\lambda ^2}$ \\ 375 | \hline 376 | \end{tabular} 377 | \end{center} 378 | 379 | 380 | \begin{figure} 381 | \begin{center} 382 | \includegraphics[width=9cm]{fig2.png}\\ 383 | \caption{PDF and CDF of a couple of random variables.}\label{fig2} 384 | \end{center} 385 | \end{figure} 386 | 387 | \section{Two random variables} 388 | 389 | Thus far, we have considered single random variables.
In many 390 | situations, however, there may be more than one quantity that we are 391 | interested in knowing during a random experiment. For instance, in an 392 | experiment where we flip a coin ten times, we may care about both $X(\omega) = 393 | \text{the number of heads that come up}$ as well as $Y(\omega) = \text{the 394 | length of the longest run of consecutive heads}$. In this section, we 395 | consider the setting of two random variables. 396 | 397 | \subsection{Joint and marginal distributions} 398 | 399 | Suppose that we have two random variables $X$ and $Y$. One way to work 400 | with these two random variables is to consider each of them 401 | separately. If we do that, we will only need $F_X(x)$ and $F_Y(y)$. 402 | But if we want to know about the values that $X$ and $Y$ assume 403 | simultaneously during outcomes of a random experiment, we require 404 | a more complicated structure known as the 405 | \textbf{joint cumulative distribution function} of $X$ and $Y$, defined by 406 | \begin{equation*} 407 | F_{XY}(x,y)=P(X \leq x, Y \leq y) 408 | \end{equation*} 409 | It can be shown that by knowing the joint 410 | cumulative distribution function, 411 | the probability of any event involving $X$ and $Y$ can be calculated. 412 | 413 | The joint CDF $F_{XY}(x,y)$ and the distribution functions 414 | $F_X(x)$ and $F_Y(y)$ of each variable separately are related by 415 | \begin{eqnarray*} 416 | F_X(x) &=& \lim_{y \rightarrow \infty} F_{XY}(x,y) \\ 417 | F_Y(y) &=& \lim_{x \rightarrow \infty} F_{XY}(x,y). 418 | \end{eqnarray*} 419 | Here, we call $F_X(x)$ and $F_Y(y)$ the \textbf{marginal cumulative distribution functions} 420 | of $F_{XY}(x,y)$. 421 | 422 | \textbf{Properties}: 423 | \begin{description} 424 | \item[-] $0 \leq F_{XY}(x,y) \leq 1$. 425 | \item[-] $\lim_{x,y \rightarrow \infty} F_{XY}(x,y)=1$. 426 | \item[-] $\lim_{x,y \rightarrow -\infty} F_{XY}(x,y)=0$. 427 | \item[-] $ F_X(x)= \lim_{y \rightarrow \infty} F_{XY}(x,y)$. 428 | \end{description} 429 | 430 | \subsection{Joint and marginal probability mass functions} 431 | 432 | If $X$ and $Y$ are discrete random variables, then the \textbf{joint probability mass function} $p_{XY} : \mathbb{R} \times \mathbb{R} \rightarrow [0,1]$ is 433 | defined by 434 | \begin{eqnarray*} 435 | p_{XY}(x,y) = P(X = x, Y = y). 436 | \end{eqnarray*} 437 | Here, $0 \le p_{XY}(x,y) \le 1$ for all $x, y$, and $\sum_{x \in Val(X)} \sum_{y \in Val(Y)} p_{XY}(x,y) = 1$. 438 | 439 | How does the joint PMF over two variables relate to the probability mass function for each variable separately? 440 | It turns out that 441 | \begin{eqnarray*} 442 | p_X(x) = \sum_y p_{XY} (x,y), 443 | \end{eqnarray*} 444 | and similarly for $p_Y(y)$. In this case, we refer to $p_X(x)$ as the 445 | \textbf{marginal probability mass function} of $X$. In statistics, the process of forming the 446 | marginal distribution with respect to one variable by summing out the 447 | other variable is often known as ``marginalization.'' 448 | 449 | \subsection{Joint and marginal probability density functions} 450 | 451 | Let $X$ and $Y$ be two continuous random variables with joint distribution function $F_{XY}$. In the case that 452 | $F_{XY}(x,y)$ is everywhere differentiable in both $x$ and $y$, we can define the \textbf{joint probability density function}, 453 | \begin{equation*} 454 | f_{XY}(x,y)= \frac{ \partial^2 F_{XY}(x,y)}{\partial x \partial y}. 455 | \end{equation*} 456 | Like in the single-dimensional case, $f_{XY}(x,y) \neq P(X = x, Y = y)$, but rather 457 | \begin{equation*} 458 | \iint_{x \in A} f_{XY} (x,y) dx dy = P((X,Y) \in A). 459 | \end{equation*} 460 | Note that the values of the probability density function $f_{XY}(x,y)$ are always nonnegative, but they may be greater than 1. Nonetheless, it must be the 461 | case that $\int_{-\infty}^\infty \int_{-\infty}^\infty f_{XY}(x,y) \, dx \, dy = 1$. 462 | 463 | Analogous to the discrete case, we define 464 | \begin{eqnarray*} 465 | f_X(x) = \int_{-\infty}^{\infty}f_{XY}(x,y)dy, 466 | \end{eqnarray*} 467 | as the \textbf{marginal probability density function} (or \textbf{marginal 468 | density}) of $X$, and similarly for $f_Y(y)$. 469 | 470 | \subsection{Conditional distributions} 471 | 472 | Conditional distributions seek to answer the question, what is the probability distribution over $Y$, when we know 473 | that $X$ must take on a certain value $x$? 474 | In the discrete case, the conditional probability mass 475 | function of $Y$ given $X$ is simply 476 | \begin{equation*} 477 | p_{Y|X}(y|x) = \frac{p_{XY}(x,y)}{p_X(x)}, 478 | \end{equation*} 479 | assuming that $p_X(x) \neq 0$. 480 | 481 | In the continuous case, the situation is technically a little more complicated 482 | because the probability that a continuous random variable $X$ takes on a specific value $x$ is equal to zero\footnote{ 483 | To get around this, a more reasonable way to calculate the 484 | conditional CDF is, 485 | \begin{equation*} 486 | F_{Y|X}(y,x)= \lim_{\Delta x \rightarrow 0} P(Y \leq y | x \leq X \leq x + \Delta x). 487 | \end{equation*} 488 | It can be easily seen that if $F(x,y)$ is differentiable in both $x,y$ then, 489 | 490 | \begin{equation*} 491 | F_{Y|X}(y,x)=\int_{-\infty}^{y} \frac{f_{X,Y}(x,\alpha)}{f_X(x)}d\alpha 492 | \end{equation*} 493 | and therefore we define the conditional PDF of $Y$ given $X=x$ in the following way, 494 | \begin{equation*} 495 | f_{Y|X}(y|x)= \frac{f_{XY}(x,y)}{f_X(x)} 496 | \end{equation*} 497 | }. Ignoring this technical point, we simply define, by analogy to the discrete case, 498 | the \emph{conditional probability density} of $Y$ given $X = x$ to be 499 | \begin{equation*} 500 | f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}, 501 | \end{equation*} 502 | provided $f_X(x) \neq 0$. 503 | 504 | \subsection{Bayes's rule} 505 | A useful formula that often arises when trying to derive an expression for the conditional probability 506 | of one variable given another is \textbf{Bayes's rule}. 507 | 508 | In the case of discrete random variables $X$ and $Y$, 509 | \begin{equation*} 510 | P_{Y|X}(y|x)=\frac{P_{XY}(x,y)}{P_X(x)}=\frac{P_{X|Y}(x|y)P_Y(y)}{\sum_{y' \in Val(Y)} P_{X|Y}(x|y')P_Y(y')}. 511 | \end{equation*} 512 | 513 | If the random variables $X$ and $Y$ are continuous, 514 | \begin{equation*} 515 | f_{Y|X}(y|x)=\frac{f_{XY}(x,y)}{f_X(x)}=\frac{f_{X|Y}(x|y)f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y')f_Y(y')dy'}. 516 | \end{equation*} 517 | 518 | \subsection{Independence} 519 | 520 | Two random variables $X$ and $Y$ are \textbf{independent} if $F_{XY}(x,y) = F_X(x) F_Y(y)$ for all values of 521 | $x$ and $y$. Equivalently, 522 | \begin{itemize} 523 | \item For discrete random variables, $p_{XY}(x,y) = p_X(x) p_Y(y)$ for all $x \in Val(X)$, $y \in Val(Y)$. 524 | \item For discrete random variables, $p_{Y|X}(y|x) = p_Y(y)$ whenever $p_X(x) \neq 0$ for all $y \in Val(Y)$.
525 | \item For continuous random variables, $f_{XY}(x,y) = f_X(x) f_Y(y)$ for all $x,y \in \mathbb{R}$. 526 | \item For continuous random variables, $f_{Y|X}(y|x) = f_Y(y)$ whenever $f_X(x) \neq 0$ for all $y \in \mathbb{R}$. 527 | \end{itemize} 528 | 529 | Informally, two random variables $X$ and $Y$ are \textbf{independent} 530 | if ``knowing'' the value of one variable will never have any effect on 531 | the conditional probability distribution of the other variable, that 532 | is, you know all the information about the pair $(X,Y)$ by just 533 | knowing $f(x)$ and $f(y)$. The following lemma formalizes this 534 | observation: 535 | \begin{lem} 536 | If $X$ and $Y$ are independent then for any subsets $A, B \subseteq \mathbb{R}$, we have, 537 | \begin{eqnarray*} 538 | P(X \in A, Y \in B)=P(X \in A) P(Y \in B) 539 | \end{eqnarray*} 540 | \end{lem} 541 | By using the above lemma one can prove that if $X$ is independent of $Y$ then any function of $X$ is independent of any function of $Y$. 542 | 543 | \subsection{Expectation and covariance} 544 | 545 | Suppose that we have two discrete random variables $X, Y$ and $g: \mathbf{R}^2 \longrightarrow \mathbf{R}$ is a function of these two random variables. Then the expected value of $g$ is defined in the following way, 546 | \begin{equation*} 547 | E[g(X,Y)] \triangleq \sum_{x \in Val(X)} \sum_{y \in Val(Y)} g(x,y) p_{XY}(x,y). 548 | \end{equation*} 549 | For continuous random variables $X,Y$, the analogous expression is 550 | \begin{equation*} 551 | E[g(X,Y)]= \int_{-\infty}^{\infty} \int_{- \infty}^{\infty} g(x,y) f_{XY}(x,y) dxdy. 552 | \end{equation*} 553 | 554 | We can use the concept of expectation to study the relationship of two random variables with each other. In particular, the \textbf{covariance} of two random variables $X$ and $Y$ is defined as 555 | \begin{eqnarray*} 556 | Cov[X,Y] 557 | &\triangleq& E[(X-E[X])(Y-E[Y])] \\ 558 | \end{eqnarray*} 559 | Using an argument similar to that for variance, we can rewrite this as, 560 | \begin{eqnarray*} 561 | Cov[X,Y] 562 | &=& E[(X-E[X])(Y-E[Y])] \\ 563 | &=& E[XY - X E[Y] - Y E[X] + E[X] E[Y]] \\ 564 | &=& E[XY] - E[X] E[Y] - E[Y] E[X] + E[X] E[Y]] \\ 565 | &=& E[XY] - E[X] E[Y]. 566 | \end{eqnarray*} 567 | Here, the key step in showing the equality of the two forms of covariance is in the third equality, where we use the fact that $E[X]$ and $E[Y]$ are actually constants which 568 | can be pulled out of the expectation. When $Cov[X,Y] = 0$, we say that $X$ and $Y$ are \textbf{uncorrelated}\footnote{ 569 | However, this is not the same thing as stating that $X$ and $Y$ are independent! For example, if $X \sim Uniform(-1,1)$ and $Y = X^2$, then one can show that $X$ and $Y$ are 570 | uncorrelated, even though they are not independent. 571 | }. 572 | 573 | \textbf{Properties}: 574 | \begin{description} 575 | \item[-] (Linearity of expectation) $E[f(X,Y) + g(X,Y)] = E[f(X,Y)] + E[g(X,Y)]$. 576 | \item[-] $Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]$. 577 | \item[-] If $X$ and $Y$ are independent, then $Cov[X, Y] = 0$. 578 | \item[-] If $X$ and $Y$ are independent, then $E[f(X)g(Y)] = E[f(X)] E[g(Y)]$. 579 | \end{description} 580 | 581 | \section{Multiple random variables} 582 | 583 | The notions and ideas introduced in the previous section can be 584 | generalized to more than two random variables. In particular, suppose that we have $n$ continuous 585 | random variables, $X_1(\omega),X_2(\omega), \ldots X_n(\omega)$. 
In this section, for simplicity of presentation, 586 | we focus only on the continuous case, but the generalization to discrete random variables works similarly. 587 | 588 | \subsection{Basic properties} 589 | 590 | We can define the \textbf{joint distribution function} of $X_1,X_2,\ldots,X_n$, 591 | the \textbf{joint probability density function} of $X_1,X_2,\ldots,X_n$, 592 | the \textbf{marginal probability density function} of $X_1$, and 593 | the \textbf{conditional probability density function} of $X_1$ given $X_2,\ldots,X_n$, as 594 | 595 | \begin{eqnarray*} 596 | F_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n) &=& P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n) \\ 597 | f_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n) &=& \frac{\partial^n F_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n)}{\partial x_1 \ldots \partial x_n} \\ 598 | f_{X_1}(x_1) &=& \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty f_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n) dx_2 \ldots dx_n \\ 599 | f_{X_1 | X_2, \ldots, X_n}(x_1|x_2, \ldots x_n) &=& \frac{f_{X_1, X_2, \ldots, X_n}(x_1,x_2, \ldots x_n)}{f_{X_2, \ldots, X_n}(x_2, \ldots x_n)} 600 | \end{eqnarray*} 601 | 602 | To calculate the probability of an event $A \subseteq \mathbb{R}^n$ we have, 603 | \begin{equation} 604 | P((x_1,x_2, \ldots x_n) \in A)= \int_{(x_1,x_2, \ldots x_n) \in A} f_{X_1,X_2,\ldots,X_n}(x_1,x_2, \ldots x_n)dx_1dx_2 \ldots dx_n 605 | \end{equation} 606 | 607 | \textbf{Chain rule}: From the 608 | definition of conditional probabilities for multiple random variables, one can show that 609 | \begin{eqnarray*} 610 | f(x_1,x_2,\ldots,x_n) &=& f(x_n|x_1,x_2\ldots,x_{n-1}) f(x_1,x_2\ldots,x_{n-1}) \\ 611 | &=& f(x_n|x_1,x_2\ldots,x_{n-1}) f(x_{n-1} | x_1,x_2\ldots,x_{n-2}) f(x_1,x_2\ldots,x_{n-2}) \\ 612 | &=& \ldots\;\; =\;\; f(x_1) \prod_{i=2}^n f(x_i | x_1,\ldots,x_{i-1}). 613 | \end{eqnarray*} 614 | 615 | \textbf{Independence}: 616 | For multiple events, $A_1,\ldots,A_k$, 617 | we say that $A_1,\ldots,A_k$ are \textbf{mutually independent} if for any 618 | subset $S \subseteq \{1,2,\ldots,k\}$, we have 619 | \begin{eqnarray*} 620 | P(\cap_{i \in S} A_i) = \prod_{i \in S} P(A_i). 621 | \end{eqnarray*} 622 | Likewise, we say that random variables $X_1,\ldots,X_n$ are independent if 623 | \begin{eqnarray*} 624 | f(x_1,\ldots,x_n) = f(x_1)f(x_2) \cdots f(x_n). 625 | \end{eqnarray*} 626 | Here, the definition of mutual independence is simply the natural generalization of independence of two random variables to multiple random variables. 627 | 628 | Independent random variables arise often in machine learning 629 | algorithms where we assume that the training examples belonging to the 630 | training set represent independent samples from some unknown 631 | probability distribution. To make the significance of independence clear, consider a ``bad'' 632 | training set in which we first sample a single training example 633 | $(x^{(1)},y^{(1)})$ from some unknown distribution, and then add 634 | $m-1$ copies of the exact same training example to the training set. 635 | In this case, we have (with some abuse of notation) 636 | \begin{equation*} 637 | P( (x^{(1)},y^{(1)}), \ldots, (x^{(m)},y^{(m)}) ) \neq \prod_{i=1}^m P(x^{(i)},y^{(i)}). 638 | \end{equation*} 639 | Despite the fact that the training set has size $m$, the examples are 640 | not independent!
While clearly the procedure described here is not a 641 | sensible method for building a training set for a machine learning 642 | algorithm, it turns out that in practice, non-independence of samples 643 | does come up often, and it has the effect of reducing the ``effective 644 | size'' of the training set. 645 | 646 | \subsection{Random vectors} 647 | Suppose that we have $n$ random variables. When working with all these 648 | random variables together, we will often find it convenient to put 649 | them in a vector $X=[X_1\; X_2\; \ldots\; X_n]^T$. We call the 650 | resulting vector a \textbf{random vector} (more formally, a random 651 | vector is a mapping from $\Omega$ to $\mathbb{R}^n$). It should be 652 | clear that random vectors are simply an alternative notation for dealing with $n$ 653 | random variables, so the notions of joint PDF and CDF will apply to 654 | random vectors as well. 655 | 656 | \textbf{Expectation}: 657 | Consider an 658 | arbitrary function from $g:\mathbb{R}^n \rightarrow \mathbb{R} $. The expected value of this function is defined as 659 | \begin{equation} 660 | E[g(X)]=\int_{\mathbb{R}^n} g(x_1,x_2, \ldots, x_n)f_{X_1,X_2,\ldots,X_n} (x_1,x_2, \ldots x_n) dx_1 dx_2 \ldots dx_n, 661 | \end{equation} 662 | where $\int_{\mathbb{R}^n}$ is $n$ consecutive integrations from $-\infty$ to $\infty$. If $g$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$, then the expected value of $g$ is the element-wise expected values of the output vector, i.e., if $g$ is 663 | \begin{eqnarray*} 664 | g\mathbf(x) = 665 | \begin{bmatrix} 666 | g_1\mathbf(x) \\ 667 | g_2\mathbf(x) \\ 668 | \vdots\\ 669 | g_m\mathbf(x) 670 | \end{bmatrix}, 671 | \end{eqnarray*} 672 | Then, 673 | \begin{eqnarray*} 674 | E[g\mathbf(X)] = 675 | \begin{bmatrix} 676 | E[g_1\mathbf(X)] \\ 677 | E[g_2\mathbf(X)] \\ 678 | \vdots\\ 679 | E[g_m\mathbf(X)] 680 | \end{bmatrix}. 681 | \end{eqnarray*} 682 | 683 | \textbf{Covariance matrix}: For a given random vector 684 | $X : \Omega \rightarrow \mathbb{R}^n$, its covariance matrix $\Sigma$ 685 | is the $n \times n$ square matrix whose entries are given by $\Sigma_{ij} = 686 | Cov[X_i,X_j]$. 687 | 688 | From the definition of covariance, we have 689 | \begin{eqnarray*} 690 | \Sigma 691 | &=& 692 | \begin{bmatrix} 693 | Cov[X_1,X_1] & \cdots & Cov[X_1,X_n] \\ 694 | \vdots & \ddots & \vdots \\ 695 | Cov[X_n,X_1] & \cdots & Cov[X_n,X_n] \\ 696 | \end{bmatrix} \\ 697 | &=& 698 | \begin{bmatrix} 699 | E[X_1^2] - E[X_1]E[X_1] & \cdots & E[X_1X_n] - E[X_1]E[X_n] \\ 700 | \vdots & \ddots & \vdots \\ 701 | E[X_nX_1] - E[X_n]E[X_1] & \cdots & E[X_n^2] - E[X_n]E[X_n] \\ 702 | \end{bmatrix} \\ 703 | &=& 704 | \begin{bmatrix} 705 | E[X_1^2] & \cdots & E[X_1X_n] \\ 706 | \vdots & \ddots & \vdots \\ 707 | E[X_nX_1] & \cdots & E[X_n^2] \\ 708 | \end{bmatrix} - 709 | \begin{bmatrix} 710 | E[X_1]E[X_1] & \cdots & E[X_1]E[X_n] \\ 711 | \vdots & \ddots & \vdots \\ 712 | E[X_n]E[X_1] & \cdots & E[X_n]E[X_n] \\ 713 | \end{bmatrix} \\ 714 | &=& E[XX^T] - E[X]E[X]^T = \ldots = E[(X - E[X])(X - E[X])^T]. 715 | \end{eqnarray*} 716 | where the matrix expectation is defined in the obvious way. 717 | 718 | The covariance matrix has a number of useful properties: 719 | \begin{description} 720 | \item[-] $\Sigma \succeq 0$; that is, $\Sigma$ is positive semidefinite. 721 | \item[-] $\Sigma = \Sigma^T$; that is, $\Sigma$ is symmetric. 
722 | \end{description} 723 | 724 | \subsection{The multivariate Gaussian distribution} 725 | 726 | One particularly important example of a probability distribution over 727 | random vectors $X$ is called the \textbf{multivariate Gaussian} or 728 | \textbf{multivariate normal} distribution. A random vector $X \in 729 | \mathbb{R}^n$ is said to have a multivariate normal (or Gaussian) 730 | distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix 731 | $\Sigma \in \mathbb{S}^{n}_{++}$ (where $\mathbb{S}^{n}_{++}$ refers 732 | to the space of symmetric positive definite $n \times n$ matrices) if its probability density function is given by 733 | \begin{align*} 734 | f_{X_1,X_2,\ldots,X_n}(x_1,x_2,\ldots,x_n;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp \left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right). 735 | \end{align*} 736 | We write this as $X \sim \mathcal{N}(\mu,\Sigma)$. 737 | Notice that in the case $n=1$, this reduces to the regular definition of 738 | a normal distribution with mean parameter $\mu_1$ and variance 739 | $\Sigma_{11}$. 740 | 741 | Generally speaking, Gaussian random variables are extremely useful in 742 | machine learning and statistics for two main reasons. First, they are 743 | extremely common when modeling ``noise'' in statistical algorithms. 744 | Quite often, noise can be considered to be the accumulation of a large 745 | number of small independent random perturbations affecting the 746 | measurement process; by the Central Limit Theorem, summations of 747 | independent random variables will tend to ``look Gaussian.'' Second, 748 | Gaussian random variables are convenient for many analytical 749 | manipulations, because many of the integrals involving Gaussian 750 | distributions that arise in practice have simple closed form 751 | solutions. We will encounter this later in the course. 752 | 753 | \section{Other resources} 754 | 755 | A good textbook on probability at the level needed for CS229 is the book, 756 | \textit{A First Course in Probability} by Sheldon Ross. 757 | 758 | \end{document} 759 | -------------------------------------------------------------------------------- /Linear Algebra Review/linalg2.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{article} 2 | \usepackage{geometry} 3 | \geometry{letterpaper, height=8.5in,width=6.5in} 4 | \usepackage{graphicx} 5 | \usepackage{amsmath} 6 | \usepackage{amssymb} 7 | \usepackage{pstricks} 8 | 9 | 10 | 11 | \def\shownotes{0} %set 1 to show author notes 12 | \ifnum\shownotes=1 13 | \newcommand{\authnote}[2]{$\ll$\textsf{\footnotesize #1 notes: #2}$\gg$} 14 | \else 15 | \newcommand{\authnote}[2]{} 16 | \fi 17 | \newcommand{\Tnote}[1]{{\color{blue}\authnote{Tengyu}{#1}}} 18 | 19 | \newcommand{\commentout}[1]{} 20 | 21 | \title{Linear Algebra Review and Reference} 22 | \author{Zico Kolter (updated by Chuong Do and Tengyu Ma)} 23 | 24 | \begin{document} 25 | 26 | \maketitle 27 | \tableofcontents 28 | 29 | \section{Basic Concepts and Notation} 30 | Linear algebra provides a way of compactly representing and operating 31 | on sets of linear equations. For example, consider the following 32 | system of equations: 33 | \[\begin{array}{rcrcl} 4 x_1 & - & 5 x_2 & = & -13 \\ 34 | -2 x_1 & + & 3 x_2 & = & 9.
\end{array}\\ \] 35 | 36 | This is two equations and two variables, so as you know from high 37 | school algebra, you can find a unique solution for $x_1$ and $x_2$ (unless 38 | the equations are somehow degenerate, for example if the second 39 | equation is simply a multiple of the first, but in the case above 40 | there is in fact a unique solution). In matrix notation, we can write 41 | the system more compactly as 42 | \[Ax = b\] 43 | with 44 | \[A = \left [ \begin{array}{cc} 4 & -5 \\ -2 & 3 45 | \end{array} \right ], \quad b = \left [ \begin{array}{c} -13 \\ 46 | 9 \end{array} \right ]. \] 47 | 48 | As we will see shortly, there are many advantages (including the 49 | obvious space savings) to analyzing linear equations in this form. 50 | 51 | \subsection{Basic Notation} 52 | 53 | We use the following notation: 54 | 55 | \begin{itemize} 56 | 57 | \item By $A \in \mathbb{R}^{m \times n}$ we denote a matrix with $m$ rows 58 | and $n$ columns, where the entries of $A$ are real numbers. 59 | 60 | \item By $x \in \mathbb{R}^n$, we denote a vector with $n$ entries. 61 | By convention, an $n$-dimensional vector is often thought of as a 62 | matrix with $n$ rows and 1 column, known as a \textbf{\textit{column vector}}. 63 | If we want to explicitly 64 | represent a \textbf{\textit{row vector}} --- a matrix with 1 row and 65 | $n$ columns --- we typically write $x^T$ (here $x^T$ denotes the 66 | transpose of $x$, which we will define shortly). 67 | 68 | \item The $i$th element of a vector $x$ is denoted $x_i$: \[ x = \left 69 | [ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right 70 | ]. \] 71 | 72 | \item We use the notation $a_{ij}$ (or $A_{ij}$, 73 | $A_{i,j}$, etc) to denote the entry of $A$ in the $i$th row 74 | and $j$th column: \[A = \left [ \begin{array}{cccc} a_{11} & 75 | a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ 76 | \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & 77 | a_{mn} \end{array} \right ].\] 78 | 79 | \item We denote the $j$th column of $A$ by $a^j$ or $A_{:,j}$: \[ A = 80 | \left [ \begin{array}{cccc} | & | & & 81 | | \\ a^1 & a^2 & \cdots & a^n \\ | & | & & | 82 | \end{array} \right ]. \] 83 | 84 | \item We denote the $i$th row of $A$ by $a^T_i$ or $A_{i,:}$: \[ A = \left 85 | [ \begin{array}{ccc} \mbox{---} & a^T_1 & 86 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 87 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ]. \] 88 | 89 | \item Viewing a matrix as a collection of column or row vectors is very important and convenient in many cases. In general, it would be mathematically (and conceptually) cleaner to operate on the level of vectors instead of scalars. There is no universal convention for denoting the columns or rows of a matrix, and thus you can feel free to change the notations as long as it's explicitly defined. %We also note that the notation here for the columns or rows of a matrix is not necessarily universally adopted, therefore 90 | 91 | \end{itemize} 92 | 93 | \section{Matrix Multiplication} 94 | 95 | The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in 96 | \mathbb{R}^{n \times p}$ is the matrix \[C = AB \in \mathbb{R}^{m 97 | \times p},\] where \[C_{ij} = \sum_{k=1}^n A_{ik}B_{kj}.\] Note 98 | that in order for the matrix product to exist, the number of columns 99 | in $A$ must equal the number of rows in $B$. 
There are 100 | many other ways of looking at matrix multiplication that may be more convenient and insightful than the standard definition above, and we'll start by 101 | examining a few special cases. 102 | 103 | \subsection{Vector-Vector Products} 104 | 105 | Given two vectors $x, y \in \mathbb{R}^n$, the quantity $x^T y$, 106 | sometimes called the \textbf{\textit{inner product}} or 107 | \textbf{\textit{dot product}} of 108 | the vectors, is a real number given by 109 | \[x^T y \in \mathbb{R} = 110 | \left [ \begin{array}{cccc} x_1 & x_2 & \cdots & x_n \end{array} \right ] 111 | \left [ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right ] 112 | = \sum_{i=1}^n x_i y_i.\] 113 | Observe that inner products are really just special case of matrix multiplication. 114 | Note that it is always the case that $x^T y = y^T x$. 115 | 116 | Given vectors $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$ (not necessarily 117 | of the same size), $x y^T \in \mathbb{R}^{m \times n}$ is called the \textbf{\textit{outer 118 | product}} of the vectors. It is a matrix whose entries are given by 119 | $(x y^T)_{ij} = x_i y_j$, i.e., 120 | \[ x y^T \in \mathbb{R}^{m \times n} 121 | = \left [ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_m \end{array} \right ] 122 | \left [ \begin{array}{cccc} y_1 & y_2 & \cdots & y_n \end{array} \right ] 123 | = \left [ \begin{array}{cccc}x_1 124 | y_1 & x_1 y_2 & \cdots & x_1 125 | y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & 126 | \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n 127 | \end{array} \right ]. \] 128 | As an example of how the outer product can be useful, let $\mathbf{1} \in \mathbb{R}^n$ 129 | denote an $n$-dimensional vector whose entries are all equal to 1. Furthermore, 130 | consider the matrix $A \in \mathbb{R}^{m \times n}$ whose columns are all 131 | equal to some vector $x \in \mathbb{R}^m$. Using outer products, we can 132 | represent $A$ compactly as, 133 | \[A = \left [ \begin{array}{cccc} | & | & & 134 | | \\ x & x & \cdots & x \\ | & | & & | 135 | \end{array} \right ] 136 | = \left [ \begin{array}{cccc}x_1 137 | & x_1 & \cdots & x_1 138 | \\ x_2 & x_2 & \cdots & x_2 \\ \vdots & \vdots & 139 | \ddots & \vdots \\ x_m & x_m & \cdots & x_m 140 | \end{array} \right ] 141 | = \left [ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_m \end{array} \right ] 142 | \left [ \begin{array}{cccc} 1 & 1 & \cdots & 1 \end{array} \right ] 143 | = x \mathbf{1}^T. \] 144 | 145 | 146 | \subsection{Matrix-Vector Products} 147 | 148 | Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in 149 | \mathbb{R}^n$, their product is a vector $y = Ax \in \mathbb{R}^m$. 150 | There are a couple ways of looking at matrix-vector multiplication, 151 | and we will look at each of them in turn. 152 | 153 | If we write $A$ by rows, then we can express $Ax$ as, 154 | \[ y = Ax = \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & 155 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 156 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] x = \left [ 157 | \begin{array}{c} a^T_1 x \\ a^T_2 x \\ \vdots \\ a^T_m x 158 | \end{array} \right ]. \] 159 | In other words, the $i$th entry of $y$ is equal to the inner 160 | product of the $i$th \textit{row} of $A$ and $x$, $y_i = a_i^T x$. 161 | 162 | Alternatively, let's write $A$ in column form. 
In this case we see 163 | that, 164 | \begin{align} 165 | y = Ax = \left [ \begin{array}{cccc} | & | & & 166 | | \\ a^1 & a^2 & \cdots & a^n \\ | & | & & | 167 | \end{array} \right ] \left [ \begin{array}{c} x_1 \\ x_2 \\ \vdots 168 | \\ x_n \end{array} \right ] = \left [ \begin{array}{c} \\ a^1 169 | \\ \\ \end{array} \right ] x_1 + \left [ \begin{array}{c} \\ a^2 170 | \\ \\ \end{array} \right ] x_2 + \ldots + \left [ \begin{array}{c} 171 | \\ a^n \\ \\ \end{array} \right ] x_n \;\;. \label{eqn:2} 172 | \end{align} 173 | 174 | 175 | In other words, y is a \textbf{\textit{linear combination}} of the 176 | \textit{columns} of $A$, where the coefficients of the linear 177 | combination are given by the entries of $x$. 178 | 179 | So far we have been multiplying on the right by a column vector, but 180 | it is also possible to multiply on the left by a row vector. This is 181 | written, $y^T = x^T A$ for $A \in \mathbb{R}^{m \times n}$, $x \in 182 | \mathbb{R}^m$, and $y \in \mathbb{R}^n$. As before, we can express 183 | $y^T$ in two obvious ways, depending on whether we express $A$ in 184 | terms on its rows or columns. In the first case we express $A$ in 185 | terms of its columns, which gives 186 | \[y^T = x^T A = x^T \left [ \begin{array}{cccc} | & | & & 187 | | \\ a^1 & a^2 & \cdots & a^n \\ | & | & & | 188 | \end{array} \right ] = \left [ 189 | \begin{array}{cccc} x^T a^1 & x^T a^2 & \cdots & x^T a^n \end{array} 190 | \right ] \] 191 | which demonstrates that the $i$th entry of $y^T$ is equal to the inner 192 | product of $x$ and the $i$th \textit{column} of $A$. 193 | 194 | Finally, expressing $A$ in terms of rows we get the final 195 | representation of the vector-matrix product, 196 | \begin{eqnarray*} 197 | y^T & = & x^T A \\ 198 | & = & \left [ \begin{array}{cccc}x_1 & x_2 & \cdots & x_m 199 | \end{array} \right ] \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & 200 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 201 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] \\ & & \\ 202 | & = & x_1 \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & \mbox{---} 203 | \end{array} \right ] + x_2 \left [ \begin{array}{ccc} \mbox{---} & 204 | a^T_2 & \mbox{---} \end{array} \right ] + ... + x_m \left [ 205 | \begin{array}{ccc} \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] 206 | \end{eqnarray*} 207 | so we see that $y^T$ is a linear combination of the \textit{rows} of 208 | $A$, where the coefficients for the linear combination are given by 209 | the entries of $x$. 210 | 211 | \subsection{Matrix-Matrix Products}\label{subsec:matrix-matrix} 212 | 213 | Armed with this knowledge, we can now look at four different (but, of 214 | course, equivalent) ways of 215 | viewing the matrix-matrix multiplication $C = AB$ as defined at the 216 | beginning of this section. 217 | 218 | First, we can view matrix-matrix 219 | multiplication as a set of vector-vector products. The most obvious 220 | viewpoint, which follows 221 | immediately from the definition, is that the $(i,j)$th entry of $C$ is 222 | equal to the inner product of the $i$th row of $A$ and the $j$th column 223 | of $B$. 
Symbolically, this looks like the following,
224 | \[C = AB = \left [ \begin{array}{ccc} \mbox{---} & a^T_1 &
225 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\
226 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] \left [
227 | \begin{array}{cccc} | & | & & | \\ b^1 & b^2 & \cdots & b^p \\ | &
228 | | & & | \end{array} \right ] = \left [ \begin{array}{cccc}a^T_1 b^1
229 | & a^T_1 b^2 & \cdots & a^T_1 b^p \\ a^T_2 b^1 & a^T_2 b^2 & \cdots &
230 | a^T_2 b^p \\ \vdots & \vdots & \ddots & \vdots \\ a^T_m b^1 &
231 | a^T_m b^2 & \cdots & a^T_m b^p \end{array} \right ]. \]
232 | Remember that since $A \in \mathbb{R}^{m \times n}$ and $B \in
233 | \mathbb{R}^{n \times p}$, $a_i \in \mathbb{R}^n$ and $b^j \in
234 | \mathbb{R}^n$, so these inner products all make sense. This is the
235 | most ``natural'' representation when we represent $A$ by
236 | rows and $B$ by columns.
237 | Alternatively, we can represent $A$ by
238 | columns, and $B$ by rows. This representation leads to a much trickier interpretation of $AB$ as
239 | a sum of outer products. Symbolically,
240 | \[C = AB = \left [ \begin{array}{cccc} | & | & & | \\ a^1 & a^2 &
241 | \cdots & a^n \\ | & | & & | \end{array} \right ] \left [
242 | \begin{array}{ccc} \mbox{---} & b^T_1 & \mbox{---} \\ \mbox{---} &
243 | b^T_2 & \mbox{---} \\ & \vdots & \\ \mbox{---} & b^T_n &
244 | \mbox{---} \end{array} \right ] = \sum_{i=1}^n a^i
245 | b_i^T \;\;.\]
246 | Put another way, $AB$ is equal to the sum, over all $i$, of the outer
247 | product of the $i$th
248 | column of $A$ and the $i$th row of $B$. Since, in this case, $a^i \in
249 | \mathbb{R}^m$ and $b_i \in \mathbb{R}^p$, the dimension of the
250 | outer product $a^i b^T_i$ is $m \times p$, which coincides with the
251 | dimension of $C$. Chances are, the last equality above may appear
252 | confusing to you. If so, take the time to check it for yourself!
253 |
254 | Second, we can also view matrix-matrix multiplication as a set of
255 | matrix-vector products. Specifically, if we represent $B$ by columns,
256 | we can view the columns of $C$ as matrix-vector products between $A$
257 | and the columns of $B$. Symbolically,
258 | \begin{align}
259 | C = AB = A \left [ \begin{array}{cccc} | & | & & | \\ b^1 & b^2 &
260 | \cdots & b^p \\ | & | & & | \end{array} \right ] = \left [
261 | \begin{array}{cccc} | & | & & | \\ A b^1 & A b^2 & \cdots & A b^p \\ |
262 | & | & & | \end{array} \right ]. \label{eqn:1}
263 | \end{align}
264 | Here the $i$th column of $C$ is given by the matrix-vector product
265 | with the vector on the right, $c^{i} = A b^{i}$. These matrix-vector
266 | products can in turn be interpreted using both viewpoints given in the
267 | previous subsection. Finally, we have the analogous viewpoint, where
268 | we represent $A$ by rows, and view the rows of $C$ as the
269 | matrix-vector product between the rows of $A$ and the matrix $B$. Symbolically,
270 | \[C = AB = \left [ \begin{array}{ccc} \mbox{---} & a^T_1 &
271 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\
272 | \mbox{---} & a^T_m & \mbox{---} \end{array} \right ] B = \left [
273 | \begin{array}{ccc} \mbox{---} & a^T_1 B & \mbox{---} \\ \mbox{---}
274 | & a^T_2 B & \mbox{---} \\ & \vdots & \\
275 | \mbox{---} & a^T_m B & \mbox{---} \end{array} \right ]. \]
276 | Here the $i$th row of $C$ is given by the matrix-vector product with
277 | the vector on the left, $c_i^T = a_i^T B$.
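If it helps to see these viewpoints concretely, the following short NumPy snippet (purely illustrative; the dimensions and the random seed are arbitrary choices) checks that the inner-product, outer-product, and column viewpoints above all reproduce the same product $C = AB$:
\begin{verbatim}
import numpy as np

m, n, p = 3, 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

C = A @ B                                  # the standard definition

# (i,j) entry = inner product of the i-th row of A and the j-th column of B
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(p)] for i in range(m)])

# sum over k of the outer product of the k-th column of A and the k-th row of B
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(n))

# the j-th column of C is A times the j-th column of B
C_cols = np.column_stack([A @ B[:, j] for j in range(p)])

assert np.allclose(C, C_inner)
assert np.allclose(C, C_outer)
assert np.allclose(C, C_cols)
\end{verbatim}
Each assertion corresponds to one of the viewpoints above; writing them out this way is often the quickest way to convince yourself that the equalities hold.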
278 |
279 | It may seem like overkill to dissect matrix multiplication to such a
280 | large degree, especially when all these viewpoints follow immediately
281 | from the initial definition we gave (in about a line of math) at the
282 | beginning of this section. {\bf The direct advantage of these various viewpoints is that they allow you to operate on the level/unit of vectors instead of scalars.} To fully understand linear algebra without getting lost in complicated manipulations of indices, the key is to operate with the largest units possible, i.e., to work with whole vectors and matrices rather than with their individual scalar entries. \footnote{E.g., if you can write all your math derivations in terms of matrices or vectors, that is better than doing them with scalar elements.}
283 | \Tnote{I am not sure what's the best way to write this philosophical remark about doing math.. Any better idea?}
284 |
285 | Virtually all of linear algebra
286 | deals with matrix multiplications of some kind, and it is worthwhile
287 | to spend some time trying to develop an intuitive understanding of the
288 | viewpoints presented here.
289 |
290 | In addition to this, it is useful to know a few basic properties of
291 | matrix multiplication at a higher level:
292 | \begin{itemize}
293 | \item Matrix multiplication is associative: $(AB)C = A(BC)$.
294 | \item Matrix multiplication is distributive: $A(B + C) = AB + AC$.
295 | \item Matrix multiplication is, in general, \textit{not} commutative;
296 | that is, it can be the case that $AB \neq BA$. (For example,
297 | if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times q}$,
298 | the matrix product $BA$ does not even exist if $m$ and $q$ are not equal!)
299 | \end{itemize}
300 |
301 | If you are not familiar with these properties, take the time to verify
302 | them for yourself. For example, to check the associativity of matrix
303 | multiplication, suppose that $A \in \mathbb{R}^{m \times n}$,
304 | $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{p \times q}$.
305 | Note that $AB \in \mathbb{R}^{m \times p}$, so $(AB)C \in \mathbb{R}^{m \times q}$.
306 | Similarly, $BC \in \mathbb{R}^{n \times q}$, so $A(BC) \in \mathbb{R}^{m \times q}$.
307 | Thus, the dimensions of the resulting matrices agree. To show that
308 | matrix multiplication is associative, it suffices to check that the $(i,j)$th
309 | entry of $(AB)C$ is equal to the $(i,j)$th entry of $A(BC)$. We can verify
310 | this directly using the definition of matrix multiplication:
311 | \begin{eqnarray*}
312 | ((AB)C)_{ij} &=& \sum_{k=1}^p (AB)_{ik}C_{kj} = \sum_{k=1}^p \left(\sum_{l=1}^n A_{il} B_{lk}\right) C_{kj} \\
313 | &=& \sum_{k=1}^p \left(\sum_{l=1}^n A_{il} B_{lk} C_{kj} \right) = \sum_{l=1}^n \left(\sum_{k=1}^p A_{il} B_{lk} C_{kj} \right) \\
314 | &=& \sum_{l=1}^n A_{il} \left(\sum_{k=1}^p B_{lk} C_{kj}\right) = \sum_{l=1}^n A_{il} (BC)_{lj} = (A(BC))_{ij}.
315 | \end{eqnarray*}
316 | Here, the first two and last two equalities simply use the definition of matrix multiplication,
317 | the third and fifth equalities use the distributive property for \textit{scalar multiplication over addition}, and
318 | the fourth equality uses the \textit{commutativity and associativity of scalar addition}. This technique for proving
319 | matrix properties by reduction to simple scalar properties will come up often, so make sure you're familiar with it.
320 |
321 | \section{Operations and Properties}
322 |
323 | In this section we present several operations and properties of
324 | matrices and vectors.
Hopefully a great deal of this will be review 325 | for you, so the notes can just serve as a reference for these topics. 326 | 327 | \subsection{The Identity Matrix and Diagonal Matrices} 328 | The \textbf{\textit{identity matrix}}, denoted $I \in \mathbb{R}^{n 329 | \times n}$, is a square matrix with ones on the diagonal and zeros 330 | everywhere else. That is, 331 | \[ I_{ij} = \left \{ \begin{array}{ll}1 & i = j \\ 0 & i \neq j 332 | \end{array} \right . \] 333 | It has the property that for all $A \in \mathbb{R}^{m \times n}$, \[AI 334 | = A = IA.\] Note that in some sense, the notation for the identity matrix 335 | is ambiguous, since it does not specify the dimension of $I$. Generally, 336 | the dimensions of $I$ are inferred from context so as to make matrix multiplication 337 | possible. For example, in the 338 | equation above, the $I$ in $AI = A$ is an $n \times n$ matrix, whereas 339 | the $I$ in $A = IA$ is an $m \times m$ matrix. 340 | 341 | A \textbf{\textit{diagonal matrix}} is a matrix where all non-diagonal 342 | elements are 0. This is typically denoted $D = \mathrm{diag}(d_1, 343 | d_2, \ldots, d_n)$, with 344 | \[ D_{ij} = \left \{ \begin{array}{ll}d_i & i = j \\ 0 & i \neq 345 | j\end{array} \right . \] 346 | Clearly, $I = \mathrm{diag}(1,1,\ldots,1)$. 347 | 348 | \subsection{The Transpose} 349 | The \textbf{\textit{transpose}} of a matrix results from ``flipping'' 350 | the rows and columns. Given a matrix $A \in \mathbb{R}^{m \times n}$, 351 | its transpose, written $A^T \in \mathbb{R}^{n \times m}$, is the $n \times m$ matrix 352 | whose entries are given by 353 | \[ (A^T)_{ij} = A_{ji}.\] 354 | We have in fact already been using the transpose when describing row 355 | vectors, since the transpose of a column vector is naturally a row 356 | vector. 357 | 358 | The following properties of transposes are easily verified: 359 | \begin{itemize} 360 | \item $(A^T)^T = A$ 361 | \item $(AB)^T = B^T A^T$ 362 | \item $(A + B)^T = A^T + B^T$ 363 | \end{itemize} 364 | 365 | \subsection{Symmetric Matrices} 366 | A square matrix $A \in \mathbb{R}^{n \times n}$ is 367 | \textbf{\textit{symmetric}} if $A = A^T$. It is 368 | \textbf{\textit{anti-symmetric}} if $A = -A^T$. It is easy to show 369 | that for any matrix $A \in \mathbb{R}^{n \times n}$, the matrix 370 | $A + A^T$ is symmetric and the matrix $A - A^T$ is anti-symmetric. 371 | From this it follows that any square matrix $A \in \mathbb{R}^{n 372 | \times n}$ can be represented as a sum of a symmetric matrix and an 373 | anti-symmetric matrix, since 374 | \[A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)\] 375 | and the first matrix on the right is symmetric, while the second is 376 | anti-symmetric. It turns out that symmetric matrices occur a great 377 | deal in practice, and they have many nice properties which we will 378 | look at shortly. 
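As a quick numerical illustration of the decomposition above (a minimal NumPy sketch; the random matrix and the checks are purely illustrative):
\begin{verbatim}
import numpy as np

A = np.random.default_rng(1).standard_normal((4, 4))
S = 0.5 * (A + A.T)            # symmetric part
K = 0.5 * (A - A.T)            # anti-symmetric part

assert np.allclose(S, S.T)     # S is symmetric
assert np.allclose(K, -K.T)    # K is anti-symmetric
assert np.allclose(S + K, A)   # the two parts sum back to A
\end{verbatim}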
It is common to denote the set of all symmetric 379 | matrices of size $n$ as $\mathbb{S}^n$, so that $A \in \mathbb{S}^n$ 380 | means that $A$ is a symmetric $n \times n$ matrix; 381 | 382 | \subsection{The Trace} 383 | The \textbf{\textit{trace}} of a square matrix $A \in \mathbb{R}^{n 384 | \times n}$, denoted $\mathrm{tr}(A)$ (or just $\mathrm{tr}A$ if the 385 | parentheses are obviously implied), is the 386 | sum of diagonal elements in the matrix: 387 | \[\mathrm{tr}A = \sum_{i=1}^n A_{ii}.\] 388 | As described in the CS229 lecture notes, the trace has the following 389 | properties (included here for the sake of completeness): 390 | \begin{itemize} 391 | \item For $A \in \mathbb{R}^{n \times n}$, $\mathrm{tr}A = 392 | \mathrm{tr}A^T$. 393 | \item For $A,B \in \mathbb{R}^{n \times n}$, $\mathrm{tr}(A + B) = 394 | \mathrm{tr}A + \mathrm{tr}B$. 395 | \item For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, 396 | $\mathrm{tr}(tA) = t \; \mathrm{tr}A$. 397 | \item For $A,B$ such that $AB$ is square, $\mathrm{tr}AB = 398 | \mathrm{tr}BA$. 399 | \item For $A,B,C$ such that $ABC$ is square, $\mathrm{tr}ABC = 400 | \mathrm{tr}BCA = \mathrm{tr}CAB$, and so on for the product of 401 | more matrices. 402 | \end{itemize} 403 | 404 | As an example of how these properties can be proven, we'll consider the 405 | fourth property given above. Suppose that $A \in \mathbb{R}^{m \times n}$ 406 | and $B \in \mathbb{R}^{n \times m}$ (so that $AB \in \mathbb{R}^{m \times m}$ is a square matrix). 407 | Observe that $BA \in \mathbb{R}^{n \times n}$ is also a square matrix, so it makes sense 408 | to apply the trace operator to it. To verify that $\mathrm{tr} AB = \mathrm{tr} BA$, note that 409 | \begin{eqnarray*} 410 | \mathrm{tr} AB &=& \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \left(\sum_{j=1}^n A_{ij} B_{ji} \right) \\ 411 | &=& \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ji} = \sum_{j=1}^n \sum_{i=1}^m B_{ji} A_{ij} \\ 412 | &=& \sum_{j=1}^n \left(\sum_{i=1}^m B_{ji} A_{ij}\right) = \sum_{j=1}^n (BA)_{jj} = \mathrm{tr} BA. 413 | \end{eqnarray*} 414 | Here, the first and last two equalities use the definition of the trace operator and matrix multiplication. 415 | The fourth equality, where the main work occurs, uses the commutativity of scalar multiplication in order 416 | to reverse the order of the terms in each product, and the commutativity and associativity of scalar addition 417 | in order to rearrange the order of the summation. 418 | 419 | \commentout{ 420 | As an example of how these properties can be proven, we'll consider the last 421 | property given above. Suppose that $A \in \mathbb{R}^{m \times n}$, 422 | $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{p \times m}$ (so that $ABC \in \mathbb{R}^{m \times m}$ is a square matrix). 423 | In this case, it is easy to see that $BCA \in \mathbb{R}^{n \times n}$ is also 424 | a square matrix. To verify that $\mathrm{tr}ABC = \mathrm{tr}BCA$, observe that 425 | \begin{eqnarray*} 426 | \mathrm{tr} ABC &=& \sum_{i=1}^m (ABC)_{ii} = \sum_{i=1}^m \left(\sum_{j=1}^p (AB)_{ij} C_{ji} \right) = \sum_{i=1}^m \left(\sum_{j=1}^p \left( \sum_{k=1}^n A_{ik} B_{kj} \right) C_{ji} \right) \\ 427 | &=& \sum_{i=1}^m \sum_{j=1}^p \sum_{k=1}^n A_{ik} B_{kj} C_{ji} = \sum_{k=1}^n \sum_{i=1}^m \sum_{j=1}^p B_{kj} C_{ji} A_{ik} \\ 428 | &=& \sum_{k=1}^n \sum_{i=1}^m \left( \sum_{j=1}^p B_{kj} C_{ji} \right) A_{ik} = \sum_{k=1}^n \sum_{i=1}^m (BC)_{ki} A_{ik} \\ 429 | &=& \sum_{k=1}^n (BCA)_{kk} = \mathrm{tr} BCA. 
430 | \end{eqnarray*} 431 | } 432 | 433 | \subsection{Norms} 434 | 435 | A \textbf{\textit{norm}} of a vector $\|x\|$ is informally a measure of 436 | the ``length'' of the vector. For example, we have the 437 | commonly-used Euclidean or $\ell_2$ norm, 438 | \[\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.\] 439 | Note that $\|x\|_2^2 = x^Tx$. 440 | 441 | More formally, a norm is any function $f : \mathbb{R}^{n} \rightarrow 442 | \mathbb{R}$ that satisfies 4 properties: 443 | \begin{enumerate} 444 | \item For all $x \in \mathbb{R}^n$, $f(x) \geq 0$ (non-negativity). 445 | \item $f(x) = 0$ if and only if $x = 0$ (definiteness). 446 | \item For all $x \in \mathbb{R}^n$, $t \in \mathbb{R}$, $f(tx) = 447 | |t|f(x)$ (homogeneity). 448 | \item For all $x,y \in \mathbb{R}^n$, $f(x + y) \leq f(x) + f(y)$ 449 | (triangle inequality). 450 | \end{enumerate} 451 | Other examples of norms are the $\ell_1$ norm, 452 | \[\|x\|_1 = \sum_{i=1}^n |x_i| \] 453 | and the $\ell_\infty$ norm, 454 | \[\|x\|_\infty = \mathrm{max}_i\, |x_i|.\] 455 | In fact, all three norms presented so far are examples of the family 456 | of $\ell_p$ norms, which are parameterized by a real number $p \geq 457 | 1$, and defined as 458 | \[\|x\|_p = \left ( \sum_{i=1}^n |x_i|^p \right )^{1/p}.\] 459 | 460 | Norms can also be defined for matrices, such as the Frobenius norm, 461 | \[\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} = 462 | \sqrt{\mathrm{tr}(A^T A)}. \] Many other norms exist, but they are 463 | beyond the scope of this review. 464 | 465 | \subsection{Linear Independence and Rank} 466 | A set of vectors $\{x_1, x_2, \ldots x_n\} \subset \mathbb{R}^m\ $ is said to be 467 | \textbf{\textit{(linearly) independent}} if no vector can be represented 468 | as a linear combination of the remaining vectors. Conversely, if one 469 | vector belonging to the set \textit{can} be represented as a linear combination of 470 | the remaining vectors, then the vectors are said to be \textbf{\textit{(linearly) 471 | dependent}}. That is, if 472 | \[x_n = \sum_{i=1}^{n-1} \alpha_i x_i\] 473 | for some scalar values $\alpha_1, \ldots, \alpha_{n-1} \in \mathbb{R}$, then we say that 474 | the vectors $x_1, \ldots, x_n$ are linearly dependent; otherwise, the vectors are 475 | linearly independent. For example, the vectors 476 | \[ 477 | x_1 = \left[ \begin{array}{c} 1 \\ 2 \\ 3 \end{array} \right] \quad 478 | x_2 = \left[ \begin{array}{c} 4 \\ 1 \\ 5 \end{array} \right] \quad 479 | x_3 = \left[ \begin{array}{c} 2 \\ -3 \\ -1 \end{array} \right] 480 | \] 481 | are linearly dependent because $x_3 = -2x_1 + x_2$. 482 | 483 | The \textbf{\textit{column rank}} of a matrix $A \in \mathbb{R}^{m \times n}$ is the 484 | size of the largest subset of columns of $A$ that constitute a linearly independent 485 | set. With some abuse of terminology, this is often referred to simply as the number of linearly 486 | independent columns of $A$. In the same way, the 487 | \textbf{\textit{row rank}} is the largest number of rows of $A$ that 488 | constitute a linearly independent set. 489 | 490 | For any matrix $A \in \mathbb{R}^{m \times n}$, it turns out that 491 | the column rank of $A$ is equal to the row rank of $A$ (though we will not prove this), and so both quantities 492 | are referred to collectively as the \textbf{\textit{rank}} of $A$, denoted as 493 | $\mathrm{rank}(A)$. The following are some basic properties of the 494 | rank: 495 | \begin{itemize} 496 | \item For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) \leq 497 | \mathrm{min}(m,n)$. 
If $\mathrm{rank}(A) = \mathrm{min}(m,n)$,
498 | then $A$ is said to be \textbf{\textit{full rank}}.
499 | \item For $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) =
500 | \mathrm{rank}(A^T)$.
501 | \item For $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n
502 | \times p}$, $\mathrm{rank}(AB) \leq \mathrm{min}(\mathrm{rank}(A),
503 | \mathrm{rank}(B))$.
504 | \item For $A,B \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A + B)
505 | \leq \mathrm{rank}(A) + \mathrm{rank}(B)$.
506 | \end{itemize}
507 |
508 | \subsection{The Inverse of a Square Matrix}
509 |
510 | The \textbf{\textit{inverse}} of a square matrix $A \in \mathbb{R}^{n
511 | \times n}$ is denoted $A^{-1}$, and is the unique matrix such that
512 | \[A^{-1} A = I = A A^{-1}.\]
513 | Note that not all matrices have inverses. Non-square matrices, for example,
514 | do not have inverses by definition. However, for some square matrices
515 | $A$, it may still be the case that $A^{-1}$ does not exist. In particular,
516 | we say that $A$ is \textbf{\textit{invertible}} or
517 | \textbf{\textit{non-singular}} if $A^{-1}$ exists and
518 | \textbf{\textit{non-invertible}} or \textbf{\textit{singular}}
519 | otherwise.\footnote{It's easy to get confused and think that non-singular means non-invertible. But in fact, it means the opposite! Watch out!}
520 |
521 | In order for a square matrix $A$ to have an inverse $A^{-1}$,
522 | $A$ must be full rank. We will soon see that there are many
523 | alternative sufficient and necessary conditions, in addition to full
524 | rank, for invertibility.
525 |
526 | The following are properties of the inverse;
527 | all assume that $A, B \in \mathbb{R}^{n \times n}$ are non-singular:
528 | \begin{itemize}
529 | \item $(A^{-1})^{-1} = A$
530 | \item $(AB)^{-1} = B^{-1} A^{-1}$
531 | \item $(A^{-1})^T = (A^T)^{-1}$. For this reason this matrix is often
532 | denoted $A^{-T}$.
533 | \end{itemize}
534 | As an example of how the inverse is used, consider the linear system of equations,
535 | $Ax = b$ where $A \in \mathbb{R}^{n \times n}$, and $x, b \in \mathbb{R}^n$. If
536 | $A$ is nonsingular (i.e., invertible), then $x = A^{-1} b$.
537 |
538 | (What if $A \in \mathbb{R}^{m \times n}$
539 | is not a square matrix? Does this work?)
540 |
541 | \Tnote{Add more about pseudo-inverse}
542 |
543 | \subsection{Orthogonal Matrices}
544 | Two vectors $x,y \in \mathbb{R}^n$ are \textbf{\textit{orthogonal}} if
545 | $x^T y = 0$. A vector $x \in \mathbb{R}^n$ is
546 | \textbf{\textit{normalized}} if $\|x\|_2 = 1$. A square matrix $U \in
547 | \mathbb{R}^{n \times n}$ is \textbf{\textit{orthogonal}} (note the
548 | different meanings when talking about vectors versus matrices) if all
549 | its columns are orthogonal to each other and are normalized (the
550 | columns are then referred to as being \textbf{\textit{orthonormal}}).
551 |
552 | It follows immediately from the definition of orthogonality and
553 | normality that
554 | \[U^TU = I = UU^T.\]
555 | In other words, the inverse of an orthogonal matrix is its transpose.
556 | Note that if $U$ is not square --- i.e., $U \in \mathbb{R}^{m \times
557 | n},\;\; n < m$ --- but its columns are still orthonormal, then $U^TU
558 | = I$, but $U U^T \neq I$. We generally only use the term orthogonal
559 | to describe the previous case, where $U$ is square.
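To make the last two subsections concrete, here is a small NumPy sketch (illustrative only; in practice one prefers a linear solver over forming $A^{-1}$ explicitly) that solves $Ax = b$ both ways and checks $U^TU = UU^T = I$ for an orthogonal matrix obtained from a QR factorization:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Solving Ax = b with the explicit inverse and with a linear solver;
# the solver is numerically preferable, but both agree here.
x_inv = np.linalg.inv(A) @ b
x_solve = np.linalg.solve(A, b)
assert np.allclose(x_inv, x_solve)
assert np.allclose(A @ x_solve, b)

# An orthogonal matrix from a QR factorization: U^T U = U U^T = I.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(U @ U.T, np.eye(n))
\end{verbatim}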
560 |
561 | Another nice property of orthogonal matrices is that operating on a
562 | vector with an orthogonal matrix will not change its Euclidean norm,
563 | i.e.,
564 | \begin{align}
565 | \|Ux\|_2 = \|x\|_2 \label{eqn:preserve-norm}
566 | \end{align}
567 | for any $x \in \mathbb{R}^n$, $U \in \mathbb{R}^{n \times n}$
568 | orthogonal.
569 |
570 | \subsection{Range and Nullspace of a Matrix}
571 | The \textbf{\textit{span}} of a set of vectors $\{x_1, x_2, \ldots
572 | x_n\}$ is the set of all vectors that can be expressed as a linear
573 | combination of $\{x_1, \ldots, x_n\}$. That is,
574 | \[\mathrm{span}(\{x_1, \ldots x_n\}) = \left \{v : v = \sum_{i=1}^n \alpha_i
575 | x_i, \;\; \alpha_i \in
576 | \mathbb{R} \right \}. \]
577 | It can be shown that if $\{x_1, \ldots, x_n\}$ is a set of $n$
578 | linearly independent vectors, where each $x_i \in \mathbb{R}^n$, then
579 | $\mathrm{span}(\{x_1, \ldots x_n\}) = \mathbb{R}^n$. In other words,
580 | \textit{any} vector $v \in \mathbb{R}^n$ can be written as a linear
581 | combination of $x_1$ through $x_n$. The
582 | \textbf{\textit{projection}} of a vector $y \in \mathbb{R}^m$ onto the
583 | span of $\{x_1, \ldots, x_n\}$ (here we assume
584 | $x_i \in \mathbb{R}^m$) is the
585 | vector $v \in \mathrm{span}(\{x_1, \ldots x_n\})$, such that $v$ is as
586 | close as possible to $y$,
587 | as measured by the Euclidean norm $\|v - y\|_2$. We denote the
588 | projection as $\mathrm{Proj}(y;\{x_1, \ldots, x_n\})$ and can define
589 | it formally as,
590 | \[\mathrm{Proj}(y;\{x_1, \ldots x_n\}) = \textrm{argmin}_{v \in
591 | \mathrm{span}(\{x_1, \ldots, x_n\})} \|y - v\|_2.\]
592 |
593 | The \textbf{\textit{range}} (sometimes also called the columnspace)
594 | of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$,
595 | is the span of the columns of $A$. In other words,
596 | \[\mathcal{R}(A) = \{v \in \mathbb{R}^m: v = Ax, x \in
597 | \mathbb{R}^{n}\}.\] Making a
598 | few technical assumptions (namely that $A$ is full rank and that $n <
599 | m$), the projection of a vector $y \in \mathbb{R}^m$ onto the range of
600 | $A$ is given by,
601 | \[\mathrm{Proj}(y;A) = \mathrm{argmin}_{v \in \mathcal{R}(A)} \|v -
602 | y\|_2 = A (A^T A)^{-1}A^T y\;\;.\] This last equation should look
603 | extremely familiar, since it is almost the same formula we derived in
604 | class (and which we will soon derive again)
605 | for the least squares estimation of parameters. Looking at the
606 | definition for the projection, it should not be too hard to convince
607 | yourself that this is in fact the same objective that we minimized in
608 | our least squares problem (except for a squaring of the norm, which
609 | doesn't affect the optimal point) and so these problems are naturally
610 | very connected. When $A$ contains only a single column, $a \in
611 | \mathbb{R}^m$, this gives the special case for a projection of a
612 | vector onto a line:
613 | \[\mathrm{Proj}(y;a) = \frac{a a^T}{a^T a}y\;\;.\]
614 |
615 | The \textbf{\textit{nullspace}} of a matrix $A \in \mathbb{R}^{m
616 | \times n}$, denoted $\mathcal{N}(A)$, is the set of all
617 | vectors that equal 0 when multiplied by $A$, i.e.,
618 | \[\mathcal{N}(A) = \{x \in \mathbb{R}^n : Ax = 0\}.\] Note that
619 | vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in
620 | $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}(A^T)$
621 | and $\mathcal{N}(A)$ are both in $\mathbb{R}^n$. In fact, we can
622 | say much more.
It turns out that 623 | \[\left \{ w : w = u + v, u \in \mathcal{R}(A^T), v \in \mathcal{N}(A) 624 | \right \} = \mathbb{R}^n \mbox{ and 625 | } \mathcal{R}(A^T) \cap \mathcal{N}(A) = \{\mathbf{0}\}\;\;.\] 626 | In other words, $\mathcal{R}(A^T)$ and $\mathcal{N}(A)$ are disjoint 627 | subsets that together span the entire space of $\mathbb{R}^n$. Sets 628 | of this type are called \textbf{\textit{orthogonal complements}}, and 629 | we denote this $\mathcal{R}(A^T) = \mathcal{N}(A)^\perp$. 630 | 631 | \subsection{The Determinant} 632 | The \textbf{\textit{determinant}} of a square matrix $A \in 633 | \mathbb{R}^{n \times n}$, is a function $\mathrm{det} : \mathbb{R}^{n 634 | \times n} \rightarrow \mathbb{R}$, and is denoted $|A|$ or $\mathrm{det}\,A$ 635 | (like the trace operator, we usually omit parentheses). 636 | Algebraically, 637 | one could write down an explicit formula for the determinant of $A$, but 638 | this unfortunately gives little intuition about its meaning. Instead, 639 | we'll start out by providing a geometric interpretation of the determinant 640 | and then visit some of its specific algebraic properties afterwards. 641 | 642 | Given a matrix 643 | \[ \left [ \begin{array}{ccc} \mbox{---} & a^T_1 & 644 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\ 645 | \mbox{---} & a^T_n & \mbox{---} \end{array} \right ], \] 646 | consider the set of points $S \subset \mathbb{R}^n$ formed by taking 647 | all possible linear combinations of the row vectors $a_1, \ldots, a_n \in \mathbb{R}^n$ of $A$, 648 | where the coefficients of the linear combination are all between 0 and 1; 649 | that is, the set $S$ is the restriction of $\textrm{span}(\{a_1,\ldots,a_n\})$ 650 | to only those linear combinations whose coefficients $\alpha_1, \ldots, \alpha_n$ 651 | satisfy $0 \le \alpha_i \le 1$, $i=1,\ldots,n$. Formally, 652 | \[ S = \{ v \in \mathbb{R}^n : v = \sum_{i=1}^n \alpha_i a_i \mbox{ where } 0 \le \alpha_i \le 1, i=1,\ldots,n \}. \] 653 | The absolute value of the determinant of $A$, it turns out, is a measure of the ``volume'' of the set $S$.\footnote{ 654 | Admittedly, we have not actually defined what we mean by ``volume'' here, but hopefully the intuition should be clear enough. 655 | When $n=2$, our notion of ``volume'' corresponds to the area of $S$ in the Cartesian plane. When $n=3$, ``volume'' corresponds 656 | with our usual notion of volume for a three-dimensional object. 657 | } 658 | 659 | For example, consider the $2 \times 2$ matrix, 660 | \begin{equation} 661 | A = \left[ \begin{array}{cc} 1 & 3 \\ 3 & 2 \end{array} \right]. \label{eq:det-example} 662 | \end{equation} 663 | Here, the rows of the matrix are 664 | \[ a_1 = \left[ \begin{array}{c} 1 \\ 3 \end{array} \right] \quad a_2 = \left[ \begin{array}{c} 3 \\ 2 \end{array} \right]. \] 665 | The set $S$ corresponding to these rows is shown in Figure~\ref{fig:determinant}. For two-dimensional matrices, 666 | $S$ generally has the shape of a \emph{parallelogram}. In our example, the value of the determinant is $|A| = -7$ (as can be computed 667 | using the formulas shown later in this section), so the area of the parallelogram is $7$. (Verify this for yourself!) 668 | 669 | In three dimensions, the set $S$ corresponds to an object known as a \emph{parallelepiped} (a three-dimensional box with skewed sides, such that 670 | every face has the shape of a parallelogram). 
The absolute value of the determinant of the $3 \times 3$ matrix whose rows define $S$ gives
671 | the three-dimensional volume of the parallelepiped. In even higher dimensions, the set $S$ is an object known as an $n$-dimensional \emph{parallelotope}.
672 | %
673 |
674 | %\begin{figure}
675 | % \vskip 5.0cm
676 | % \psline[linewidth=0.05]{->}(6,0)(11,0)
677 | % \psline[linewidth=0.05]{->}(6,0)(6,5)
678 | % \pspolygon[linewidth=0.05,fillstyle=solid,fillcolor=lightgray](6,0)(7,3)(10,5)(9,2)
679 | % \rput(6.3,1.9){$a_1$}
680 | % \rput(8,0.8){$a_2$}
681 | % \psline[linewidth=0.1]{->}(6,0)(7,3)
682 | % \psline[linewidth=0.1]{->}(6,0)(9,2)
683 | % \rput(6.5,3.2){$(1,3)$}
684 | % \rput(9.5,1.9){$(3,2)$}
685 | % \rput(10.2,5.4){$(4,5)$}
686 | % \rput(5.4,0){$(0,0)$}
687 | % \caption{
688 | % Illustration of the determinant for the $2 \times 2$ matrix $A$ given in \eqref{eq:det-example}. Here, $a_1$ and $a_2$ are vectors
689 | % corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The
690 | % absolute value of the determinant, $|\textrm{det} A| = 7$, is the area of the parallelogram.
691 | % }
692 | % \label{fig:determinant}
693 | %\end{figure}
694 |
695 | \begin{figure}[t]
696 | \begin{center}
697 | \includegraphics[width=0.6\textwidth]{figures/figure}
698 | \caption{
699 | Illustration of the determinant for the $2 \times 2$ matrix $A$ given in \eqref{eq:det-example}. Here, $a_1$ and $a_2$ are vectors
700 | corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The
701 | absolute value of the determinant, $|\textrm{det} A| = 7$, is the area of the parallelogram.
702 | } \label{fig:determinant}
703 | \end{center}
704 |
705 | \end{figure}
706 |
707 | Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the
708 | general formula):
709 | \begin{enumerate}
710 | \item The determinant of the identity is 1, $|I| = 1$. (Geometrically, the volume of a unit hypercube is 1).
711 | \item Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a
712 | single row in $A$ by a scalar $t \in \mathbb{R}$, then the
713 | determinant of the new matrix is $t |A|$,
714 | \[\left | \left [ \begin{array}{ccc} \mbox{---} & t \; a^T_1 &
715 | \mbox{---} \\ \mbox{---} & a^T_2 & \mbox{---} \\ & \vdots & \\
716 | \mbox{---} & a^T_n & \mbox{---} \end{array} \right ] \right |= t
717 | |A|. \]
718 | (Geometrically, multiplying one of the sides of the set $S$ by a factor $t$
719 | causes the volume to increase by a factor $t$.)
720 | \item If we exchange any two rows $a_i^T$ and $a_j^T$ of $A$, then the
721 | determinant of the new matrix is $-|A|$, for example
722 | \[\left | \left [ \begin{array}{ccc} \mbox{---} & a^T_2 &
723 | \mbox{---} \\ \mbox{---} & a^T_1 & \mbox{---} \\ & \vdots & \\
724 | \mbox{---} & a^T_n & \mbox{---} \end{array} \right ] \right |=
725 | -|A|. \]
726 | \end{enumerate}
727 | In case you are wondering, it is not immediately obvious that a function satisfying
728 | the above three properties exists. In fact, though, such a function does exist,
729 | and is unique (which we will not prove here).
730 |
731 | Several properties that follow from the three properties above include:
732 | \begin{itemize}
733 | \item For $A \in \mathbb{R}^{n \times n}$, $|A| = |A^T|$.
734 | \item For $A, B \in \mathbb{R}^{n \times n}$, $|AB| = |A||B|$.
735 | \item For $A \in \mathbb{R}^{n \times n}$, $|A| = 0$ if and only if 736 | $A$ is singular (i.e., non-invertible). 737 | (If $A$ is singular then it does not have full rank, and hence 738 | its columns are linearly dependent. In this case, the set $S$ corresponds to a 739 | ``flat sheet'' within the $n$-dimensional space and hence has zero volume.) 740 | \item For $A \in \mathbb{R}^{n \times n}$ and $A$ non-singular, 741 | $|A^{-1}| = 1/|A|$. 742 | \end{itemize} 743 | 744 | Before giving the general definition for the determinant, we define, 745 | for $A \in \mathbb{R}^{n \times n}$, $A_{\setminus i,\setminus j} \in 746 | \mathbb{R}^{(n-1) \times (n-1)}$ to be the \textit{matrix} that 747 | results from deleting the $i$th row and $j$th column from $A$. The 748 | general (recursive) formula for the determinant is 749 | \begin{eqnarray*} 750 | |A| & = & \sum_{i=1}^n (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus 751 | j}| \;\;\;\;\;\mbox{(for any $j \in 1,\ldots, n$)} \\ 752 | & = & \sum_{j=1}^n (-1)^{i+j} a_{ij} |A_{\setminus i, \setminus j}| 753 | \;\;\;\;\;\mbox{(for any $i \in 1,\ldots, n$)} 754 | \end{eqnarray*} 755 | with the initial case that $|A| = a_{11}$ for $ A \in 756 | \mathbb{R}^{1 \times 1}$. If we were to expand this formula 757 | completely for $A \in \mathbb{R}^{n \times n}$, there would be a total 758 | of $n!$ ($n$ factorial) different terms. For this reason, we hardly 759 | ever explicitly write the complete equation of the determinant for 760 | matrices bigger than $3 \times 3$. However, the equations for 761 | determinants of matrices up to size $3 \times 3$ are fairly common, 762 | and it is good to know them: 763 | \begin{eqnarray*} 764 | \left | [a_{11}] \right | & = & a_{11} \\ 765 | \left | \left [ \begin{array}{cc} a_{11} & a_{12} \\ a_{21} & a_{22} 766 | \end{array} \right ] \right | & = & a_{11} a_{22} - a_{12} a_{21} \\ 767 | \left | \left [ \begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} 768 | & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{array} \right ] 769 | \right | & = &\begin{array}{l} a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + 770 | a_{13} a_{21} a_{32} \\ \;\;\;\;\; - a_{11} a_{23} a_{32} - a_{12} 771 | a_{21} a_{33} - a_{13} a_{22} a_{31} \end{array} 772 | \end{eqnarray*} 773 | 774 | The \textbf{\textit{classical adjoint}} (often just called the 775 | adjoint) of a matrix $A \in \mathbb{R}^{n \times n}$, is denoted 776 | $\mathrm{adj}(A)$, and defined as 777 | \[\mathrm{adj}(A) \in \mathbb{R}^{n \times n}, \;\;\; 778 | (\mathrm{adj}(A))_{ij} = (-1)^{i+j} |A_{\setminus j, \setminus 779 | i}|\;\;\] 780 | (note the switch in the indices $A_{\setminus j, \setminus i}$). It 781 | can be shown that for any nonsingular $A \in \mathbb{R}^{n \times 782 | n}$, 783 | \[A^{-1} = \frac{1}{|A|}\mathrm{adj}(A)\;\;.\] 784 | While this is a nice ``explicit'' formula for the inverse of matrix, 785 | we should note that, numerically, there are in fact much more 786 | efficient ways of computing the inverse. 787 | 788 | \subsection{Quadratic Forms and Positive Semidefinite Matrices} 789 | Given a square matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x 790 | \in \mathbb{R}^n$, the scalar value $x^T A x$ is called a 791 | \textbf{\textit{quadratic form}}. 
Written explicitly, we see that
792 | \[x^T A x = \sum_{i=1}^n x_i (Ax)_i = \sum_{i=1}^n x_i \left(\sum_{j=1}^n A_{ij} x_j\right) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j\;\;.\]
793 | Note that,
794 | \[x^T A x = (x^T A x)^T = x^T A^T x = x^T\left(\frac{1}{2} A +
795 | \frac{1}{2}A^T\right)x,\]
796 | where the first equality follows from the fact that the transpose
797 | of a scalar is equal to itself, and the third equality follows
798 | from the fact that we are averaging two quantities which are themselves equal.
799 | From this, we can conclude that
800 | only the symmetric part of $A$ contributes to the quadratic
801 | form. For this reason, we often implicitly assume that the matrices
802 | appearing in a quadratic form are symmetric.
803 |
804 | We give the following definitions:
805 | \begin{itemize}
806 | \item A symmetric matrix $A \in \mathbb{S}^n$ is
807 | \textbf{\textit{positive definite}} (PD) if for all non-zero
808 | vectors $x \in \mathbb{R}^n$, $x^T A x > 0$. This is usually
809 | denoted $A \succ 0$ (or just $A > 0$), and oftentimes the set of
810 | all positive definite matrices is denoted $\mathbb{S}^n_{++}$.
811 |
812 |
813 | \item A symmetric matrix $A \in \mathbb{S}^n$ is
814 | \textbf{\textit{positive semidefinite}} (PSD) if for all vectors
815 | $x \in \mathbb{R}^n$, $x^T A x \geq 0$. This is written $A \succeq 0$ (or just $A \geq
816 | 0$), and the set of all positive semidefinite matrices is often
817 | denoted $\mathbb{S}^n_+$.
818 |
819 | \item Likewise, a symmetric matrix $A \in
820 | \mathbb{S}^n$ is \textbf{\textit{negative definite}} (ND), denoted $A
821 | \prec 0$ (or just $A < 0$) if for all non-zero $x \in
822 | \mathbb{R}^n$, $x^T A x < 0$.
823 |
824 | \item Similarly, a symmetric matrix $A \in \mathbb{S}^n$ is
825 | \textbf{\textit{negative semidefinite}} (NSD), denoted $A \preceq
826 | 0$ (or just $A \leq 0$) if for all $x \in \mathbb{R}^n$, $x^T A x
827 | \leq 0$.
828 |
829 | \item Finally, a symmetric matrix $A \in
830 | \mathbb{S}^n$ is \textbf{\textit{indefinite}}, if it is neither
831 | positive semidefinite nor negative semidefinite --- i.e., if there
832 | exist $x_1, x_2 \in \mathbb{R}^n$ such that $x_1^T A x_1 > 0$ and
833 | $x_2^T A x_2 < 0$.
834 |
835 | \end{itemize}
836 |
837 | It should be obvious that if $A$ is positive definite, then $-A$ is
838 | negative definite and vice versa. Likewise, if $A$ is positive
839 | semidefinite then $-A$ is negative semidefinite and vice versa. If
840 | $A$ is indefinite, then so is $-A$.
841 |
842 | One important property of positive definite and negative definite matrices
843 | is that they are always full rank, and hence invertible. To see why
844 | this is the case, suppose that some matrix $A \in \mathbb{R}^{n \times n}$
845 | is not full rank. Then, suppose that the $j$th column of $A$ is expressible
846 | as a linear combination of the other $n-1$ columns:
847 | \[ a_j = \sum_{i \neq j} x_i a_i, \]
848 | for some $x_1,\ldots,x_{j-1}, x_{j+1}, \ldots,x_{n} \in \mathbb{R}$. Setting $x_j = -1$, we have
849 | \[ Ax = \sum_{i=1}^n x_i a_i = 0. \]
850 | But this implies $x^T Ax = 0$ for some non-zero vector $x$, so $A$ must be
851 | neither positive definite nor negative definite. Therefore, if $A$ is
852 | either positive definite or negative definite, it must be full rank.
853 |
854 | Finally, there is one type of positive definite matrix that comes up
855 | frequently, and so deserves some special mention.
Given any matrix $A
856 | \in \mathbb{R}^{m \times n}$ (not necessarily symmetric or even
857 | square), the matrix $G = A^T A$ (sometimes called a
858 | \textbf{\textit{Gram matrix}}) is always positive semidefinite.
859 | Further, if $m \geq n$ (and we assume for convenience that $A$ is full
860 | rank), then $G = A^T A$ is positive definite.
861 |
862 | \subsection{Eigenvalues and Eigenvectors}
863 |
864 | Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that
865 | $\lambda \in \mathbb{C}$ is an \textbf{\textit{eigenvalue}} of $A$ and
866 | $x \in \mathbb{C}^n$ is the corresponding
867 | \textbf{\textit{eigenvector}}\footnote{Note that $\lambda$ and the
868 | entries of $x$ are actually in
869 | $\mathbb{C}$, the set of complex numbers, not just the reals; we
870 | will see shortly why this is necessary. Don't worry about this
871 | technicality for now; you can think of complex vectors in the same way
872 | as real vectors.} if
873 | \[Ax = \lambda x, \;\;\; x \neq 0. \]
874 | Intuitively, this definition
875 | means that multiplying $A$ by the vector $x$ results in a new vector
876 | that points in the same direction as $x$, but scaled by a factor
877 | $\lambda$. Also note that for any eigenvector $x \in \mathbb{C}^n$,
878 | and scalar $c \in \mathbb{C}$,
879 | $A(cx) = cAx = c \lambda x = \lambda(cx)$, so $cx$ is also an
880 | eigenvector. For this reason when we talk about ``the'' eigenvector
881 | associated with $\lambda$, we usually assume that the eigenvector is
882 | normalized to have length 1 (this still creates some ambiguity, since
883 | $x$ and $-x$ will both be eigenvectors, but we will have to live with
884 | this).
885 |
886 | We can rewrite the equation above to state that $(\lambda, x)$ is an
887 | eigenvalue-eigenvector pair of $A$ if,
888 | \[(\lambda I - A)x = 0, \;\;\; x \neq 0.\]
889 | But $(\lambda I - A)x = 0$ has a non-zero solution for $x$ if and only
890 | if $(\lambda I - A)$ has a non-trivial nullspace, which is only the case
891 | if $(\lambda I - A)$ is singular, i.e.,
892 | \[|(\lambda I - A)| = 0.\]
893 |
894 |
895 | We can now use the previous definition of the determinant to expand
896 | this expression $|(\lambda I - A)|$ into a (very large) polynomial in $\lambda$
897 | of degree $n$. This polynomial is often called the characteristic polynomial of the matrix $A$.
898 |
899 | We then find the $n$
900 | (possibly complex) roots of this characteristic polynomial and denote them by $\lambda_1, \ldots, \lambda_n$. These are all the eigenvalues of the matrix $A$, but we note that they may not be distinct.
901 | To find the eigenvector
902 | corresponding to the eigenvalue $\lambda_i$, we simply solve the
903 | linear equation $(\lambda_i I - A)x = 0$, which is guaranteed to have a non-zero solution because $\lambda_i I-A$ is singular (though there may be multiple or infinitely many solutions).
904 |
905 |
906 |
907 |
908 | It should be noted that
909 | this is not the method which is actually used in practice to
910 | numerically compute the eigenvalues and eigenvectors (remember that
911 | the complete expansion of the determinant has $n!$ terms); it is
912 | rather a mathematical argument.
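In practice, then, one simply calls a numerical eigensolver. The sketch below (illustrative only; \texttt{numpy.linalg.eig} stands in for whatever routine your environment provides) recovers eigenvalue/eigenvector pairs of a random matrix, checks $Ax = \lambda x$, and also verifies that a Gram matrix $A^TA$ has non-negative eigenvalues, as claimed in the previous subsection:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

# A numerical eigensolver returns (possibly complex) eigenvalue/eigenvector pairs.
lam, X = np.linalg.eig(A)
for i in range(4):
    assert np.allclose(A @ X[:, i], lam[i] * X[:, i])   # A x = lambda x

# The Gram matrix G = A^T A from the previous subsection is positive
# semidefinite: its eigenvalues are real and non-negative.
G = A.T @ A
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)
\end{verbatim}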
913 |
914 | The following are properties of
915 | eigenvalues and eigenvectors (in all cases assume $A \in \mathbb{R}^{n
916 | \times n}$ has eigenvalues $\lambda_1, \ldots, \lambda_n$):
917 | \begin{itemize}
918 | \item The trace of $A$ is equal to the sum of its eigenvalues,
919 | \[\mathrm{tr}A = \sum_{i=1}^n \lambda_i.\]
920 | \item The determinant of $A$ is equal to the product of its
921 | eigenvalues,
922 | \[|A| = \prod_{i=1}^n \lambda_i.\]
923 | \item The rank of $A$ is equal to the number of non-zero eigenvalues
924 | of $A$ (counted with multiplicity), provided that $A$ is diagonalizable; in particular, this holds for the symmetric matrices discussed below.
925 | \item Suppose $A$ is non-singular with eigenvalue $\lambda$ and an associated eigenvector $x$. Then $1/\lambda$ is an eigenvalue of
926 | $A^{-1}$ with an associated eigenvector $x$, i.e., $A^{-1}x =
927 | (1/\lambda)x$. (To prove this, take the eigenvector equation
928 | $A x = \lambda x$ and left-multiply each side by $A^{-1}$.)
929 | \item The eigenvalues of a diagonal matrix $D = \mathrm{diag}(d_1,
930 | \ldots, d_n)$ are just the diagonal entries $d_1, \ldots, d_n$.
931 | \end{itemize}
932 | %We can write all the eigenvector equations simultaneously as
933 | %\[AX = X\Lambda\]
934 | %where the columns of $X \in \mathbb{R}^{n \times n}$ are the
935 | %eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose entries
936 | %are the eigenvalues of $A$, i.e.,
937 | %\[X \in \mathbb{R}^{n \times n} = \left [\begin{array}{cccc} | & | &
938 | % & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{array}
939 | % \right ], \;\; \Lambda =
940 | % \mathrm{diag}(\lambda_1, \ldots, \lambda_n).\]
941 | %If the eigenvectors of $A$ are linearly independent, then the matrix
942 | %$X$ will be invertible, so $A = X\Lambda X^{-1}$. A matrix that can
943 | %be written in this form is called \textbf{\textit{diagonalizable}}.
944 |
945 |
946 | \subsection{Eigenvalues and Eigenvectors of Symmetric Matrices}
947 |
948 |
949 | The structure of the eigenvalues and eigenvectors of a general square matrix can be subtle to characterize. Fortunately, in most cases in machine learning, it suffices to deal with symmetric real matrices, whose eigenvalues and eigenvectors have remarkable properties.
950 |
951 |
952 | %Two remarkable properties come about when we look at the eigenvalues
953 | %and eigenvectors of a symmetric matrix $A \in \mathbb{S}^n$.
954 |
955 |
956 | Throughout this section, let's assume that $A$ is a symmetric real matrix. We have the following properties:
957 |
958 | \begin{itemize}
959 | \item[1.] All eigenvalues of $A$ are real numbers. We denote them by $\lambda_1,\dots, \lambda_n$.
960 | \item[2.] There exists a set of eigenvectors $u_1,\dots, u_n$ such that a) for all $i$, $u_i$ is an eigenvector with eigenvalue $\lambda_i$ and b) $u_1,\dots, u_n$ are unit vectors and orthogonal to each other.\footnote{Mathematically, we have $\forall i, Au_i = \lambda_i u_i$, $\|u_i\|_2=1$, and $\forall j\neq i, u_i^Tu_j=0$.
Moreover, we remark that it is not true that every choice of eigenvectors $u_1,\dots, u_n$ of a matrix $A$ satisfying a) is orthogonal: eigenvalues can be repeated, and the eigenvectors associated with a repeated eigenvalue need not be orthogonal to each other.}
961 | \end{itemize}
962 |
963 | Let $U$ be the orthonormal matrix that contains the $u_i$'s as columns:\footnote{Here for notational simplicity, we deviate from the notational convention for columns of matrices in the previous sections.}
964 |
965 | \begin{align}
966 | U = \left [
967 | \begin{array}{cccc} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | &
968 | | & & | \end{array} \right ]
969 | \end{align}
970 |
971 | Let $\Lambda =
972 | \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ be the diagonal matrix that contains $\lambda_1,\dots, \lambda_n$ as entries on the diagonal. Using the view of matrix-matrix multiplication in equation~\eqref{eqn:1} of Section~\ref{subsec:matrix-matrix}, we can verify that
973 | \begin{align}
974 | AU & = \left [
975 | \begin{array}{cccc} | & | & & | \\A u_1 & Au_2 & \cdots & Au_n \\ | &
976 | | & & | \end{array} \right ] = \left [
977 | \begin{array}{cccc} | & | & & | \\ \lambda_1 u_1 & \lambda_2 u_2 & \cdots & \lambda_n u_n \\ | &
978 | | & & | \end{array} \right ] = U\mathrm{diag}(\lambda_1, \ldots, \lambda_n) = U\Lambda \nonumber
979 | \end{align}
980 |
981 | Recalling that the orthonormal matrix $U$ satisfies $UU^T = I$ and using the equation above, we have
982 | \begin{align}
983 | A = AUU^T = U\Lambda U^T \label{eqn:3}
984 | \end{align}
985 |
986 | This new representation of $A$ as $U\Lambda U^T$ is often called the diagonalization of the matrix $A$. The term diagonalization comes from the fact that with such a representation, we can often effectively treat a symmetric matrix $A$ as a diagonal matrix --- which is much easier to understand --- w.r.t the basis defined by the eigenvectors $U$. We will elaborate on this below with several examples.
987 |
988 | \newcommand{\R}{\mathbb{R}}
989 | \paragraph{Background: representing a vector w.r.t. another basis.} Any orthonormal matrix $U = \left [
990 | \begin{array}{cccc} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | &
991 | | & & | \end{array} \right ] $ defines a new basis (coordinate system) of $\mathbb{R}^n$ in the following sense: any vector $x\in \mathbb{R}^n$ can be represented as a linear combination of $u_1,\dots, u_n$ with coefficients $\hat{x}_1,\dots, \hat{x}_n$:
992 | \begin{align}
993 | x = \hat{x}_1 u_1 + \dots + \hat{x}_n u_n = U\hat{x} \nonumber
994 | \end{align}
995 | where in the second equality we use the view of equation~\eqref{eqn:2}. Indeed, such an $\hat{x}$ exists and is unique:
996 | \begin{align}
997 | x = U\hat{x} \Leftrightarrow U^T x = \hat{x} \nonumber
998 | \end{align}
999 | In other words, the vector $\hat{x}= U^T x$ can serve as another representation of the vector $x$ w.r.t the basis defined by $U$.
1000 |
1001 |
1002 | \paragraph{``Diagonalizing'' matrix-vector multiplication.} With the setup above, we will see that left-multiplying by the matrix $A$ can be viewed as left-multiplying by a diagonal matrix w.r.t.\ the basis of the eigenvectors. Suppose $x$ is a vector and $\hat{x}$ is its representation w.r.t.\ the basis of $U$. Let $z = Ax$ be the matrix-vector product.
Now let's compute the representation of $z$ w.r.t.\ the basis of $U$.
1003 |
1004 | Then, again using the fact that $UU^T = U^TU = I$ and equation~\eqref{eqn:3}, we have that
1005 | \begin{align}
1006 | \hat{z} = U^T z = U^T Ax = U^T U\Lambda U^T x = \Lambda \hat{x} = \left [ \begin{array}{c} \lambda_1 \hat{x}_1 \\ \lambda_2 \hat{x}_2 \\ \vdots \\ \lambda_n \hat{x}_n \end{array} \right ]\nonumber
1007 | \end{align}
1008 | We see that left-multiplying by the matrix $A$ in the original space is equivalent to left-multiplying by the diagonal matrix $\Lambda$ w.r.t.\ the new basis, which is merely scaling each coordinate by the corresponding eigenvalue. %One of the key features here is that under the new basis, all the coordinates are scaled independently.
1009 |
1010 | Under the new basis, multiplying by a matrix multiple times becomes much simpler as well. For example, suppose $q = AAAx$. Deriving the analytical form of $q$ in terms of the entries of $A$ may be a nightmare under the original basis, but it is much easier under the new one:
1011 | \begin{align}
1012 | \hat{q} = U^T q = U^TAAAx = U^TU\Lambda U^TU\Lambda U^TU\Lambda U^Tx = \Lambda^3 \hat{x} = \left [ \begin{array}{c} \lambda_1^3 \hat{x}_1 \\ \lambda_2^3 \hat{x}_2 \\ \vdots \\ \lambda_n^3 \hat{x}_n \end{array} \right ]
1013 | \end{align}
1014 |
1015 |
1016 |
1017 |
1018 |
1019 |
1020 |
1021 |
1022 | %First, it can be shown that all the eigenvalues of $A$ are real.
1023 |
1024 |
1025 | %Secondly,
1026 | %the eigenvectors of $A$ are orthonormal, i.e., the matrix $X$ defined
1027 | %$above is an orthogonal matrix (for this reason, we denote the matrix
1028 | %f eigenvectors as $U$ in this case).
1029 |
1030 | %
1031 | %We can therefore represent $A$
1032 | %as $A = U\Lambda U^T$, remembering from above that the inverse of an
1033 | %orthogonal matrix is just its transpose.
1034 |
1035 | \paragraph{``Diagonalizing'' quadratic form.} As a direct corollary, the quadratic form $x^TAx$ can also be simplified under the new basis:
1036 | \begin{align}
1037 | x^TAx = x^TU\Lambda U^T x = \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \label{eqn:diag-quadratic}
1038 | \end{align}
1039 | (Recall that with the old representation, $x^TAx = \sum_{i,j=1}^{n} x_ix_jA_{ij}$ involves a sum of $n^2$ terms instead of $n$ terms in the equation above.) With this viewpoint, we can also show that the definiteness of the matrix $A$ depends
1040 | entirely on the sign of its eigenvalues:
1041 | \begin{itemize}
1042 | \item[1.] If all $\lambda_i > 0$, then the matrix $A$ is positive definite because $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 > 0$ for any $\hat{x}\neq 0$.\footnote{Note that $\hat{x}\neq 0\Leftrightarrow x\neq 0$.}
1043 | \item[2.] If all $\lambda_i \geq 0$, it is positive semidefinite because $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \ge 0$ for all $\hat{x}$.
1044 | \item[3.] Likewise, if all $\lambda_i < 0$ or $\lambda_i \leq 0$, then $A$ is
1045 | negative definite or negative semidefinite respectively.
1046 | \item[4.] Finally, if
1047 | $A$ has both positive and negative eigenvalues, say $\lambda_i > 0$ and $\lambda_j < 0$, then it is indefinite. This is because if we let $\hat{x}$ satisfy $\hat{x}_i = 1 $ and $\hat{x}_k =0, \forall k\neq i$, then $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 > 0$. Similarly, we can let $\hat{x}$ satisfy $\hat{x}_j = 1 $ and $\hat{x}_k =0, \forall k\neq j$, then $x^TAx = \sum_{i=1}^n \lambda_i\hat{x}_i^2 < 0$.
\footnote{Note that $x = U\hat{x}$ and therefore constructing $\hat{x}$ gives an implicit construction of $x$. } 1048 | \end{itemize} 1049 | %Suppose $A \in \mathbb{S}^n 1050 | %= U \Lambda U^T$. 1051 | % 1052 | %Then 1053 | %\[x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^n 1054 | %%\lambda_i y_i^2\] 1055 | %where $y = U^T x$ (and since $U$ is full rank, any vector $y \in 1056 | %\mathbb{R}^n$ can be represented in this form). Because $y_i^2$ is 1057 | %always positive, the sign of this expression depends entirely on the 1058 | %$\lambda_i$'s. If all $\lambda_i > 0$, then the matrix is positive 1059 | %definite; if all $\lambda_i \geq 0$, it is positive semidefinite. 1060 | %Likewise, if all $\lambda_i < 0$ or $\lambda_i \leq 0$, then $A$ is 1061 | %negative definite or negative semidefinite respectively. Finally, if 1062 | %$A$ has both positive and negative eigenvalues, it is indefinite. 1063 | 1064 | An application where eigenvalues and eigenvectors come up frequently 1065 | is in maximizing some function of a matrix. In particular, for a 1066 | matrix $A \in \mathbb{S}^n$, consider the following maximization 1067 | problem, 1068 | \begin{align}\mathrm{max}_{x \in \mathbb{R}^n} \;\; x^T A x = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \;\;\;\;\; 1069 | \mbox{subject to } \|x\|_2^2 = 1\label{eqn:4} 1070 | \end{align} 1071 | i.e., we want to find the vector (of norm 1) which maximizes the 1072 | quadratic form. Assuming the eigenvalues are ordered as $\lambda_1 1073 | \geq \lambda_2 \geq \ldots \geq \lambda_n$, the optimal value of this optimization problem is $\lambda_1$ and any eigenvector $u_1$ corresponding to $\lambda_1$ is one of the maximizers. (If $\lambda_1 > \lambda_2$, then there is a unique eigenvector corresponding to eigenvalue $\lambda_1$, which is the unique maximizer of the optimization problem~\eqref{eqn:4}.) 1074 | 1075 | We can show this by using the diagonalization technique: Note that $\|x\|_2 = \|\hat{x}\|_2$ by equation~\eqref{eqn:preserve-norm}, and using equation~\eqref{eqn:diag-quadratic}, we can rewrite the optimization~\eqref{eqn:4} as 1076 | \begin{align} 1077 | \mathrm{max}_{\hat{x} \in \mathbb{R}^n} \;\; \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \;\;\;\;\; 1078 | \mbox{subject to } \|\hat{x}\|_2^2 = 1\label{eqn:5} 1079 | \end{align} 1080 | 1081 | %the optimal $x$ for this 1082 | %optimization problem is $x_1$, the eigenvector corresponding to 1083 | %$\lambda_1$. In this case the maximal value of the quadratic form is 1084 | %$\lambda_1$. Similarly, the optimal solution to the minimization 1085 | %%problem, 1086 | %\[\mathrm{min}_{x \in \mathbb{R}^n} \;\; x^T A x \;\;\;\;\; 1087 | %\mbox{subject to } \|x\|_2^2 = 1\] 1088 | 1089 | Then, we have that the objective is upper bounded by $\lambda_1$: 1090 | \begin{align} 1091 | \hat{x}^T \Lambda \hat{x} = \sum_{i=1}^n \lambda_i\hat{x}_i^2 \le \sum_{i=1}^n \lambda_1 \hat{x}_i^2 = \lambda_1 1092 | \end{align} 1093 | Moreover, setting $\hat{x} = \left [ \begin{array}{c} 1 \\ 0\\ \vdots \\ 0 \end{array} \right ]$ achieves the equality in the equation above, and this corresponds to setting $x = u_1$. 1094 | 1095 | % 1096 | %is $x_n$, the eigenvector corresponding to $\lambda_n$, and the 1097 | %minimal value is $\lambda_n$. This can be proved by appealing to the 1098 | %eigenvector-eigenvalue form of $A$ and the properties of orthogonal 1099 | %matrices. However, in the next section we will see a way of showing 1100 | %it directly using matrix calculus. 
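Before moving on, here is a compact NumPy sketch (illustrative only; \texttt{numpy.linalg.eigh} plays the role of the decomposition $A = U\Lambda U^T$ for symmetric matrices) that checks the diagonalization and the fact that the top eigenvector attains the maximum of the constrained quadratic form in~\eqref{eqn:4}:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = 0.5 * (B + B.T)                 # a random symmetric matrix

lam, U = np.linalg.eigh(A)          # real eigenvalues (ascending), orthonormal columns
assert np.allclose(U @ np.diag(lam) @ U.T, A)   # A = U Lambda U^T
assert np.allclose(U.T @ U, np.eye(5))          # U^T U = I

# x^T A x <= lambda_max for any unit vector x, with equality at the top eigenvector.
x = rng.standard_normal(5)
x = x / np.linalg.norm(x)
u_top = U[:, -1]
assert x @ A @ x <= lam[-1] + 1e-10
assert np.isclose(u_top @ A @ u_top, lam[-1])
\end{verbatim}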
1101 | 1102 | 1103 | \section{Matrix Calculus} 1104 | While the topics in the previous sections are typically covered in a 1105 | standard course on linear algebra, one topic that does not seem to be 1106 | covered very often (and which we will use extensively) is the 1107 | extension of calculus to the vector setting. Despite the fact that 1108 | all the actual calculus we use is relatively trivial, the notation can 1109 | often make things look much more difficult than they are. In this 1110 | section we present some basic definitions of matrix calculus and 1111 | provide a few examples. 1112 | 1113 | \subsection{The Gradient} 1114 | Suppose that $f:\mathbb{R}^{m \times n} \rightarrow \mathbb{R}$ is a 1115 | function that takes as input a matrix $A$ of size $m \times n$ and 1116 | returns a real value. Then the \textbf{\textit{gradient}} of $f$ 1117 | (with respect to $A \in \mathbb{R}^{m \times n}$) is the matrix of 1118 | partial derivatives, defined as: 1119 | \[\nabla_A f(A) \in \mathbb{R}^{m \times n} = \left [ 1120 | \begin{array}{cccc} \frac{\partial f(A)}{\partial A_{11}} & 1121 | \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial 1122 | f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & 1123 | \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial 1124 | f(A)}{\partial A_{2n}} \\ 1125 | \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial 1126 | A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & 1127 | \frac{\partial f(A)}{\partial A_{mn}} \end{array} \right ] \] 1128 | i.e., an $m \times n$ matrix with \[(\nabla_A f(A))_{ij} = 1129 | \frac{\partial f(A)}{\partial A_{ij}}.\] 1130 | Note that the size of $\nabla_A f(A)$ is always the 1131 | same as the size of $A$. So if, in particular, $A$ is just a vector 1132 | $x \in \mathbb{R}^n$, 1133 | \[\nabla_x f(x) = \left [ \begin{array}{c} \frac{\partial 1134 | f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots 1135 | \\ \frac{\partial f(x)}{\partial x_n}\end{array} \right ].\] 1136 | It is very important to remember that the gradient of a function is 1137 | \textit{only} defined if the function is real-valued, that is, if it 1138 | returns a scalar value. We can not, for example, take the gradient of 1139 | $Ax, A \in \mathbb{R}^{n \times n}$ with respect to $x$, since this 1140 | quantity is vector-valued. 1141 | 1142 | It follows directly from the equivalent properties of partial 1143 | derivatives that: 1144 | \begin{itemize} 1145 | \item $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$. 1146 | \item For $t \in \mathbb{R}$, $\nabla_x (t\;f(x)) = t \nabla_x f(x)$. 1147 | \end{itemize} 1148 | 1149 | In principle, gradients are a natural extension of partial derivatives 1150 | to functions of multiple variables. In practice, however, working with 1151 | gradients can sometimes be tricky for notational reasons. 1152 | For example, 1153 | suppose that $A \in \mathbb{R}^{m \times n}$ is a matrix of fixed coefficients and 1154 | suppose that $b \in \mathbb{R}^m$ is a vector of fixed coefficients. 1155 | Let $f : \mathbb{R}^{m} \rightarrow \mathbb{R}$ be the function defined by 1156 | $f(z) = z^T z$, such that $\nabla_z f(z) = 2z$. 1157 | But now, consider the expression, 1158 | \[ \nabla f(Ax). \] 1159 | How should this expression be interpreted? There are at least two possibilities: 1160 | \begin{enumerate} 1161 | \item 1162 | In the first interpretation, recall that $\nabla_z f(z) = 2z$. 
Here, we 1163 | interpret $\nabla f(Ax)$ as evaluating the gradient at the point $Ax$, 1164 | hence, 1165 | \[ \nabla f(Ax) = 2(Ax) = 2Ax \in \mathbb{R}^m. \] 1166 | \item 1167 | In the second interpretation, we consider the quantity $f(Ax)$ as a function of the 1168 | input variables $x$. More formally, let $g(x) = f(Ax)$. Then in this interpretation, 1169 | \[ \nabla f(Ax) = \nabla_x g(x) \in \mathbb{R}^n. \] 1170 | \end{enumerate} 1171 | Here, we can see that these two interpretations are indeed different. One interpretation 1172 | yields an $m$-dimensional vector as a result, while the other interpretation yields an 1173 | $n$-dimensional vector as a result! How can we resolve this? 1174 | 1175 | Here, the key is to make explicit the variables which we are differentiating with respect to. 1176 | In the first case, we are differentiating the function $f$ with respect to its arguments $z$ 1177 | and then substituting the argument $Ax$. In the second case, we are differentiating the 1178 | composite function $g(x) = f(Ax)$ with respect to $x$ directly. We denote the first 1179 | case as $\nabla_z f(Ax)$ and the second case as $\nabla_x f(Ax)$.\footnote{A drawback to this notation 1180 | that we will have to live with is the fact that in the first case, $\nabla_z f(Ax)$ it appears 1181 | that we are differentiating with respect to a variable that does not even appear in the expression 1182 | being differentiated! For this reason, the first case is often written as $\nabla f(Ax)$, and the fact 1183 | that we are differentiating with respect to the arguments of $f$ is understood. However, 1184 | the second case is \emph{always} written as $\nabla_x f(Ax)$.} 1185 | Keeping the notation 1186 | clear is extremely important (as you'll find out in your homework, in fact!). 1187 | 1188 | \subsection{The Hessian} 1189 | Suppose that $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is a function 1190 | that takes a vector in $\mathbb{R}^n$ and returns a real number. Then 1191 | the \textbf{\textit{Hessian}} matrix with respect to $x$, written 1192 | $\nabla_x^2 f(x)$ or simply as $H$ is the $n \times n$ matrix of 1193 | partial derivatives, 1194 | \[\nabla^2_x f(x) \in \mathbb{R}^{n \times n} = \left [ 1195 | \begin{array}{cccc} \frac{\partial^2 f(x)}{\partial x_1^2} & 1196 | \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & 1197 | \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ 1198 | \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & 1199 | \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 1200 | f(x)}{\partial x_2 \partial x_n} \\ 1201 | \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial 1202 | x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial 1203 | x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} 1204 | \end{array} \right ].\] 1205 | In other words, $\nabla_x^2 f(x) \in \mathbb{R}^{n \times n}$, with 1206 | \[(\nabla^2_x f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i 1207 | \partial x_j}.\] 1208 | Note that the Hessian is always symmetric, since 1209 | \[\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 1210 | f(x)}{\partial x_j \partial x_i}.\] 1211 | Similar to the gradient, the Hessian is defined only when $f(x)$ is 1212 | real-valued. 1213 | 1214 | It is natural to think of the gradient as the analogue 1215 | of the first derivative for functions of vectors, and the Hessian as 1216 | the analogue of the second derivative (and the symbols we use also 1217 | suggest this relation). 
This intuition is generally
1218 | correct, but there are a few caveats to keep in mind.
1219 | 
1220 | 
1221 | First, for real-valued functions of one variable $f:\mathbb{R}
1222 | \rightarrow \mathbb{R}$, it is a basic definition that the second
1223 | derivative is the derivative of the first derivative, i.e.,
1224 | \[\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x}
1225 | \frac{\partial}{\partial x} f(x).\]
1226 | However, for functions of a vector, the gradient of the function is a
1227 | vector, and we cannot take the gradient of a vector --- i.e.,
1228 | \[\nabla_x \nabla_x f(x) = \nabla_x \left [ \begin{array}{c} \frac{\partial
1229 | f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots
1230 | \\ \frac{\partial f(x)}{\partial x_n}\end{array} \right ]\]
1231 | and this expression is not defined. Therefore, it is \textit{not} the
1232 | case that the Hessian is the gradient of the gradient. However,
1233 | this is \textit{almost} true, in the following sense: If we look at
1234 | the $i$th entry of the gradient $(\nabla_x f(x))_i = \partial f(x) /
1235 | \partial x_i$, and take the gradient with respect to $x$, we get
1236 | \[\nabla_x \frac{\partial f(x)}{\partial x_i} = \left [
1237 | \begin{array}{c} \frac{\partial^2 f(x)}{\partial x_i \partial x_1}
1238 | \\ \frac{\partial^2 f(x)}{\partial x_i \partial x_2} \\ \vdots \\
1239 | \frac{\partial^2 f(x)}{\partial x_i \partial x_n}\end{array} \right ] \]
1240 | which is the $i$th column (or row) of the Hessian. Therefore,
1241 | \[\nabla_x^2 f(x) = \left [ \begin{array}{cccc} \nabla_x (\nabla_x
1242 | f(x))_1 & \nabla_x (\nabla_x f(x))_2 &
1243 | \cdots & \nabla_x (\nabla_x f(x))_n \end{array} \right ].\]
1244 | If we don't mind being a little bit sloppy, we can say that
1245 | (essentially) $\nabla_x^2 f(x) = \nabla_x (\nabla_x f(x))^T$, so long
1246 | as we understand that this really means taking the gradient of each
1247 | entry of $(\nabla_x f(x))^T$, not the gradient of the whole vector.
1248 | 
1249 | Finally, note that while we can take the gradient with respect to a
1250 | matrix $A \in \mathbb{R}^{n \times n}$, for the purposes of this class we will
1251 | only consider taking the Hessian with respect to a vector $x \in
1252 | \mathbb{R}^n$. This is simply a matter of convenience (and the fact
1253 | that none of the calculations we do require us to find the Hessian
1254 | with respect to a matrix), since the Hessian with respect to a matrix
1255 | would have to represent all the partial derivatives $\partial^2 f(A) /
1256 | (\partial A_{ij} \partial A_{k\ell})$, and it is rather cumbersome to
1257 | represent this as a matrix.
1258 | 
1259 | \subsection{Gradients and Hessians of Quadratic and Linear Functions}
1260 | 
1261 | Now let's try to determine the gradient and Hessian matrices for a few
1262 | simple functions. It should be noted that all the gradients given
1263 | here are special cases of the gradients given in the CS229 lecture
1264 | notes.
1265 | 
1266 | For $x \in \mathbb{R}^n$, let $f(x) = b^T x$ for some known
1267 | vector $b \in \mathbb{R}^n$. Then
1268 | \[f(x) = \sum_{i = 1}^n b_i x_i\]
1269 | so
1270 | \[\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k}
1271 | \sum_{i = 1}^n b_i x_i = b_k.\]
1272 | From this we can easily see that $\nabla_x b^T x = b$. This should be
1273 | compared to the analogous situation in single-variable calculus, where
1274 | $\partial/(\partial x) \; ax = a$.
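Before moving on to quadratic functions, here is a small numerical check of the identity $\nabla_x b^T x = b$ using central differences. This is only an illustrative sketch (it assumes NumPy; the vector $b$ and the evaluation point below are arbitrary).
\begin{verbatim}
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function f at x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

b = np.array([1.0, -2.0, 3.0])
x0 = np.array([0.5, 0.1, -0.7])

print(numerical_gradient(lambda x: b @ x, x0))  # approximately b
print(b)
\end{verbatim}
The same helper can be reused to check the quadratic-form gradient derived next.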
1275 | 
1276 | Now consider the quadratic function $f(x) = x^T A x$ for $A \in
1277 | \mathbb{S}^n$. Remember that
1278 | \[f(x) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j. \]
1279 | To take the partial derivative, we'll consider the terms including $x_k$
1280 | and $x_k^2$ factors separately:
1281 | \begin{eqnarray*}
1282 | \frac{\partial f(x)}{\partial x_k}
1283 | &=& \frac{\partial}{\partial x_k} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j \\
1284 | &=& \frac{\partial}{\partial x_k} \left[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \right] \\
1285 | &=& \sum_{i \neq k} A_{ik} x_i + \sum_{j \neq k} A_{kj} x_j + 2 A_{kk} x_k \\
1286 | &=& \sum_{i=1}^n A_{ik} x_i + \sum_{j=1}^n A_{kj} x_j = 2 \sum_{i=1}^n A_{ki} x_i,
1287 | \end{eqnarray*}
1288 | where the last equality follows since $A$ is symmetric (which we can
1289 | safely assume, since it is appearing in a quadratic form).
1290 | Note that the $k$th entry of $\nabla_x f(x)$ is just the inner
1291 | product of the $k$th row of $A$ and $x$. Therefore, $\nabla_x x^T A
1292 | x = 2Ax$. Again, this should remind you of the analogous fact in
1293 | single-variable calculus, that $\partial/(\partial x)\; ax^2 = 2ax$.
1294 | 
1295 | Finally, let's look at the Hessian of the quadratic function $f(x) =
1296 | x^T A x$ (it should be obvious that the Hessian of a linear function
1297 | $b^T x$ is zero). In this case,
1298 | \[\frac{\partial^2 f(x)}{\partial x_k \partial x_\ell} =
1299 | \frac{\partial}{\partial x_k} \left[ \frac{\partial f(x)}{\partial x_\ell} \right] =
1300 | \frac{\partial}{\partial x_k} \left[ 2 \sum_{i=1}^n A_{\ell i} x_i \right] = 2 A_{\ell k} = 2 A_{k \ell}.\]
1301 | Therefore, it should be clear that $\nabla_x^2 x^T A x = 2 A$, which
1302 | should be entirely expected (and again analogous to the
1303 | single-variable fact that $\partial^2/(\partial x^2)\;ax^2 = 2a$).
1304 | 
1305 | To recap,
1306 | \begin{itemize}
1307 | \item $\nabla_x b^T x = b$
1308 | \item $\nabla_x x^T A x = 2Ax$ (if $A$ symmetric)
1309 | \item $\nabla_x^2 x^T A x = 2A$ (if $A$ symmetric)
1310 | \end{itemize}
1311 | 
1312 | \subsection{Least Squares}
1313 | 
1314 | Let's apply the equations we obtained in the last section to
1315 | derive the least squares equations. Suppose we are given a matrix $A
1316 | \in \mathbb{R}^{m \times n}$ (for simplicity we assume $A$ is full
1317 | rank) and a vector $b \in \mathbb{R}^m$ such that $b \not \in
1318 | \mathcal{R}(A)$. In this situation we will not be able to find a
1319 | vector $x \in \mathbb{R}^n$ such that $Ax = b$, so instead we want to
1320 | find a vector $x$ such that $Ax$ is as close as possible to $b$, as
1321 | measured by the square of the Euclidean norm $\|Ax - b\|_2^2$.
1322 | 
1323 | Using the fact that $\|x\|_2^2 = x^T x$, we have
1324 | \begin{eqnarray*}
1325 | \|Ax - b\|_2^2 & = & (Ax - b)^T(Ax - b) \\
1326 | & = & x^T A^T A x - 2b^T Ax + b^T b.
1327 | \end{eqnarray*}
1328 | Taking the gradient with respect to $x$, and using the
1329 | properties we derived in the previous section, we have
1330 | \begin{eqnarray*}
1331 | \nabla_x (x^T A^T A x - 2b^T Ax + b^T b) & = & \nabla_x x^T A^T A x -
1332 | \nabla_x 2b^T Ax + \nabla_x b^T b \\
1333 | & = & 2 A^T A x - 2 A^T b.
1334 | \end{eqnarray*}
1335 | Setting this last expression equal to zero and solving for $x$ gives
1336 | the normal equations
1337 | \[x = (A^T A)^{-1}A^T b,\]
1338 | which is the same as what we derived in class.
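As a brief numerical illustration (a sketch assuming NumPy; the random data below is purely for demonstration), the normal-equations solution agrees with a library least-squares solver:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))          # tall, (generically) full-rank matrix
b = rng.standard_normal(20)               # generically not in the range of A

# Solve A^T A x = A^T b rather than forming the inverse explicitly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solution for comparison.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))     # True
\end{verbatim}
In practice, numerically stable routines based on a QR or SVD factorization (as used by \texttt{np.linalg.lstsq}) are generally preferred over forming $A^T A$ when the problem is ill-conditioned.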
1339 | 
1340 | \subsection{Gradients of the Determinant}
1341 | Now let's consider a situation where we find the gradient of a function
1342 | with respect to a matrix, namely for $A \in \mathbb{R}^{n \times n}$,
1343 | we want to find $\nabla_A |A|$. Recall from our discussion of
1344 | determinants that
1345 | \[ |A| = \sum_{i=1}^n (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus
1346 | j}| \;\;\;\;\;\mbox{(for any $j \in \{1,\ldots, n\}$)} \]
1347 | so
1348 | \[\frac{\partial}{\partial A_{k \ell}}|A| = \frac{\partial}{\partial
1349 | A_{k \ell}}\sum_{i=1}^n (-1)^{i+j} A_{ij}
1350 | |A_{\setminus i, \setminus j}| = (-1)^{k+\ell} |A_{\setminus k, \setminus
1351 | \ell}| = (\mathrm{adj}(A))_{\ell k}.\]
1352 | From this it immediately follows from the properties of the adjoint
1353 | that
1354 | \[\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T}.\]
1355 | 
1356 | Now let's consider the function $f:\mathbb{S}^n_{++} \rightarrow
1357 | \mathbb{R}$, $f(A) = \log |A|$. Note that we have to restrict the
1358 | domain of $f$ to be the positive definite matrices, since this ensures
1359 | that $|A| > 0$, so that the log of $|A|$ is a real number. In this
1360 | case we can use the chain rule (nothing fancy, just the ordinary chain
1361 | rule from single-variable calculus) to see that
1362 | \[\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log
1363 | |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} =
1364 | \frac{1}{|A|}\frac{\partial |A|}{\partial A_{ij}}.\]
1365 | From this it should be obvious that
1366 | \[\nabla_A \log |A| = \frac{1}{|A|}\nabla_A |A| = A^{-1},\]
1367 | where we can drop the transpose in the last expression because $A$ is
1368 | symmetric. Note the similarity to the single-variable case, where
1369 | $\partial/(\partial x)\; \log x = 1/x$.
1370 | 
1371 | \subsection{Eigenvalues as Optimization}
1372 | 
1373 | Finally, we use matrix calculus to solve an optimization problem in a
1374 | way that leads directly to eigenvalue/eigenvector analysis. Consider
1375 | the following equality-constrained optimization problem:
1376 | \[\mathrm{max}_{x \in \mathbb{R}^n} \;\; x^T A x \;\;\;\;\;
1377 | \mbox{subject to } \|x\|_2^2 = 1\]
1378 | for a symmetric matrix $A \in \mathbb{S}^{n}$.
1379 | A standard way of solving optimization problems with equality
1380 | constraints is by forming the \textbf{\textit{Lagrangian}}, an
1381 | objective function that includes the equality
1382 | constraints.\footnote{Don't worry if you haven't seen Lagrangians
1383 | before, as we will cover them in greater detail later in CS229.}
1384 | The Lagrangian in this case can be given by
1385 | \[\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x\]
1386 | where $\lambda$ is called the Lagrange multiplier associated with the
1387 | equality constraint. It can be established that for $x^*$ to be an
1388 | optimal point of the problem, the gradient of the Lagrangian has to be
1389 | zero at $x^*$ (this is not the only condition, but it is required).
1390 | That is,
1391 | \[\nabla_x \mathcal{L}(x,\lambda) = \nabla_x(x^T A x - \lambda x^T x)
1392 | = 2 A^T x - 2 \lambda x = 0.\]
1393 | Notice that this is just the linear equation $Ax = \lambda x$. This
1394 | shows that the only points which can possibly maximize (or minimize)
1395 | $x^T A x$ assuming $x^T x = 1$ are the eigenvectors of $A$.
1396 | 
1397 | \end{document}
1398 | 
1399 | 
--------------------------------------------------------------------------------