├── .DS_Store
├── .gitignore
├── Adversarial_Attack_and_Training
│   └── intriguing_properties_of_NN.tex
├── Artificial_General_Intelligence
│   ├── ER_AML.tex
│   └── continual_learning.tex
├── Basic_Machine_Learning
│   ├── fundamental_algo.tex
│   ├── introduction.tex
│   └── notation.tex
├── LICENSE
├── Neural_Networks
│   └── kowledge_distillation.tex
├── README.md
├── imgs
│   ├── .DS_Store
│   ├── continual_learning
│   │   ├── cl_1.png
│   │   ├── cl_2.png
│   │   └── cl_3.png
│   ├── fundamental_algo
│   │   ├── algo_1.png
│   │   ├── algo_2.png
│   │   ├── algo_3.png
│   │   ├── algo_4.png
│   │   └── algo_5.png
│   ├── introduction
│   │   └── intro_1.png
│   └── notation
│       ├── notation_1.png
│       ├── notation_2.png
│       ├── notation_3.png
│       └── notation_4.png
├── machine_learning.pdf
├── machine_learning.tex
└── references.bib

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# LaTeX auxiliary files
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2
*.lb
*.bbl
*.blg
*.brf
*.idx
*.ilg
*.ind
*.loa
*.glo
*.gls
*.ist
*.acn
*.acr
*.alg
*.glg
*.glsdefs
*.xdy

# LaTeX intermediate files
*.dvi
*.xdv
*.fdb_latexmk
*.synctex.gz
*.synctex(busy)
*.synctex.gz(busy)
*.pdfsync

# LaTeX backup files
*~

# Other files to ignore
*.ps
*.eps
*.cls
*.sty
--------------------------------------------------------------------------------
/Adversarial_Attack_and_Training/intriguing_properties_of_NN.tex:
--------------------------------------------------------------------------------
\chapter{Intriguing Properties of Neural Networks}

Deep neural networks, known for their exceptional performance in speech and visual recognition tasks, exhibit two notable characteristics \citep{szegedy2013intriguing}. First, \textit{the semantic information in their higher layers is embedded not in individual units but in the collective space they form}. This insight shifts the focus from analyzing single neurons to considering entire groups of units in order to understand network processing. Second, \textit{these networks display a surprising sensitivity to minute, yet precisely tailored alterations (perturbations)}. Such small changes can lead to incorrect outcomes. This vulnerability is not due to random noise: the same modifications can deceive different networks, trained on different subsets of the dataset, into misclassifying the same input.

\section{Introduction}

Deep neural networks are powerful learning models that achieve excellent performance on visual and speech recognition problems because they can express arbitrary computation consisting of a modest number of massively parallel nonlinear steps. As the resulting computation is automatically discovered by backpropagation via supervised learning, it can be difficult to interpret and can have counter-intuitive properties.

The \textbf{first} property concerns the semantic meaning of individual units. Contrary to prior belief, it seems that it is the entire space of activations, rather than the individual units, that contains the bulk of the semantic information. The \textbf{second} property concerns the stability of neural networks with respect to small perturbations of their inputs. By applying an \textit{imperceptible} non-random perturbation to a test image, it is possible to arbitrarily change the network's prediction. These perturbations are found by optimizing the input to maximize the prediction error.
The perturbed examples are often called ``adversarial examples''.

\section{Framework}
\textbf{Notation.} $x\in \mathbb{R}^{m}$ denotes an input image, and $\phi(x)$ denotes the activation values of some layer. \citet{szegedy2013intriguing} first examine properties of the image of $\phi(x)$, and then search for its blind spots.
--------------------------------------------------------------------------------
/Artificial_General_Intelligence/ER_AML.tex:
--------------------------------------------------------------------------------
\chapter{New Insights on Reducing Abrupt Representation Change in Online Continual Learning}

Experience Replay (ER), where a small subset of past data is stored and replayed alongside new data, has emerged as a simple and effective learning strategy. The authors focus on the change in the representations of observed data that arises when previously unobserved classes appear in the incoming data stream.

\section{Introduction}
--------------------------------------------------------------------------------
/Artificial_General_Intelligence/continual_learning.tex:
--------------------------------------------------------------------------------
\chapter{Continual Learning: An Overview}
\section{Introduction}
\textbf{Continual Learning} is motivated by the fact that humans and other organisms have the ability to adapt, accumulate, and exploit knowledge. A common setting for continual learning is to learn a sequence of contents one by one and behave as if they were observed simultaneously \citep{wang2023comprehensive}. Each task learned throughout the lifetime can be a new skill, new examples of an old skill, a different environment, etc. (Fig.\ref{fig:cl_1}, a). This attribute of continual learning is why it is also referred to as \textbf{incremental learning} or \textbf{lifelong learning}.
Unlike the conventional pipeline, where joint training is applied, continual learning is characterized by learning from dynamic data distributions. A major challenge is known as \textbf{catastrophic forgetting}, where \textit{adaptation to a new distribution generally results in a largely reduced ability to capture the old ones}. This dilemma is a facet of the trade-off between \textbf{learning plasticity} and \textbf{memory stability}: an excess of the former interferes with the latter, and vice versa. A good continual learning algorithm should also obtain strong \textbf{generalizability} to accommodate distribution differences within and between tasks (Fig.\ref{fig:cl_1}, b). As a naive baseline, retraining on all old training samples (if allowed) makes it easy to address the above challenges, but creates huge computational and storage overheads (as well as potential privacy issues). In fact, continual learning is primarily intended to ensure \textbf{resource efficiency} of model updates, preferably close to learning only the new training samples.

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{imgs/continual_learning/cl_1.png}
\caption{A conceptual framework of continual learning. \textbf{a}, Continual learning requires adapting to incremental tasks with dynamic data distributions. \textbf{b}, A desirable solution should ensure a proper balance between stability (red arrow) and plasticity (green arrow), as well as an adequate generalizability to intra-task (blue arrow) and inter-task (orange arrow) distribution differences.
\textbf{c}, Representative strategies have targeted various aspects of machine learning.}
\label{fig:cl_1}
\end{figure}

Numerous efforts have been devoted to addressing the above challenges, which can be conceptually separated into five groups (Fig.\ref{fig:cl_1}, c): \textit{regularization-based approaches}; \textit{replay-based approaches}; \textit{optimization-based approaches}; \textit{representation-based approaches}; and \textit{architecture-based approaches}. These methods are \textit{closely connected}, e.g., regularization and replay ultimately act to rectify the gradient directions, and \textit{highly synergistic}, e.g., the efficacy of replay can be facilitated by distilling knowledge from the old model.

\section{Setup}
In this section, we first present a basic formulation of continual learning. Then we introduce typical scenarios and evaluation metrics.

\subsection{Basic Formulation}
A continual learning model parameterized by \(\theta\) needs to learn the corresponding task(s) with no or limited access to old training samples and perform well on their test sets. Formally, an incoming batch of training samples belonging to a task \(t\) can be represented as \(\mathcal{D}_{t, b}=\left\{\mathcal{X}_{t, b}, \mathcal{Y}_{t, b}\right\}\), where \(\mathcal{X}_{t, b}\) is the input data, \(\mathcal{Y}_{t, b}\) is the data label, \(t \in \mathcal{T}=\{1, \cdots, k\}\) is the task identity, and \(b \in \mathcal{B}_{t}\) is the batch index (\(\mathcal{T}\) and \(\mathcal{B}_{t}\) denote their spaces, respectively). Here we define a ``task'' by its training samples \(\mathcal{D}_{t}\) following the distribution \(\mathbb{D}_{t}:=p\left(\mathcal{X}_{t}, \mathcal{Y}_{t}\right)\) (\(\mathcal{D}_{t}\) denotes the entire training set obtained by omitting the batch index; likewise for \(\mathcal{X}_{t}\) and \(\mathcal{Y}_{t}\)), and assume that there is no difference in distribution between training and testing.
Under realistic constraints, the data label \(\mathcal{Y}_{t}\) and the task identity \(t\) might not always be available. In continual learning, the training samples of each task can arrive incrementally in batches (i.e., \(\left\{\left\{\mathcal{D}_{t, b}\right\}_{b \in \mathcal{B}_{t}}\right\}_{t \in \mathcal{T}}\)) or simultaneously (i.e., \(\left\{\mathcal{D}_{t}\right\}_{t \in \mathcal{T}}\)).

\begin{table}[H]
\centering
\renewcommand{\arraystretch}{1.75}
\resizebox{\textwidth}{!}{
\begin{tabular}{c|c|c}
\hline
\textbf{Scenario} & \textbf{Training} & \textbf{Testing} \\
\hline IIL & \(\left\{\left\{\mathcal{D}_{t, b}, t\right\}_{b \in \mathcal{B}_{t}}\right\}_{t=j}\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t=j}\); \(t\) is not required \\
\hline DIL & \(\left\{\mathcal{D}_{t}, t\right\}_{t \in \mathcal{T}}\); \(p\left(\mathcal{X}_{i}\right) \neq p\left(\mathcal{X}_{j}\right)\) and \(\mathcal{Y}_{i}=\mathcal{Y}_{j}\) for \(i \neq j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t \in \mathcal{T}}\); \(t\) is not required \\
\hline TIL & \(\left\{\mathcal{D}_{t}, t\right\}_{t \in \mathcal{T}}\); \(p\left(\mathcal{X}_{i}\right) \neq p\left(\mathcal{X}_{j}\right)\) and \(\mathcal{Y}_{i} \cap \mathcal{Y}_{j}=\emptyset\) for \(i \neq j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t \in \mathcal{T}}\); \(t\) is available \\
\hline CIL & \(\left\{\mathcal{D}_{t}, t\right\}_{t \in \mathcal{T}}\); \(p\left(\mathcal{X}_{i}\right) \neq p\left(\mathcal{X}_{j}\right)\) and \(\mathcal{Y}_{i} \cap \mathcal{Y}_{j}=\emptyset\) for \(i \neq j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t \in \mathcal{T}}\); \(t\) is unavailable \\
\hline TFCL & \(\left\{\left\{\mathcal{D}_{t, b}\right\}_{b \in \mathcal{B}_{t}}\right\}_{t \in \mathcal{T}}\); \(p\left(\mathcal{X}_{i}\right) \neq p\left(\mathcal{X}_{j}\right)\) and \(\mathcal{Y}_{i} \cap \mathcal{Y}_{j}=\emptyset\) for \(i \neq j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t \in \mathcal{T}}\); \(t\) is optionally available \\
\hline OCL & \(\left\{\left\{\mathcal{D}_{t, b}\right\}_{b \in \mathcal{B}_{t}}\right\}_{t \in \mathcal{T}}, |b|=1\); \(p\left(\mathcal{X}_{i}\right) \neq p\left(\mathcal{X}_{j}\right)\) and \(\mathcal{Y}_{i} \cap \mathcal{Y}_{j}=\emptyset\) for \(i \neq j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t \in \mathcal{T}}\); \(t\) is optionally available \\
\hline BBCL & \(\left\{\mathcal{D}_{t}, t\right\}_{t \in \mathcal{T}}\); \(p\left(\mathcal{X}_{i}\right) \neq p\left(\mathcal{X}_{j}\right)\), \(\mathcal{Y}_{i} \neq \mathcal{Y}_{j}\) and \(\mathcal{Y}_{i} \cap \mathcal{Y}_{j} \neq \emptyset\) for \(i \neq j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t \in \mathcal{T}}\); \(t\) is unavailable \\
\hline CPT & \(\left\{\mathcal{D}_{t}^{pt}, t\right\}_{t \in \mathcal{T}^{pt}}\), followed by a downstream task \(j\) & \(\left\{p\left(\mathcal{X}_{t}\right)\right\}_{t=j}\); \(t\) is not required \\
\hline
\end{tabular}}
\caption{A formal comparison of typical continual learning scenarios. \(\mathcal{D}_{t, b}\): the training samples of task \(t\) and batch \(b\). \(|b|\): the size of batch \(b\). \(\mathcal{B}_{t}\): the space of incremental batches belonging to task \(t\). \(\mathcal{D}_{t}\): the training set of task \(t\) (further specified as \(\mathcal{D}_{t}^{pt}\) for pre-training). \(\mathcal{T}\): the space of all incremental tasks (further specified as \(\mathcal{T}^{pt}\) for pre-training). \(\mathcal{X}_{t}\): the input data in \(\mathcal{D}_{t}\). \(p\left(\mathcal{X}_{t}\right)\): the distribution of \(\mathcal{X}_{t}\). \(
\mathcal{Y}_{t}\): the data label of \(\mathcal{X}_{t}\).}
\label{table:4.1}
\end{table}

\subsection{Typical Scenarios}

Details of some typical continual learning scenarios (refer to Table \ref{table:4.1} for a formal comparison):
\begin{itemize}
\item \textit{Instance-Incremental Learning} (IIL): All training samples belong to the same task and arrive in batches.
\item \textit{Domain-Incremental Learning} (DIL): Tasks have the same data label space but different input distributions. Task identities are not required.
\item \textbf{\textit{Task-Incremental Learning}} (TIL): Tasks have disjoint data label spaces. Task identities are provided in both training and testing.
\item \textbf{\textit{Class-Incremental Learning}} (CIL): Tasks have disjoint data label spaces. Task identities are only provided in training.
\item \textit{Task-Free Continual Learning} (TFCL): Tasks have disjoint data label spaces. Task identities are not provided in either training or testing.
\item \textit{Online Continual Learning} (OCL): Tasks have disjoint data label spaces. Training samples for each task arrive as a one-pass data stream.
\item \textit{Blurred Boundary Continual Learning} (BBCL): Task boundaries are blurred, characterized by distinct but overlapping data label spaces.
\item \textit{Continual Pre-training} (CPT): Pre-training data arrives in sequence. The goal is to improve the performance of learning downstream tasks.
\end{itemize}

Many of the above-mentioned scenarios are messy; hence, we will focus on the two most popular scenarios: Task-Incremental Learning and Class-Incremental Learning.

\subsection{Evaluation Metrics}
\textbf{Overall performance} is typically evaluated by \textit{average accuracy} (AA) and \textit{average incremental accuracy} (AIA).
Let \(a_{k, j} \in[0,1]\) denote the classification accuracy evaluated on the test set of the \(j\)-th task after incremental learning of the \(k\)-th task \((j \leq k)\). The output space to compute \(a_{k, j}\) consists of the classes in either \(\mathcal{Y}_{j}\) or \(\cup_{i=1}^{k} \mathcal{Y}_{i}\), corresponding to the use of multi-head evaluation (e.g., TIL) or single-head evaluation (e.g., CIL). The two metrics at the \(k\)-th task are then defined as
\begin{equation*}
\mathrm{AA}_{k}=\frac{1}{k} \sum_{j=1}^{k} a_{k, j}
\end{equation*}
AA represents the overall performance at the current moment.

\begin{equation*}
\mathrm{AIA}_{k}=\frac{1}{k} \sum_{i=1}^{k} \mathrm{AA}_{i}
\end{equation*}
AIA reflects the historical variation.

\textbf{Memory stability} can be evaluated by the \textit{forgetting measure} (FM) and \textit{backward transfer} (BWT). As for the former, the forgetting of a task is calculated by the difference between its maximum performance obtained in the past and its current performance:
\begin{equation*}
f_{j, k}=\max _{i \in\{1, \ldots, k-1\}}\left(a_{i, j}-a_{k, j}\right), \quad \forall j<k
\end{equation*}
--------------------------------------------------------------------------------
/Basic_Machine_Learning/notation.tex:
--------------------------------------------------------------------------------
\[ \mathcal{S}^{\prime} \leftarrow\left\{x^{2} \mid x \in \mathcal{S}, x>3\right\} \]
This notation is used to define a derived set creation operator. It means that we create a new set \( \mathcal{S}^{\prime} \) by including the square of each element \( x \) from the set \( \mathcal{S} \), under the condition that \( x \) is greater than 3. In other words, \( \mathcal{S}^{\prime} \) comprises the squares of all elements in \( \mathcal{S} \) that are greater than 3.

Additionally, the cardinality operator \( |\mathcal{S}| \) denotes the number of elements in the set \( \mathcal{S} \). For example, if \( \mathcal{S} = \{1, 2, 4, 5\} \), then \( \mathcal{S}^{\prime} = \{16, 25\} \), as only 4 and 5 from \( \mathcal{S} \) satisfy the condition \( x > 3 \). The \textbf{cardinality} \( |\mathcal{S}| \) in this case would be 4.
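The derived-set and cardinality operators map directly onto set comprehensions; a minimal sketch in Python (the variable names are illustrative):

```python
# Derived set: S' = {x^2 | x in S, x > 3}
S = {1, 2, 4, 5}
S_prime = {x ** 2 for x in S if x > 3}

print(S_prime)  # {16, 25}
print(len(S))   # cardinality |S| = 4
```

Only 4 and 5 pass the filter `x > 3`, so their squares form the derived set, while `len` plays the role of the cardinality operator.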
\subsection{Operations on Vectors}

\textbf{Vector Addition and Subtraction:}
The sum and difference of two vectors \( \mathbf{x} \) and \( \mathbf{z} \) are defined component-wise as:
\[ \mathbf{x} + \mathbf{z} = \left[x^{(1)} + z^{(1)}, \ldots, x^{(m)} + z^{(m)}\right] \]
\[ \mathbf{x} - \mathbf{z} = \left[x^{(1)} - z^{(1)}, \ldots, x^{(m)} - z^{(m)}\right] \]
\emph{Example:} For \( \mathbf{x} = [1, 2] \) and \( \mathbf{z} = [3, 4] \),
\[ \mathbf{x} + \mathbf{z} = [1+3, 2+4] = [4, 6] \]

\textbf{Scalar Multiplication:}
A vector multiplied by a scalar \( c \) results in a scaled vector:
\[ \mathbf{x} c = \left[c x^{(1)}, \ldots, c x^{(m)}\right] \]
\emph{Example:} For \( \mathbf{x} = [1, 2] \) and \( c = 3 \),
\[ \mathbf{x} c = [3 \times 1, 3 \times 2] = [3, 6] \]

\textbf{Dot Product:}
The dot product of two vectors \( \mathbf{w} \) and \( \mathbf{x} \) is a scalar:
\[ \mathbf{w} \mathbf{x} = \sum_{i=1}^{m} w^{(i)} x^{(i)} \]
\emph{Example:} For \( \mathbf{w} = [1, 2] \) and \( \mathbf{x} = [3, 4] \),
\[ \mathbf{w} \mathbf{x} = 1 \times 3 + 2 \times 4 = 3 + 8 = 11 \]

\textbf{Matrix-Vector Multiplication:}
Multiplying a matrix \( \mathbf{W} \) by a vector \( \mathbf{x} \) yields another vector.
For example:
$$
\begin{aligned}
\mathbf{W} \mathbf{x} & =\left[\begin{array}{lll}
w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\
w^{(2,1)} & w^{(2,2)} & w^{(2,3)}
\end{array}\right]\left[\begin{array}{l}
x^{(1)} \\
x^{(2)} \\
x^{(3)}
\end{array}\right] \\
& \stackrel{\text { def }}{=}\left[\begin{array}{l}
w^{(1,1)} x^{(1)}+w^{(1,2)} x^{(2)}+w^{(1,3)} x^{(3)} \\
w^{(2,1)} x^{(1)}+w^{(2,2)} x^{(2)}+w^{(2,3)} x^{(3)}
\end{array}\right] \\
& =\left[\begin{array}{l}
\mathbf{w}^{(1)} \mathbf{x} \\
\mathbf{w}^{(2)} \mathbf{x}
\end{array}\right]
\end{aligned}
$$

\emph{Example:} For
\[ \mathbf{W} = \left[\begin{array}{ll}
1 & 2 \\
3 & 4
\end{array}\right] \text{ and } \mathbf{x} = \left[\begin{array}{l}
5 \\
6
\end{array}\right], \]
\[ \mathbf{W} \mathbf{x} = \left[\begin{array}{l}
1 \times 5 + 2 \times 6 \\
3 \times 5 + 4 \times 6
\end{array}\right] = \left[\begin{array}{l}
17 \\
39
\end{array}\right] \]

\textbf{Transpose and Multiplication:}
For the transpose of a vector \( \mathbf{x} \), denoted \( \mathbf{x}^{\top} \), and a matrix \( \mathbf{W} \), the multiplication \( \mathbf{x}^{\top} \mathbf{W} \) is given by:
$$
\begin{aligned}
\mathbf{x}^{\top} \mathbf{W} & =\left[\begin{array}{ll}
x^{(1)} & x^{(2)}
\end{array}\right]\left[\begin{array}{lll}
w^{(1,1)} & w^{(1,2)} & w^{(1,3)} \\
w^{(2,1)} & w^{(2,2)} & w^{(2,3)}
\end{array}\right] \\
& \stackrel{\text { def }}{=}\left[w^{(1,1)} x^{(1)}+w^{(2,1)} x^{(2)}, w^{(1,2)} x^{(1)}+w^{(2,2)} x^{(2)}, w^{(1,3)} x^{(1)}+w^{(2,3)} x^{(2)}\right]
\end{aligned}
$$

\emph{Example:} For
\[ \mathbf{x} = \left[\begin{array}{l}
7 \\
8
\end{array}\right] \text{ and } \mathbf{W} = \left[\begin{array}{lll}
1 & 2 & 3 \\
4 & 5 & 6
\end{array}\right], \]
\[ \mathbf{x}^{\top} \mathbf{W} = \left[\begin{array}{lll}
7 \times 1 + 8 \times 4, 7 \times 2 + 8 \times 5, 7 \times 3 + 8 \times 6
\end{array}\right] = \left[\begin{array}{lll}
39, 54, 69
\end{array}\right] \]

\subsection{Functions}

\textbf{Definition of a Function}\\
A function is a relation that associates each element \( x \) of a set \( \mathcal{X} \), known as the domain, to a single element \( y \) of another set \( \mathcal{Y} \), known as the codomain. This relation is denoted as \( y = f(x) \), where \( f \) is the name of the function, \( x \) is the input or argument, and \( y \) is the output. The input variable is also referred to as the variable of the function.

\emph{Example:} Consider the function \( f(x) = x^2 \) defined on the domain \( \mathcal{X} = \mathbb{R} \). For \( x = 2 \), the output is \( f(2) = 2^2 = 4 \).

\textbf{Local and Global Minima}\\
The function \( f(x) \) has a local minimum at \( x = c \) if \( f(x) \geq f(c) \) for every \( x \) in an open interval around \( c \). An open interval, such as \( (0,1) \), includes all numbers between its endpoints but not the endpoints themselves. The smallest value among all local minima is known as the global minimum.

\emph{Example:} In the function \( f(x) = (x-1)^2 \), the local (and global) minimum occurs at \( x = 1 \) since \( f(x) \geq f(1) = 0 \) for all \( x \).

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{imgs/notation/notation_3.png}
\caption{A local and a global minimum of a function.}
\label{fig:notation_3}
\end{figure}

\textbf{Vector Functions}\\
A vector function, denoted \( \mathbf{y} = \mathbf{f}(x) \), is a function that returns a vector \( \mathbf{y} \). Its argument can be either a vector or a scalar.
\emph{Example:} For the vector function \( \mathbf{f}(x) = [x, x^2] \), with \( x = 2 \), the output is \( \mathbf{f}(2) = [2, 2^2] = [2, 4] \).

\subsection{Max and Arg Max}

Given a set of values \( \mathcal{A} = \{a_{1}, a_{2}, \ldots, a_{n}\} \), the operator \( \max_{a \in \mathcal{A}} f(a) \) returns the highest value of \( f(a) \) over all elements in the set \( \mathcal{A} \). Conversely, the operator \( \arg \max_{a \in \mathcal{A}} f(a) \) identifies the specific element \( a \) in the set \( \mathcal{A} \) that maximizes the function \( f(a) \).

In cases where the set is implicit or infinite, we can use the notation \( \max_{a} f(a) \) or \( \arg \max_{a} f(a) \), respectively. Similarly, the operators \( \min \) and \( \arg \min \) work in a comparable way, determining the lowest value of a function and the element that minimizes it, respectively.

\subsection{Assignment Operator}
The expression \(a \leftarrow f(x)\) means that the variable \(a\) gets the new value: the result of \(f(x)\). We say that the variable \(a\) gets assigned a new value. Similarly, \(\mathbf{a} \leftarrow\left[a_{1}, a_{2}\right]\) means that the vector variable \(\mathbf{a}\) gets the two-dimensional vector \(\left[a_{1}, a_{2}\right]\).

\subsection{Derivative and Gradient}
A \textbf{derivative} \(f^{\prime}\) of a function \(f\) is a function or a value that describes how fast \(f\) grows (or decreases). If the derivative \(f^{\prime}\) is a function, then the function \(f\) can grow at a different pace in different regions of its domain.

We can use the \textbf{chain rule} when we encounter a hard-to-differentiate function. For instance, if \(F(x)=f(g(x))\), where \(f\) and \(g\) are some functions, then \(F^{\prime}(x)= f^{\prime}(g(x)) g^{\prime}(x)\).

The \textbf{gradient} is the generalization of the derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure).
A gradient of a function is a vector of \textbf{partial derivatives}. For example, let \(f\left(\left[x^{(1)}, x^{(2)}\right]\right)=a x^{(1)}+b x^{(2)}+c\); then the partial derivative of the function \(f\) \textit{with respect to} \(x^{(1)}\), denoted \(\frac{\partial f}{\partial x^{(1)}}\), is given by
$$
\frac{\partial f}{\partial x^{(1)}}=a+0+0=a
$$
where \(a\) is the derivative of the function \(a x^{(1)}\); the two zeroes are respectively the derivatives of \(b x^{(2)}\) and \(c\), because \(x^{(2)}\) is considered constant when we compute the derivative with respect to \(x^{(1)}\), and the derivative of any constant is zero. Similarly, the partial derivative of \(f\) with respect to \(x^{(2)}\), \(\frac{\partial f}{\partial x^{(2)}}\), is given by
$$
\frac{\partial f}{\partial x^{(2)}}=0+b+0=b
$$

The gradient of the function \(f\), denoted \(\nabla f\), is given by the vector \(\left[\frac{\partial f}{\partial x^{(1)}}, \frac{\partial f}{\partial x^{(2)}}\right]\).

\section{Random Variable}

A \textbf{random variable}, usually written as an italic letter, like \(X\), is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: \textbf{discrete} and \textbf{continuous}. A \textbf{discrete random variable} takes on only a countable number of distinct values such as \textit{red}, \textit{yellow}, \textit{blue} or \(1, 2, 3, \ldots\)

\begin{figure}[H]
\centering
\includegraphics[width=0.7\linewidth]{imgs/notation/notation_4.png}
\caption{A probability mass function and a probability density function.}
\label{fig:notation_4}
\end{figure}
The \textbf{probability distribution} of a discrete random variable is described by a list of probabilities associated with each of its possible values.
This list of probabilities is called a \textbf{probability mass function} (pmf) (Fig.\ref{fig:notation_4}, \textbf{a}).

A \textbf{continuous random variable} takes an infinite number of possible values in some interval. The probability distribution of a continuous random variable (a continuous probability distribution) is described by a \textbf{probability density function} (pdf) (Fig.\ref{fig:notation_4}, \textbf{b}).

Let a discrete random variable \(X\) have \(k\) possible values \(\left\{x_{i}\right\}_{i=1}^{k}\). The \textbf{expectation} of \(X\), denoted \(\mathbb{E}[X]\), is given by

\begin{equation}
\begin{aligned}
\mathbb{E}[X] & \stackrel{\text { def }}{=} \sum_{i=1}^{k}\left[x_{i} \cdot \operatorname{Pr}\left(X=x_{i}\right)\right] \\ & =x_{1} \cdot \operatorname{Pr}\left(X=x_{1}\right)+x_{2} \cdot \operatorname{Pr}\left(X=x_{2}\right)+\cdots+x_{k} \cdot \operatorname{Pr}\left(X=x_{k}\right)
\end{aligned}
\label{notation:1}
\end{equation}

where \(\operatorname{Pr}\left(X=x_{i}\right)\) is the probability that \(X\) has the value \(x_{i}\) according to the pmf.
The expectation of a random variable is also called the \textbf{mean}, \textbf{average} or \textbf{expected value}, and is frequently denoted with the letter \(\mu\).

The \textbf{standard deviation} is defined as
$$
\sigma \stackrel{\text { def }}{=} \sqrt{\mathbb{E}\left[(X-\mu)^{2}\right]}
$$
\textbf{Variance}, denoted \(\sigma^{2}\) or \(\operatorname{var}(X)\), is defined as
$$
\sigma^{2}=\mathbb{E}\left[(X-\mu)^{2}\right]
$$
For a discrete random variable, the standard deviation is given by:
$$
\sigma=\sqrt{\operatorname{Pr}\left(X=x_{1}\right)\left(x_{1}-\mu\right)^{2}+\operatorname{Pr}\left(X=x_{2}\right)\left(x_{2}-\mu\right)^{2}+\cdots+\operatorname{Pr}\left(X=x_{k}\right)\left(x_{k}-\mu\right)^{2}}
$$

The expectation of a continuous random variable \(X\) is given by
\begin{equation}
\mathbb{E}[X] \stackrel{\text { def }}{=} \int_{\mathbb{R}} x f_{X}(x) d x
\label{notation:2}
\end{equation}

where \(f_{X}\) is the pdf of the variable \(X\) and \(\int_{\mathbb{R}}\) is the integral of the function \(x f_{X}\) over \(\mathbb{R}\).
\begin{tcolorbox}[enhanced jigsaw, breakable, pad at break*=1mm, colback=gray!20!white, colframe=black!85!black, title=\textbf{Real-Life Examples of Probability Concepts}]

\textbf{Standard Deviation of a Discrete Random Variable:}
Consider a dice game where you roll a six-sided die. Each face represents a different prize amount in dollars: \{1, 2, 3, 4, 5, 6\}. The probability of each outcome is \( \frac{1}{6} \) for a fair die.
\textit{Mean Calculation:}
The mean (\( \mu \)) or expected value of your winnings per roll is:
\[ \mu = \frac{1+2+3+4+5+6}{6} = 3.5 \, \text{dollars} \]

\textit{Standard Deviation Calculation:}
The standard deviation \( \sigma \) is given by:
\[ \sigma = \sqrt{\sum_{i=1}^{6} \left(\frac{1}{6}\right) \times (i - 3.5)^2} \]

\textbf{Expectation of a Continuous Random Variable:}
Imagine the waiting time for a bus, which follows a continuous uniform distribution between 0 and 1 hour.

\textit{Expectation Calculation:}
The expectation of the waiting time, \( \mathbb{E}[X] \), is calculated as:
$$
\mathbb{E}[X]=\int_{0}^{1} x \, d x=\left[\frac{1}{2} x^{2}\right]_{0}^{1}=\frac{1}{2}\left(1^{2}\right)-\frac{1}{2}\left(0^{2}\right)=\frac{1}{2}
$$
The result of this integral is \( \frac{1}{2} \) hour, indicating the average waiting time.
\end{tcolorbox}
The property of the pdf that the area under its curve is 1 mathematically means that \(\int_{\mathbb{R}} f_{X}(x) d x=1\). Most of the time we don't know \(f_{X}\), but we can observe some values of \(X\). In machine learning, we call these values \textbf{examples}, and the collection of these examples is called a \textbf{sample} or a \textbf{dataset}.

\section{Unbiased Estimator}
Because \(f_{X}\) is usually unknown, but we have a sample \(S_{X}=\left\{x_{i}\right\}_{i=1}^{N}\), we often content ourselves not with the true values of statistics of the probability distribution, such as the expectation, but with their \textbf{unbiased estimators}.
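The die example above gives a concrete way to see unbiasedness numerically: average the sample mean over many independent samples and it should land on the true mean \( \mu = 3.5 \). A minimal simulation sketch (the sample size and sample count are arbitrary choices):

```python
import random

random.seed(0)

# True distribution: the fair six-sided die from the box above; true mean mu = 3.5.
def draw_sample(n):
    return [random.randint(1, 6) for _ in range(n)]

# The sample mean (1/N) * sum(x_i) is an unbiased estimator of E[X]:
# averaging it over many independent samples should approach mu.
sample_means = [sum(s) / len(s) for s in (draw_sample(10) for _ in range(20000))]
avg_of_means = sum(sample_means) / len(sample_means)

print(avg_of_means)  # close to 3.5
```

Each individual sample mean fluctuates around 3.5, but their average converges to it, which is exactly the property the definition below formalizes.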
We say that \(\hat{\theta}\left(S_{X}\right)\) is an unbiased estimator of some statistic \(\theta\) calculated using a sample \(S_{X}\) drawn from an unknown probability distribution if \(\hat{\theta}\left(S_{X}\right)\) has the following property:
$$
\mathbb{E}\left[\hat{\theta}\left(S_{X}\right)\right]=\theta
$$
where \(\hat{\theta}\) is a \textbf{sample statistic}, obtained using a sample \(S_{X}\), and not the real statistic \(\theta\) that can be obtained only by knowing \(X\); the expectation is taken over all possible samples drawn from \(X\). Intuitively, this means that if you could have an unlimited number of such samples as \(S_{X}\), and you computed some unbiased estimator, such as \(\hat{\mu}\), using each sample, then the average of all these \(\hat{\mu}\) would equal the real statistic \(\mu\) that you would get computed on \(X\).

It can be shown that an unbiased estimator of an unknown \(\mathbb{E}[X]\) (Eq.\ref{notation:1} or Eq.\ref{notation:2}) is given by \(\frac{1}{N} \sum_{i=1}^{N} x_{i}\) (called in statistics the \textbf{sample mean}).

\section{Bayes' Rule}
The conditional probability \(\operatorname{Pr}(X=x \mid Y=y)\) is the probability of the random variable \(X\) having a specific value \(x\) given that another random variable \(Y\) has a specific value \(y\). \textbf{Bayes' Rule} (also known as \textbf{Bayes' Theorem}) stipulates that:
$$
\operatorname{Pr}(X=x \mid Y=y)=\frac{\operatorname{Pr}(Y=y \mid X=x) \operatorname{Pr}(X=x)}{\operatorname{Pr}(Y=y)}
$$

\section{Parameter Estimation}
Bayes' Rule comes in handy when we have a model of \(X\)'s distribution, and this model \(f_{\theta}\) is a function that has some parameters in the form of a vector \(\theta\).
An example of such a function is the Gaussian function, which has two parameters, $\mu$ and $\sigma$, and is defined as: 281 | \begin{equation} 282 | f_{\boldsymbol{\theta}}(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}} 283 | \label{notation:3} 284 | \end{equation} 285 | where \(\boldsymbol{\theta} \stackrel{\text { def }}{=}[\mu, \sigma]\) and \(\pi\) is the constant \((3.14159 \ldots)\). 286 | 287 | This function has all the properties of a pdf. In fact, it is the pdf of one of the probability distributions most frequently used in practice, called the \textbf{Gaussian distribution} or \textbf{normal distribution} and denoted \(\mathcal{N}\left(\mu, \sigma^{2}\right)\). Therefore, we can use it as a model of the unknown distribution of \(X\), and we can update the values of the parameters in the vector \(\theta\) from the data using Bayes' Rule. 288 | 289 | \begin{equation} 290 | \operatorname{Pr}(\theta=\hat{\theta} \mid X=x) \leftarrow \frac{\operatorname{Pr}(X=x \mid \theta=\hat{\theta}) \operatorname{Pr}(\theta=\hat{\theta})}{\operatorname{Pr}(X=x)}=\frac{\operatorname{Pr}(X=x \mid \theta=\hat{\theta}) \operatorname{Pr}(\theta=\hat{\theta})}{\sum_{\tilde{\theta}} \operatorname{Pr}(X=x \mid \theta=\tilde{\theta}) \operatorname{Pr}(\theta=\tilde{\theta})} 291 | \label{eq:bayes_prediction} 292 | \end{equation} 293 | 294 | where \(\operatorname{Pr}(X=x \mid \theta=\hat{\theta}) \stackrel{\text { def }}{=} f_{\hat{\theta}}(x)\). If we have a sample \(\mathcal{S}\) of \(X\) and the set of possible values for \(\theta\) is finite, we can easily estimate \(\operatorname{Pr}(\theta=\hat{\theta})\) by applying Bayes' Rule iteratively, one example \(x \in \mathcal{S}\) at a time. The initial values \(\operatorname{Pr}(\theta=\hat{\theta})\) can be guessed such that \(\sum_{\hat{\theta}} \operatorname{Pr}(\theta=\hat{\theta})=1\). This guess of the probabilities for different \(\hat{\theta}\) is called the \textbf{prior}. 
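When the candidate set for \(\theta\) is finite, this iterative estimation is easy to carry out numerically. Below is a minimal sketch (assuming NumPy; the candidate grid, the known \(\sigma = 1\), and the simulated data from \(\mathcal{N}(2, 1)\) are all hypothetical choices) that applies Bayes' Rule one example at a time, using the previous posterior as the prior for the next update, a common simplification of the update scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite candidate set for theta = mu (sigma assumed known).
thetas = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
posterior = np.full(len(thetas), 1 / len(thetas))  # uniform prior

def likelihood(x, mu, sigma=1.0):
    """Gaussian pdf f_theta(x) with parameters mu and sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Simulated sample from the "true" distribution N(mu=2, sigma=1).
sample = rng.normal(loc=2.0, scale=1.0, size=200)

for x in sample:
    # Bayes' Rule over the finite set: posterior ∝ likelihood × prior.
    unnorm = likelihood(x, thetas) * posterior
    posterior = unnorm / unnorm.sum()

# The candidate maximizing the posterior is the MAP estimate.
map_theta = thetas[np.argmax(posterior)]
print(map_theta)
```

After 200 examples the posterior concentrates almost entirely on the candidate nearest the true mean, so `map_theta` comes out as 2.0 for this simulated sample; with a continuous parameter space one would instead maximize the (log-)posterior with a numerical optimizer such as gradient descent.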
295 | 296 | First, we compute \(\operatorname{Pr}\left(\theta=\hat{\theta} \mid X=x_{1}\right)\) for all possible values \(\hat{\theta}\). Then, before updating \(\operatorname{Pr}(\theta=\hat{\theta} \mid X=x)\) once again, this time for \(x=x_{2} \in \mathcal{S}\) using Eq.\ref{eq:bayes_prediction}, we replace the prior \(\operatorname{Pr}(\theta=\hat{\theta})\) in Eq.\ref{eq:bayes_prediction} by the new estimate \(\operatorname{Pr}(\theta=\hat{\theta}) \leftarrow \frac{1}{N} \sum_{x \in \mathcal{S}} \operatorname{Pr}(\theta=\hat{\theta} \mid X=x)\). 297 | 298 | The optimal parameters \(\theta^{*}\) given the sample \(\mathcal{S}\) are obtained using the principle of \textbf{maximum a posteriori} (or MAP): 299 | \begin{equation} 300 | \theta^{*}=\underset{\hat{\theta}}{\arg \max } \prod_{i=1}^{N} \operatorname{Pr}\left(\theta=\hat{\theta} \mid X=x_{i}\right) 301 | \label{maximum a posteriori} 302 | \end{equation} 303 | If the set of possible values for $\theta$ isn't finite, then we need to optimize Eq. \ref{maximum a posteriori} directly using a numerical optimization routine, such as gradient descent. Usually, we optimize the natural logarithm of the right-hand side expression in Eq. \ref{maximum a posteriori} because the logarithm of a product becomes a sum of logarithms, and it's easier for the machine to work with a sum than with a product. 304 | 305 | \section{Parameters vs. Hyperparameters} 306 | 307 | A \textit{hyperparameter} is a property of a learning algorithm, usually (but not always) having a numerical value. That value influences the way the algorithm works. Hyperparameters aren't learned by the algorithm itself from data; they have to be set by the data analyst before running the algorithm. \textit{Parameters} are variables that define the model learned by the learning algorithm. Parameters are directly modified by the learning algorithm based on the training data. 
The goal of learning is to find such values of the parameters that make the model optimal in a certain sense. 308 | 309 | \section{Classification vs. Regression} 310 | \textbf{Classification} is the problem of automatically assigning a \textbf{label} to an \textbf{unlabeled example}; spam detection is a classic example. The classification problem is solved by a \textbf{classification learning algorithm} that takes a collection of \textbf{labeled examples} as inputs and produces a \textbf{model} that can take an unlabeled example as input and either directly output a label or output a number that the analyst can use to deduce the label. An example of such a number is a probability. 311 | 312 | In a classification problem, a label is a member of a finite set of \textbf{classes}. If the size of the set of classes is two, the problem is \textbf{binary} (or \textbf{binomial}) \textbf{classification}. \textbf{Multiclass classification} (also called \textbf{multinomial}) is a classification problem with three or more classes. 313 | 314 | \textbf{Regression} is the problem of predicting a real-valued label (often called a \textbf{target}) given an unlabeled example. The regression problem is solved by a \textbf{regression learning algorithm} that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and output a target. 315 | 316 | \section{Model-Based vs. Instance-Based Learning} 317 | Most supervised learning algorithms are model-based. \textit{Model-based learning algorithms} use the training data to create a \textbf{model} with \textbf{parameters} learned from the training data. \textit{Instance-based learning algorithms} use the whole dataset as the model. One instance-based algorithm frequently used in practice is \textbf{k-Nearest Neighbors} (kNN). 318 | 319 | \section{Shallow vs. Deep Learning} 320 | A \textbf{shallow learning} algorithm learns the parameters of the model directly from the features of the training examples. 
A notable exception is the family of \textbf{neural network} learning algorithms, specifically those that build networks with more than one \textbf{layer} between input and output. Such neural networks are called \textbf{deep neural networks}. In deep neural network learning (or, simply, \textbf{deep learning}), contrary to shallow learning, most model parameters are learned not directly from the features of the training examples, but from the outputs of the preceding layers. 321 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Jue Guo 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Neural_Networks/kowledge_distillation.tex: -------------------------------------------------------------------------------- 1 | \chapter{Distilling the Knowledge in a Neural Network} 2 | This chapter is a direct and indirect reference to \cite{hinton2015distilling}. 3 | 4 | A simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. Unfortunately, making predictions using a whole ensemble of models is often too computationally expensive to allow deployment. \textit{It is shown that we can distill the knowledge of an ensemble of models into a single model.} The authors also introduce a new type of ensemble composed of one or more full models and many specialist models that learn to distinguish fine-grained classes the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. 5 | 6 | \section{Introduction} 7 | 8 | In large-scale machine learning, we typically use very similar models for the training stage and the deployment stage despite their very different requirements. The authors draw an analogy to insects, which have a larval form optimized for extracting energy and nutrients from the environment and an adult form optimized for traveling and reproduction; it suggests that we should likewise be willing to train very cumbersome models if that makes it easier to extract structure from the data. The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout. 
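The central mechanism of the paper is to soften the cumbersome model's output distribution by raising the temperature of its softmax, so that a small student model can be trained on these soft targets, which encode how the teacher generalizes. A minimal sketch of the idea (the 4-class logits and the temperature below are hypothetical values, and NumPy is assumed):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; larger T gives a softer distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for a 4-class problem.
teacher_logits = np.array([6.0, 2.0, 1.0, -1.0])

hard = softmax(teacher_logits, T=1.0)  # nearly one-hot
soft = softmax(teacher_logits, T=4.0)  # exposes similarity structure

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -np.sum(p * np.log(q))

print(hard.round(3))
print(soft.round(3))
```

At T=1 nearly all the probability mass sits on the top class, while at T=4 the relative magnitudes of the remaining logits become visible; training the student to minimize this loss (typically mixed with the usual hard-label loss) is how the knowledge transfer is carried out.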
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # A Comprehensive Note on Machine Learning 2 | 3 | ## Introduction 4 | 5 | This document serves as an educational resource primarily designed for the courses I teach, including: 6 | 7 | - **CSE 474/574 Introduction to Machine Learning** 8 | - **CSE 455/555 Pattern Recognition** 9 | - **CSE 676 Deep Learning** 10 | 11 | Additionally, it functions as a personal reference in the field of machine learning. 12 | 13 | ## Purpose and Use 14 | 15 | This compilation aims to provide an extensive overview and guide for students and practitioners of machine learning, incorporating a blend of **direct** references and **adaptations** from established texts and sources. It is intended for: 16 | 17 | - **Educational Aid**: As a supplementary resource for teaching and learning. 18 | - **Reference Material**: For personal and academic use. 19 | 20 | This document is **not authorized** for commercial use, redistribution, or sale without explicit consent. To read the PDF version of the document: [PDF](https://github.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/blob/main/machine_learning.pdf) 21 | 22 | ## Community Contributions 23 | I will update this note regularly. To motivate my students to read it, I offer **3 bonus points** for significant and meaningful contributions to each chapter of the note. I welcome contributions and feedback from the community! If you have suggestions, corrections, or additional material you think would enhance this resource, please feel free to contribute. Here's how you can do that: 24 | 25 | - **Fork the Repository**: Create your own fork of the project. 26 | - **Make Changes**: Add your contributions or modifications. 27 | - **Submit a Pull Request**: Open a pull request to the original repository with a clear list of what you've done. 
28 | - **Review & Merge**: I will review your changes and merge them into the main document as appropriate. 29 | 30 | Details on how to create a pull request: [Creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) 31 | 32 | When contributing, please adhere to the following guidelines: 33 | 34 | - Ensure that any added content is accurate and relevant to machine learning. 35 | - Respect intellectual property and cite sources appropriately. 36 | - Maintain a respectful and constructive tone in discussions and pull requests. 37 | 38 | ## Installation of LaTeX 39 | 40 | LaTeX is a high-quality typesetting system; it includes features designed for the production of technical and scientific documentation. My preferred setup for writing LaTeX documents is Visual Studio Code (VSCode) with the LaTeX Workshop extension, offering a user-friendly and efficient LaTeX writing experience. Below is a guide to help you set up this environment. 41 | 42 | ### 1. Install LaTeX Distribution 43 | To begin with, you need a LaTeX distribution installed on your computer. 44 | 45 | - **Windows**: Use MiKTeX or TeX Live. MiKTeX is more user-friendly for beginners, while TeX Live is more comprehensive. Download from [MiKTeX](https://miktex.org/) or [TeX Live](https://www.tug.org/texlive/). 46 | - **macOS**: Install MacTeX, which is a macOS version of TeX Live with additional tools. Download from [MacTeX](http://www.tug.org/mactex/). 47 | - **Linux**: Most Linux distributions include TeX Live in their package repositories. Install it using your package manager (for example, `sudo apt-get install texlive-full` in Ubuntu). 48 | 49 | ### 2. Install Visual Studio Code (VSCode) 50 | VSCode is a free, open-source editor with a wide array of features. 51 | 52 | - Download and install VSCode from the [official website](https://code.visualstudio.com/). 53 | 54 | ### 3. 
Install LaTeX Workshop Extension in VSCode 55 | LaTeX Workshop enhances VSCode with LaTeX typesetting capabilities. 56 | 57 | - Open VSCode. 58 | - Navigate to Extensions (`Ctrl+Shift+X` / `Cmd+Shift+X`). 59 | - Search for "LaTeX Workshop" and install the extension. 60 | 61 | ### 4. Configure LaTeX Workshop 62 | Customize LaTeX Workshop settings for your needs. 63 | 64 | - Go to `File > Preferences > Settings` (`Code > Preferences > Settings` on macOS). 65 | - Search for "LaTeX Workshop" settings. 66 | - Configure according to your preferences, like setting up a default compiler or enabling auto-build on save. 67 | 68 | ### 5. Start Writing LaTeX 69 | Now, you are ready to write LaTeX documents. 70 | 71 | - Create a new file with a `.tex` extension in VSCode. 72 | - Write your LaTeX content. 73 | - Use the build feature in LaTeX Workshop to compile your document into a PDF. 74 | 75 | ### 6. Additional Tools and Tips 76 | - **Git Integration**: VSCode's integrated support for Git is beneficial for version controlling your LaTeX documents. 77 | - **Live Preview**: LaTeX Workshop supports live preview of your document. 78 | - **Custom Snippets**: Create custom snippets for frequently used LaTeX commands to improve efficiency. 79 | 80 | This setup with VSCode and LaTeX Workshop provides a powerful, modern environment for writing and managing LaTeX documents, blending LaTeX's typesetting capabilities with the features of a contemporary code editor. 81 | 82 | 83 | ## Primary Sources 84 | 85 | The majority of the material referenced in this document comes from the following key sources: 86 | 87 | 1. Zhang, Aston, et al. "Dive into Deep Learning." Cambridge University Press, 2023. 88 | 2. Bishop, C. M., & Nasrabadi, N. M. "Pattern Recognition and Machine Learning" (Vol. 4, No. 4, p. 738). New York: Springer, 2006. 89 | 3. Hart, P. E., Stork, D. G., & Duda, R. O. "Pattern Classification." Hoboken: Wiley, 2000. 90 | 4. Burkov, A. 
"The Hundred-Page Machine Learning Book" (Vol. 1, p. 32). Quebec City, QC, Canada: Andriy Burkov, 2019. 91 | 5. Burkov, A. "Machine Learning Engineering" (Vol. 1). Montreal, QC, Canada: True Positive Incorporated, 2020. 92 | 93 | 94 | ## Additional References 95 | 96 | All other referenced materials and sources are cited in the bibliography section of this document. 97 | 98 | ## Contact Information 99 | 100 | For inquiries, permissions, or further information, please reach out to me at ( jueguo@buffalo.edu ). 101 | 102 | ## Disclaimer 103 | 104 | This document is provided "as is," and the author makes no representations or warranties, express or implied, regarding its completeness, accuracy, or reliability. 105 | -------------------------------------------------------------------------------- /imgs/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/.DS_Store -------------------------------------------------------------------------------- /imgs/continual_learning/cl_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/continual_learning/cl_1.png -------------------------------------------------------------------------------- /imgs/continual_learning/cl_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/continual_learning/cl_2.png -------------------------------------------------------------------------------- /imgs/continual_learning/cl_3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/continual_learning/cl_3.png -------------------------------------------------------------------------------- /imgs/fundamental_algo/algo_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/fundamental_algo/algo_1.png -------------------------------------------------------------------------------- /imgs/fundamental_algo/algo_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/fundamental_algo/algo_2.png -------------------------------------------------------------------------------- /imgs/fundamental_algo/algo_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/fundamental_algo/algo_3.png -------------------------------------------------------------------------------- /imgs/fundamental_algo/algo_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/fundamental_algo/algo_4.png -------------------------------------------------------------------------------- /imgs/fundamental_algo/algo_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/fundamental_algo/algo_5.png 
-------------------------------------------------------------------------------- /imgs/introduction/intro_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/introduction/intro_1.png -------------------------------------------------------------------------------- /imgs/notation/notation_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/notation/notation_1.png -------------------------------------------------------------------------------- /imgs/notation/notation_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/notation/notation_2.png -------------------------------------------------------------------------------- /imgs/notation/notation_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/notation/notation_3.png -------------------------------------------------------------------------------- /imgs/notation/notation_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/imgs/notation/notation_4.png -------------------------------------------------------------------------------- /machine_learning.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/COD1995/A-Comprehensive-Note-on-Machine-Learning/068268fc12c612b2922acc181608104e570cf8f5/machine_learning.pdf -------------------------------------------------------------------------------- /machine_learning.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{book} 2 | \usepackage{mathpazo} 3 | \usepackage{geometry} 4 | \usepackage{titlesec} 5 | \usepackage{amsmath} 6 | \usepackage{amsfonts} 7 | \usepackage{graphicx,subcaption} 8 | \usepackage{float} 9 | \usepackage{tikz,lipsum,lmodern} 10 | \usepackage[most]{tcolorbox} 11 | \usepackage[colorlinks=true,linkcolor=blue, citecolor=blue]{hyperref} 12 | \usepackage[authoryear,round]{natbib} 13 | 14 | % Adjust spacing for chapters 15 | \titlespacing*{\chapter}{0pt}{-50pt}{20pt} 16 | 17 | % Adjust spacing for sections 18 | \titlespacing*{\section}{0pt}{20pt}{10pt} 19 | 20 | 21 | % Remove paragraph indentation and add space between paragraphs 22 | \setlength{\parindent}{0pt} 23 | \setlength{\parskip}{\baselineskip} 24 | 25 | \geometry{left=1.25in, right=1in, top=1in, bottom=1in, footskip=0.5in} 26 | 27 | \titleformat{\chapter}[display] 28 | {\normalfont\Large\bfseries}{\chaptertitlename\ \thechapter}{20pt}{\Large} 29 | \titleformat{\section} 30 | {\normalfont\large\bfseries}{\thesection}{1em}{} 31 | 32 | \title{A Comprehensive Note on Machine Learning\footnote{This content is intended only for \textbf{educational} and \textbf{personal} use; \textbf{do not share} without my approval.}} 33 | \author{Jue Guo} 34 | \date{\today} 35 | 36 | \begin{document} 37 | \frontmatter 38 | \maketitle 39 | \tableofcontents 40 | \chapter*{Preface} 41 | 42 | The primary purpose of this document is educational, specifically for the courses I teach (\textbf{CSE 474/574 Introduction to Machine Learning, CSE 455/555 Pattern Recognition, and CSE 676 Deep Learning}), as well as for my personal reference. 
A substantial portion of the material herein is directly referenced or adapted from established texts and sources, and is not claimed as original content. This document is intended as a supplementary teaching and learning resource and is not authorized for commercial use, redistribution, or sale without my explicit consent. 43 | 44 | The majority of the material referenced comes from the following sources: 45 | 46 | { 47 | \renewcommand{\labelenumi}{[\theenumi]} 48 | \setcounter{enumi}{0} % Start enumeration from 1 49 | 50 | \begin{enumerate} 51 | \item Zhang, Aston, et al. Dive into deep learning. Cambridge University Press, 2023. 52 | \item Bishop, C. M., \& Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 4, No. 4, p. 738). New York: Springer. 53 | \item Hart, P. E., Stork, D. G., \& Duda, R. O. (2000). Pattern classification. Hoboken: Wiley. 54 | \item Burkov, A. (2019). The hundred-page machine learning book (Vol. 1, p. 32). Quebec City, QC, Canada: Andriy Burkov. 55 | \item Burkov, A. (2020). Machine learning engineering (Vol. 1). Montreal, QC, Canada: True Positive Incorporated. \footnote{This is a birthday present from my lab member and good friend Peiyao Xiao} 56 | \end{enumerate} 57 | } 58 | 59 | All other referenced materials and sources are cited in the bibliography section of this document. This compilation is intended to provide a comprehensive overview and guide for students and practitioners of machine learning, drawing upon a wide range of foundational and contemporary sources in the field. 
60 | 61 | \mainmatter 62 | \part{Basic Machine Learning} 63 | \include{Basic_Machine_Learning/introduction} 64 | \include{Basic_Machine_Learning/notation} 65 | \include{Basic_Machine_Learning/fundamental_algo} 66 | 67 | \part{Advanced Machine Learning} 68 | \part{Neural Networks} 69 | \include{Neural_Networks/kowledge_distillation} 70 | \part{Convolutional Neural Networks} 71 | 72 | \part{Adversarial Attacks and Training} 73 | \include{Adversarial_Attack_and_Training/intriguing_properties_of_NN} 74 | 75 | \part{Recurrent Neural Networks} 76 | 77 | \part{Transformers} 78 | 79 | \part{Artificial General Intelligence} 80 | \include{Artificial_General_Intelligence/continual_learning} 81 | \include{Artificial_General_Intelligence/ER_AML} 82 | 83 | \backmatter 84 | \bibliographystyle{plainnat} 85 | \bibliography{references} 86 | 87 | \end{document} 88 | -------------------------------------------------------------------------------- /references.bib: -------------------------------------------------------------------------------- 1 | @article{wang2023comprehensive, 2 | title = {A comprehensive survey of continual learning: Theory, method and application}, 3 | author = {Wang, Liyuan and Zhang, Xingxing and Su, Hang and Zhu, Jun}, 4 | journal = {arXiv preprint arXiv:2302.00487}, 5 | year = {2023} 6 | } 7 | @article{szegedy2013intriguing, 8 | title = {Intriguing properties of neural networks}, 9 | author = {Szegedy, Christian and Zaremba, Wojciech and Sutskever, Ilya and Bruna, Joan and Erhan, Dumitru and Goodfellow, Ian and Fergus, Rob}, 10 | journal = {arXiv preprint arXiv:1312.6199}, 11 | year = {2013} 12 | } 13 | 14 | @misc{hinton2015distilling, 15 | title = {Distilling the Knowledge in a Neural Network}, 16 | author = {Geoffrey Hinton and Oriol Vinyals and Jeff Dean}, 17 | year = {2015}, 18 | eprint = {1503.02531}, 19 | archiveprefix = {arXiv}, 20 | primaryclass = {stat.ML} 21 | } --------------------------------------------------------------------------------