├── notes ├── Neural Tangent kernels - Jacot et al.zip ├── Neural_Tangent_kernels___Jacot_et_al.pdf ├── du_et_al.pdf ├── du_et_al.tex ├── low_rank_jac │ ├── .gitignore │ ├── Makefile │ ├── amartya_ltx.sty │ ├── low_rank_jac_thm.tex │ └── refs.bib └── low_rank_jac_thm.pdf ├── papers ├── 1805.00915.pdf ├── 1806.07572.pdf ├── 1808.09372.pdf ├── 1810.02054.pdf ├── 1810.09665.pdf ├── 1810.12065.pdf ├── 1811.03804.pdf ├── 1811.03962.pdf ├── 1811.04918.pdf ├── 1811.08888.pdf ├── 1812.07956.pdf ├── 1812.10004.pdf ├── 1901.08572.pdf ├── 1901.08584.pdf ├── 1902.01384.pdf ├── 1902.04760.pdf ├── 1902.06720.pdf ├── 1904.11955.pdf ├── 1905.03684.pdf ├── 1905.05095.pdf ├── 1905.10337.pdf ├── 1905.10843.pdf ├── 1905.12173.pdf ├── 1905.13210.pdf ├── 1905.13654.pdf ├── 1906.01930.pdf ├── 1906.05392.pdf ├── 1906.05827.pdf ├── 1906.06247.pdf ├── 1906.06321.pdf ├── 1906.08034.pdf └── 1911.00809.pdf └── readme.md /notes/Neural Tangent kernels - Jacot et al.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/Neural Tangent kernels - Jacot et al.zip -------------------------------------------------------------------------------- /notes/Neural_Tangent_kernels___Jacot_et_al.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/Neural_Tangent_kernels___Jacot_et_al.pdf -------------------------------------------------------------------------------- /notes/du_et_al.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/du_et_al.pdf -------------------------------------------------------------------------------- /notes/du_et_al.tex: -------------------------------------------------------------------------------- 1 | \documentclass[11pt]{article} 2 | 3 | % Estilo del documento 4 | \usepackage[utf8]{inputenc} % Lets you write accents with áéíóú etc 5 | \usepackage[T1]{fontenc} % Lets you write UTF-8 chars in the code 6 | \setlength{\headheight}{14.0pt} % Removes fancy header warning (Not sure what it does) 7 | \usepackage{geometry} % To edit margins and their format 8 | \usepackage[english]{babel} % language 9 | \usepackage{indentfirst} % First paragraph of each section / subsection 10 | \usepackage[linktocpage]{hyperref} % References inside the document and hyperrefs out of it 11 | \usepackage{url} % url colors and so 12 | \hypersetup{colorlinks=true, urlcolor=blue} 13 | \usepackage{graphicx} % to include images, Gull page: http://en.wikibooks.org/wiki/LaTeX/Floats,_Figures_and_Captions 14 | \usepackage[export]{adjustbox} % Images layout (e.g. 
lets you put right, left in the includegraphix) 15 | \usepackage{listings} % Show code 16 | \usepackage{fancyhdr} % Headers y footers 17 | \usepackage{multicol} % http://stackoverflow.com/questions/1491717/how-to-display-a-content-in-two-column-layout-in-latex 18 | \usepackage{blindtext} % For the cool paragraph (Enter after the paragraph section) 19 | \usepackage{textcomp} 20 | \usepackage{bussproofs} 21 | \usepackage{enumitem} % To enum with letters and other things 22 | \usepackage{leftidx} % left superindices 23 | \usepackage{euscript} % Fancy A and S for symmetry groups (among other things) 24 | \usepackage{dsfont} 25 | 26 | % Math packages 27 | \usepackage{amsmath} % General maths 28 | \usepackage{amsthm} % theorems, propositions... 29 | \usepackage{amssymb} % symbols, arrows... 30 | \usepackage{amsrefs} % Automatically formatted bibliography 31 | \usepackage{mathrsfs} % Very flamboyant letters 32 | %\usepackage{stmaryrd} % Square brackets for semantics 33 | \usepackage{bussproofs} 34 | 35 | \usepackage{xparse} 36 | 37 | \usepackage{color} 38 | % Colors 39 | \definecolor{mygreen}{rgb}{0,0.6,0} 40 | \definecolor{mygray}{rgb}{0.8,0.8,0.8} 41 | \definecolor{mymauve}{rgb}{0.58,0,0.82} 42 | 43 | %Others 44 | \usepackage{nag} % Warning for deprecated methods 45 | 46 | % Document style 47 | \geometry{margin=3cm} % 48 | \geometry{a4paper} % 49 | %\setlength{\parindent}{1.5em} % First line indentation 50 | \setlength{\parskip}{0.5\baselineskip} % Paragraph separation 51 | \setcounter{tocdepth}{2} % Table of contents until subsection 52 | 53 | % amsthm style definitions 54 | \theoremstyle{plain} 55 | \newtheorem{thm}{Theorem}[section] 56 | \newtheorem{prop}[thm]{Proposition} 57 | \newtheorem{lemma}[thm]{Lemma} 58 | \newtheorem{condition}[thm]{Condition} 59 | \newtheorem{corol}[thm]{Corollary} 60 | 61 | \newtheorem{tma}{Teorema}[section] 62 | \newtheorem{prob}[thm]{Problem} 63 | \newtheorem{lema}[tma]{Lema} 64 | \newtheorem{corolario}[tma]{Corolario} 65 | 66 | \theoremstyle{definition} 67 | \newtheorem{example}{Example} 68 | \newtheorem{remark}[thm]{Remark} 69 | \newtheorem*{exer}{Exercise} 70 | \newtheorem{pr}{Proof} 71 | \newtheorem{defi}[thm]{Definition} 72 | 73 | \newtheorem{ejem}{Ejemplo} 74 | \newtheorem{obs}{Observación} 75 | \newtheorem*{ejer}{Ejercicio} 76 | \newtheorem{demo}{Demostración} 77 | \newtheorem{definicion}[thm]{Definición} 78 | 79 | % Tikz's shit 80 | \usepackage{tikz} % To draw cats automatas etc etc 81 | \usetikzlibrary{automata} % 82 | \usetikzlibrary{arrows} % Different types of arrows (e.g. 
inclusion) 83 | 84 | \usetikzlibrary[shapes.arrows] 85 | \usetikzlibrary{shapes.geometric} 86 | \usetikzlibrary{backgrounds} 87 | \usetikzlibrary{positioning} 88 | \usetikzlibrary{calc} 89 | \usetikzlibrary{intersections} 90 | \usetikzlibrary{fadings} 91 | \usetikzlibrary{decorations.footprints} 92 | \usetikzlibrary{patterns} 93 | \usetikzlibrary{shapes.callouts} 94 | \usetikzlibrary{fit} 95 | 96 | % Tikz Settings 97 | \tikzset{->, >=stealth', shorten >=1pt, auto, node distance=1cm, semithick, baseline=(current bounding box.center)} 98 | 99 | % Listing 100 | \lstset{ 101 | columns=fullflexible, 102 | backgroundcolor=\color{white}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor} 103 | basicstyle=\ttfamily, % the size of the fonts that are used for the code 104 | breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace 105 | breaklines=true, % sets automatic line breaking 106 | captionpos=b, % sets the caption-position to bottom 107 | commentstyle=\color{mygreen}, % comment style 108 | %deletekeywords={...}, % if you want to delete keywords from the given language 109 | inputencoding=utf8, 110 | %escapeinside={\%*}{*)}, % if you want to add LaTeX within your code 111 | extendedchars=true, % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8 112 | literate= {á}{{\'a}}1 {é}{{\'e}}1 {í}{{\'i}}1 {ó}{{\'o}}1 {ú}{{\'u}}1 {ñ}{{\~n}}1 113 | {Á}{{\'A}}1 {É}{{\'E}}1 {Í}{{\'I}}1 {Ó}{{\'O}}1 {Ú}{{\'U}}1 {Ñ}{{\~N}}1 114 | {_}{{\_}}1 {^}{{\textasciicircum}}1, 115 | frame=single, % adds a frame around the code 116 | keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible) 117 | keywordstyle=\color{blue}, % keyword style 118 | language=C++, % the language of the code 119 | morekeywords={ll,ii,vi,vii,vvi,vll,mii,ld,point,vect,line,circle,polygon, each}, 120 | % if you want to add more keywords to the set 121 | numbers=left, % where to put the line-numbers; possible values are (none, left, right) 122 | numbersep=5pt, % how far the line-numbers are from the code 123 | numberstyle=\tiny\color{mygray}, % the style that is used for the line-numbers 124 | rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here)) 125 | showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces' 126 | showstringspaces=false, % underline spaces within strings only 127 | showtabs=false, % show tabs within strings adding particular underscores 128 | stepnumber=1, % the step between two line-numbers. 
If it's 1, each line will be numbered 129 | stringstyle=\color{mymauve}, % string literal style 130 | tabsize=4, % sets default tabsize to 4 spaces 131 | %title=\lstname, % show the filename of files included with \lstinputlisting; also try caption instead of title 132 | texcl=true, 133 | morecomment=[l][basicstyle]{http://} 134 | } 135 | 136 | % Config Headers y footers 137 | %\pagestyle{fancy} 138 | %\fancyhf{} 139 | %\renewcommand{\sectionmark}[1]{\markright{#1}{}} % Stop showing section numbers in the header 140 | %\renewcommand{\subsectionmark}[1]{\markright{#1}{}} % Stop showing subsection numberless in the header 141 | %\renewcommand{\subsubsectionmark}[1]{\markright{#1}{}} % Stop showing subsubsection numberless in the header 142 | 143 | % Cool Paragraph 144 | \makeatletter 145 | \renewcommand{\paragraph}{\@startsection{paragraph}{4}{0ex}% 146 | {-3.25ex plus -1ex minus -0.2ex}% 147 | {1ex plus 0.2ex}% 148 | {\normalfont\normalsize\bfseries}} 149 | \makeatother 150 | 151 | \renewcommand{\baselinestretch}{1.3} 152 | 153 | % Config caption names: 154 | \renewcommand{\lstlistingname}{Algorithm} 155 | 156 | % Usage: \circled{1}[\leq] 157 | \newcommand*\circledaux[1]{\tikz[baseline=(char.base)]{ 158 | \node[shape=circle,draw,inner sep=0.8pt] (char) {#1};}} 159 | 160 | \NewDocumentCommand{\circled}{ m o }{% 161 | \IfNoValueTF{#2}{ \circledaux{#1} }{ \stackrel{\circledaux{#1}}{#2} }% 162 | } 163 | 164 | 165 | %\rhead{\fancyplain{}{}} % predefined () 166 | %\lhead{\fancyplain{}{\rightmark }} % 1. sectionname, 1.1 subsection name etc 167 | %\cfoot{\fancyplain{}{\thepage}} 168 | 169 | % Totally necessary: always writes correctly epsilon and phi 170 | \let\temp\phi 171 | \let\phi\varphi 172 | \let\varphi\temp 173 | \let\temp\epsilon 174 | \let\epsilon\varepsilon 175 | \let\varepsilon\temp 176 | \renewcommand{\star}{\ast} 177 | 178 | % My definitions 179 | \newcommand{\Ss}{{\EuScript S}} 180 | \newcommand{\Aa}{{\EuScript A}} 181 | \newcommand{\Ab}{\text{Ab}} 182 | 183 | 184 | \newcommand{\x}{{\tt x}} \newcommand{\y}{{\tt y}} 185 | \newcommand{\z}{{\tt z}} \renewcommand{\t}{{\tt t}} 186 | \newcommand{\s}{{\tt s}} \newcommand{\ww}{{\tt w}} 187 | \newcommand{\uu}{{\tt u}} 188 | \newcommand{\Var}[1]{\text{Var}\left[#1\right]} 189 | \newcommand{\Cov}[1]{\text{Cov}\left[#1\right]} 190 | \renewcommand{\P}[1]{\mathbb{P}\left[#1\right]} 191 | \newcommand{\Vart}{\text{Var}} 192 | \newcommand{\E}[1]{\mathbb{E}\left[ #1 \right]} 193 | \newcommand{\R}{\mathbb{R}} 194 | \newcommand{\Z}{\mathbb{Z}} 195 | \newcommand{\N}{\mathbb{N}} 196 | \newcommand{\pa}[1]{\left( #1\right)} 197 | \newcommand{\norm}[1]{\left\| #1 \right\|} 198 | \newcommand{\abs}[1]{\left| #1 \right|} 199 | %\renewcommand{\dot}[1]{\left\langle #1\right\rangle} 200 | \renewcommand{\L}{\mathscr{L}} 201 | \newcommand{\dirich}[1]{\mathcal{E}\left( #1 \right)} 202 | \newcommand{\grad}{\nabla} 203 | \renewcommand{\exp}[1]{\text{exp}\left(#1\right)} 204 | \newcommand{\Ent}[1]{\text{Ent}\left[#1\right]} 205 | \newcommand{\Entt}{\text{Ent}} 206 | \newcommand{\Lip}{\text{Lip}} 207 | \newcommand{\diam}[1]{\text{diam}\left(#1\right)} 208 | 209 | \newcommand{\one}[1]{\mathds{1}} 210 | \newcommand{\ip}[2]{\left\langle{#1},{#2}\right\rangle} 211 | 212 | \DeclareMathOperator*{\argmax}{arg\,max} 213 | \DeclareMathOperator*{\argmin}{arg\,min} 214 | 215 | % Rules 216 | \newcommand{\HRule}{\rule{\linewidth}{0.5mm}} % Title's rule 217 | 218 | \renewcommand{\arraystretch}{1.5} % Space between rows in tabular 219 | \usepackage{multirow} 220 | 221 | 
\usepackage{xcolor} 222 | \usepackage[framemethod=tikz]{mdframed} 223 | 224 | \definecolor{cccolor}{rgb}{.67,.7,.67} 225 | 226 | 227 | \usepackage{mdframed} 228 | \usetikzlibrary{shadows} 229 | \newmdtheoremenv[shadow=true, shadowsize=5pt]{boxedthm}{Theorem} %TODO shared counter + italic font 230 | 231 | 232 | 233 | % Wrapper for pseudocode 234 | \usepackage{algorithm} 235 | % Pseudocode 236 | \usepackage[noend]{algpseudocode}% https://tex.stackexchange.com/questions/177025/hyperref-cleveref-and-algpseudocode-same-identifier-warning 237 | 238 | % PseudoCode 239 | \newcommand*\var{\mathit} % Variables in pseudocode 240 | \newcommand*\fn{\operatorname} % Functions in pseudocode 241 | \newcommand{\code}{\texttt} % Inline Code 242 | 243 | \makeatletter 244 | \newcounter{algorithmicH}% New algorithmic-like hyperref counter 245 | \let\oldalgorithmic\algorithmic 246 | \renewcommand{\algorithmic}{% 247 | \stepcounter{algorithmicH}% Step counter 248 | \oldalgorithmic}% Do what was always done with algorithmic environment 249 | \renewcommand{\theHALG@line}{ALG@line.\thealgorithmicH.\arabic{ALG@line}} 250 | \makeatother 251 | 252 | \iffalse 253 | \begin{algorithm}[!htp] 254 | \caption{Rejection Sampling}\label{lst:rej_samp} 255 | \begin{algorithmic}[1] 256 | \Procedure{$\operatorname{rejection\_sampling}$}{$f, g, M$} 257 | \While{\code{true}} 258 | \State $x \gets $ \code{sample}$\pa{g}$ 259 | \State $\var{accept} \gets \frac{f(x)}{Mg(x)}$ 260 | \If{\code{sample}$\pa{\mathcal{U}(0,1)} < \var{accept}$} 261 | \State \Return $x$ \Comment{Accept $x$} 262 | \EndIf 263 | \EndWhile 264 | \EndProcedure 265 | \end{algorithmic} 266 | \end{algorithm} 267 | \fi 268 | 269 | \usepackage{epigraph} 270 | \setlength{\epigraphwidth}{0.5\linewidth} 271 | \setlength{\epigraphrule}{0pt} 272 | \renewcommand*{\textflush}{flushright} 273 | \renewcommand*{\epigraphsize}{\normalsize\itshape} 274 | 275 | \usepackage[capitalise,nameinlink,noabbrev]{cleveref} % Cite with \cref or \Cref so the name of the object (Theorem, Proposition, etc.) is written automatically 276 | 277 | % Customized sections: http://tex.stackexchange.com/questions/136527/section-numbering-without-numbers/136541#136541 278 | 279 | %\usepackage{titlesec} 280 | %\titlelabel{\thetitle.\enspace} 281 | %\titleformat{\section} 282 | % {\normalsize\bfseries\centering} % The style of the section title 283 | % {} % a prefix 284 | % {0pt} % How much space exists between the prefix and the title 285 | % {Question \thesection} % How the section is represented 286 | % %{Section \thesection:\quad} % How the section is represented 287 | % 288 | %% Starred variant 289 | %\titleformat{name=\section,numberless} 290 | % {\normalfont\Large\bfseries} 291 | % {} 292 | % {0pt} 293 | % {} 294 | 295 | % Graphics 296 | 297 | 298 | %================================================================================ 299 | % Comments 300 | %================================================================================ 301 | \iffalse 302 | 303 | % Align 304 | \begin{align*} 305 | \begin{aligned} 306 | i &= i \\ 307 | &= i \\ 308 | \end{aligned} 309 | \end{align*} 310 | 311 | % Stack things 312 | \stackrel{?}{<} 313 | 314 | % Graphics 315 | \begin{figure}[h!] 
316 | \centering 317 | \includegraphics[scale=0.1]{1} 318 | \caption{SGD adaptation} 319 | \end{figure} 320 | 321 | \fi 322 | 323 | \title{} 324 | \date{} 325 | \author{} 326 | 327 | 328 | 329 | \begin{document} 330 | 331 | \section{Gradient Descent Finds Global Minima of Deep Neural Networks} 332 | \subsection*{Definitions} 333 | \begin{itemize} 334 | \item $m$: Width of each layer of the neural network. 335 | \item $n$: number of samples. 336 | \item $d$: dimension of training data. 337 | \item $H$: number of layers of the neural network. 338 | \item $\eta$: learning rate for gradient descent. 339 | \item $\theta$: parameters of the neural network. 340 | \item $\theta(k)$: parameters of the neural network after $k$ iterations of training with gradient descent. $\theta(0)$ are the parameters at initialization (iid $N(0,1)$). 341 | \item $\sigma$: Activation function. It is Lipschitz, smooth, analytical and not a polynomial. 342 | \item $(\mathbf{x}_i, y_i) \in \R^d\times\R, 1\leq i\leq n$: training data and corresponding labels. In this work, it is assumed that no two input points are parallel, i.e. $x_i \nparallel x_j$ for $i\neq j$. 343 | \item $\mathbf{y} = (y_1,\dots, y_n) \in \R^n$: vector of labels. 344 | \item $\mathbf{W}^{(1)} \in \R^{m\times d}, \mathbf{W}^{(h)}\in \R^{m\times m} 2\leq h\leq H, \mathbf{a}\in \R^m$ are, respectively, the first layer, the $h$ layer and the output layer of the neural network respectively. We also use $\mathbf{W}^{(h)}(k)$, $\mathbf{a}(k)$ to denote the layers after $k$ iterations of training with GD. 345 | \item $c_{\sigma}=\left(\mathbb{E}_{x \sim N(0,1)}\left[\sigma(x)^{2}\right]\right)^{-1}$ is a scaling factor to normalize the input in the initialization phase of the neural network. 346 | \item \textbf{Fully-connected neural network (NN)}. Let $\mathbf{x}^{(0)}$ be an input of the NN. Then the fully-connected neural network function $f$ is defined recursively in the following way: 347 | \begin{align*} 348 | \begin{aligned} 349 | \mathbf{x}^{(h)} &= \sqrt{\frac{c_\sigma}{m}} \sigma\left(\mathbf{W}^{(h)} \mathbf{x}^{(h-1)}\right), 1 \leq h \leq H \\ 350 | f(\mathbf{x}, \theta) &= \mathbf{a}^\top \mathbf{x}^{(H)}. 351 | \end{aligned} 352 | \end{align*} 353 | where $c_{\sigma}=\left(\mathbb{E}_{x \sim N(0,1)}\left[\sigma(x)^{2}\right]\right)^{-1}$ is the scaling defined above. 354 | 355 | \item \textbf{Loss function ($\ell_2$)}. $L(\theta) = \frac{1}{2}\sum_{i=1}^n (f(\theta,\mathbf{x}_i)-y_i)^2$. 356 | \item $u_i(k) = f(\theta(k), \mathbf{x}_i)$. Output of the NN for sample $i$ after $k$ iterations of GD. 357 | \item $\mathbf{u}(k) = (u_1(k), \dots, u_n(k))^\top \in \R^n$. 358 | \item $\mathbf{G}^{(h)}(k) \in \R^{n\times n}$, $1\leq h \leq H+1$ defined as $\mathbf{G}_{ij}^{(h)}(k) = \left\langle\frac{\partial u_{i}(k)}{\partial \mathbf{W}^{(h)}(k)}, \frac{\partial u_{j}(k)}{\partial \mathbf{W}^{(h)}(k)}\right\rangle$ for $h=1, \ldots, H$ and $\mathbf{G}_{i j}^{(H+1)}(k)=\left\langle\frac{\partial u_{i}(k)}{\partial \mathbf{a}(k)}, \frac{\partial u_{j}(k)}{\partial \mathbf{a}(k)}\right\rangle$. So that the following definition can be used to express the dynamics of the NN. 359 | \item $\mathbf{G}(k)$ defined as $\mathbf{G}_{ij}(k) = \sum_{h=1}^{H+1} \mathbf{G}_{ij}^{(h)}(k)$. 
Note that in the infinite-width NTK limit the network behaves as its linearization, so that
360 | \[
361 | \mathbf{y}-\mathbf{u}(k+1)=(\mathbf{I}-\eta K)(\mathbf{y}-\mathbf{u}(k)),
362 | \]
363 | We want to argue that
364 | \[
365 | \mathbf{y}-\mathbf{u}(k+1)\approx(\mathbf{I}-\eta \mathbf{G}(k))(\mathbf{y}-\mathbf{u}(k)),
366 | \]
367 | in a precise way. Note the gradient descent update is
368 | \begin{align*}
369 | \begin{aligned}
370 | \mathbf{W}^{(h)}(k) &=\mathbf{W}^{(h)}(k-1)-\eta \frac{\partial L(\theta(k-1))}{\partial \mathbf{W}^{(h)}(k-1)}, \\
371 | \mathbf{a}(k) &=\mathbf{a}(k-1)-\eta \frac{\partial L(\theta(k-1))}{\partial \mathbf{a}(k-1)}.
372 | \end{aligned}
373 | \end{align*}
374 | 
375 | \begin{remark}
376 | Each entry of $\mathbf{G}^{(h)}(k)$ is an inner product, i.e.\ $\mathbf{G}^{(h)}(k)$ is a Gram matrix, and thus it is a PSD matrix. Furthermore, if there exists one $h\in[H]$ such that $\mathbf{G}^{(h)}(k)$ is strictly positive definite, then choosing the step size $\eta$ sufficiently small makes the loss decrease at the $k$-th iteration, by an argument analogous to the analysis of the power method, which gives a linear convergence rate. In the paper they focus on $\mathbf{G}^{(H)}(k)$ only.
377 | \end{remark}
378 | 
379 | \item $\mathbf{K}^{(h)}$ is a fixed matrix which depends on the input data and the neural network architecture (including the activation function), but does not depend on the parameters $\theta$. It will be shown that $\mathbf{G}^{(H)}(0)$ at initialization is close to $\mathbf{K}^{(H)}$, that $\mathbf{G}^{(H)}(k)$ is close to $\mathbf{G}^{(H)}(0)$ and that $\mathbf{K}^{(H)}$ is strictly positive definite (its least eigenvalue $\lambda_0$ is positive). These three things imply linear convergence of gradient descent, by proving that the minimum eigenvalue of $\mathbf{G}^{(H)}(k)$ is bounded below by a positive constant independent of $k$. The definition of these matrices for the fully-connected neural network is the following:
380 | \begin{align}
381 | \begin{aligned}
382 | \mathbf{K}_{i j}^{(0)} &=\left\langle\mathbf{x}_{i}, \mathbf{x}_{j}\right\rangle \\ \mathbf{A}_{i j}^{(h)} &=\left(\begin{array}{cc}{\mathbf{K}_{i i}^{(h-1)}} & {\mathbf{K}_{i j}^{(h-1)}} \\ {\mathbf{K}_{j i}^{(h-1)}} & {\mathbf{K}_{j j}^{(h-1)}}\end{array}\right) \\ \mathbf{K}_{i j}^{(h)} &=c_{\sigma} \mathbb{E}_{(u, v)^{\top} \sim N\left(\mathbf{0}, \mathbf{A}_{i j}^{(h)}\right)}[\sigma(u) \sigma(v)] \\ \mathbf{K}_{i j}^{(H)} &=c_{\sigma} \mathbf{K}_{i j}^{(H-1)} \mathbb{E}_{(u, v)^{\top} \sim N\left(\mathbf{0}, \mathbf{A}_{i j}^{(H-1)}\right)}\left[\sigma^{\prime}(u) \sigma^{\prime}(v)\right]
383 | \end{aligned}
384 | \end{align}
385 | 
386 | \item $u_{i}^{\prime}(\theta) = \frac{\partial u_{i}}{\partial \theta}, u_{i}^{(h)}(\theta) = \frac{\partial u_{i}}{\partial \mathbf{W}^{(h)}}, u_{i}^{(a)}(\theta) = \frac{\partial u_{i}}{\partial \mathbf{a}}, L^{\prime}(\theta)=\frac{\partial L(\theta)}{\partial \theta}, L^{(h)}(\mathbf{W}^{(h)})=\frac{\partial L(\theta)}{\partial \mathbf{W}^{(h)}}, L^{(a)}(\theta) = \frac{\partial L}{\partial \mathbf{a}}$.
387 | 
388 | \end{itemize}
389 | 
390 | \subsection*{Results}
391 | 
392 | The paper proves global linear convergence, i.e.\ convergence to zero training error, for some deep network architectures, with high probability with respect to the initialization, assuming the networks are sufficiently overparametrized and that the $\ell_2$ loss is used. Note the learning rate has to be quite small, much smaller than what would be used in practice.
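For intuition about the linear rates proved below (a heuristic reading, not an argument taken from the paper): if the idealized recursion $\mathbf{y}-\mathbf{u}(k+1)=(\mathbf{I}-\eta K)(\mathbf{y}-\mathbf{u}(k))$ displayed above held exactly for a fixed positive definite kernel $K$ and $\eta \le 1/\lambda_{\max}(K)$, then
\[
\norm{\mathbf{y}-\mathbf{u}(k)}_2 \le \norm{\mathbf{I}-\eta K}_2^{\,k}\,\norm{\mathbf{y}-\mathbf{u}(0)}_2 = \left(1-\eta\lambda_{\min}(K)\right)^{k}\norm{\mathbf{y}-\mathbf{u}(0)}_2,
\]
so the proofs below can be read as controlling how far $\mathbf{G}(k)$ drifts from such a fixed positive definite kernel; the factor $1/2$ that will appear in the rate absorbs the resulting approximation errors.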
Another caveat is that the overparametrization depends on $\lambda_0$, the minimum eigenvalue of $\mathbf{K}^{(H)}$, which is proved to be positive, but no guarantee is provided that $\lambda_0$ is not arbitrarily small in some cases.
393 | 
394 | The results of the paper are for fully-connected NNs, which need overparametrization exponential in the depth, for ResNets, for which this dependence on the depth drops to a polynomial, and for convolutional ResNets. In these notes we focus on the fully-connected architecture for simplicity. The arguments are quite similar across architectures.
395 | 
396 | \begin{thm}[Convergence Rate of Gradient Descent for Deep Fully-connected Neural Networks]\label{thm:convergence}
397 | Assume for all $i \in [n]$, $\norm{\mathbf{x}_i}_2 = 1$, $\abs{y_i} = O(1)$ and the number of hidden nodes per layer satisfies
398 | \begin{align*}
399 | m=\Omega\left(2^{O(H)}\max\left\{
400 | \frac{n^4}{\lambda_{\min}^4\left(\mathbf{K}^{(H)}\right)},\frac{n}{\delta}, \frac{n^2\log(\frac{Hn}{\delta})}{\lambda_{\min}^2\left(\mathbf{K}^{(H)}\right)}
401 | \right\}\right).
402 | \end{align*}
403 | If we set the step size
404 | \[\eta = O\left(\frac{\lambda_{\min}\left(\mathbf{K}^{(H)}\right)}{n^22^{O(H)}}\right),\]
405 | then with probability at least $1-\delta$ over the random initialization, for $k=1,2,\ldots$, the loss at each iteration satisfies
406 | \begin{align*}
407 | L(\theta(k))\le \left(1-\frac{\eta \lambda _{\min}\left(\mathbf{K}^{(H)}\right)}{2}\right)^{k}L(\theta(0)).
408 | \end{align*}
409 | \end{thm}
410 | 
411 | In order to prove the theorem, we introduce a few lemmas. First, we state the condition of the theorem that we want to prove for all $k$ with high probability, where $\lambda_0$ is the minimum eigenvalue of $\mathbf{K}^{(H)}$.
412 | 
413 | \begin{condition}\label{cond:linear_converge}
414 | At the $k$-th iteration, we have \begin{align*}
415 | \norm{\mathbf{y}-\mathbf{u}(k)}_2^2 \le (1-\frac{\eta \lambda_0}{2})^{k} \norm{\mathbf{y}-\mathbf{u}(0)}_2^2.
416 | \end{align*}
417 | \end{condition}
418 | 
419 | 
420 | \begin{lemma}[Initialization norm] If $\sigma(\cdot)$ is $L$-Lipschitz and $m= \Omega\left(\frac{nHg_c(H)^2}{\delta}\right)$ with $c = c_\sigma L(2\abs{\sigma(0)} \sqrt{\frac{2}{\pi}}+2L)$, then with probability at least $1-\delta$ over random initialization, for every $h \in [H]$ and $i \in [n]$ we have
421 | \[
422 | \frac{1}{c_{x, 0}} \leq\left\|\mathbf{x}_{i}^{(h)}(0)\right\|_{2} \leq c_{x, 0},
423 | \]
424 | where $c_{x,0}=2$.
425 | \end{lemma}
426 | A similar lemma can be proven for different architectures with a different value of $c_{x,0}$. This lemma is needed in the proofs of Lemmas \ref{lemma:activations_stability} and \ref{lemma:eigenvalue_stability_while_training}.
427 | 
428 | \begin{lemma}[Least Eigenvalue at the Initialization] If $m= \Omega\left(\frac{n^2\log(Hn/\delta)2^{O(H)}}{\lambda_0^2}\right)$ we have
429 | \[
430 | \lambda_{\textup{min}}(\mathbf{G}^{(H)}(0)) \geq \frac{3}{4}\lambda_0.
431 | \]
432 | \end{lemma}
433 | 
434 | \begin{lemma}[Stability of the activations during training]\label{lemma:activations_stability}
435 | Suppose for every $h\in[H]$, $\norm{\mathbf{W}^{(h)}(0)}_2 \le c_{w,0}\sqrt{m}$, $\norm{\mathbf{x}^{(h)}(0)}_2 \le c_{x,0}$ and $\norm{\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)}_F \le \sqrt{m} R$ for some constants $c_{w,0},c_{x,0} > 0$ and $R \le c_{w,0}$.
436 | If $\sigma(\cdot)$ is $L-$Lipschitz, we have \begin{align*} 437 | \norm{\mathbf{x}^{(h)}(k)-\mathbf{x}^{(h)}(0)}_2 \le \sqrt{c_{\sigma}}Lc_{x,0}g_{c_x}(h)R 438 | \end{align*} where $c_x=2\sqrt{c_{\sigma}}Lc_{w,0}$. 439 | \end{lemma} 440 | 441 | \begin{lemma} \label{lemma:eigenvalue_stability_while_training} Suppose $\sigma(\cdot)$ is $L-$Lipschitz and $\beta-$smooth. Suppose for $h\in[H]$, $\norm{\mathbf{W}^{(h)}(0)}_2\le c_{w,0}\sqrt{m}$, $\norm{\mathbf{a}(0)}_2\le a_{2,0}\sqrt{m}$, $\norm{\mathbf{a}(0)}_4\le a_{4,0}m^{1/4}$ , $\frac{1}{c_{x,0}}\le\norm{\mathbf{x}^{(h)}(0)}_2 \le c_{x,0}$, if $\norm{\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)}_F$, $\norm{\mathbf{a}(k)-\mathbf{a}(0)}_2 \le \sqrt{m}R$ where $R \le c g_{c_x}(H)^{-1}\lambda_0n^{-1}$ and $R\le c g_{c_x}(H)^{-1}$ for some small constant $c$ and $c_x = 2\sqrt{c_{\sigma}}Lc_{w,0}$, we have \begin{align*} 442 | \norm{\mathbf{G}^{(H)}(k) - \mathbf{G}^{(H)}(0)}_2 \le \frac{\lambda_0}{4}. 443 | \end{align*} 444 | \end{lemma} 445 | The assumption $\norm{W^{(h)}(0)}_2 \leq c_{w,0}\sqrt{m}$ is a well know fact of gaussian initialized matrices and the bounds on $\norm{a(0)}_2$ and $\norm{a(0)}_4$ can be proved using standard concentration inequalities. $a_{2,0}$ and $a_{4,0}$ are universal constants. 446 | 447 | \begin{lemma} \label{lemma:weights_stability} 448 | If Condition~\ref{cond:linear_converge} holds for $k'=1,\ldots,k$, we have for any $s =1,\ldots,k+1$ 449 | \begin{align*} 450 | &\norm{\mathbf{W}^{(h)}(s)-\mathbf{W}^{(h)}(0)}_F, \norm{\mathbf{a}(s)-\mathbf{a}(0)}_2 \le R'\sqrt{m}\\ 451 | &\norm{\mathbf{W}^{(h)}(s)-\mathbf{W}^{(h)}(s-1)}_F, \norm{\mathbf{a}(s)-\mathbf{a}(s-1)}_2\le \eta Q'(s-1) 452 | \end{align*}where $R'=\frac{16c_{x,0}a_{2,0}\left(c_x\right)^H \sqrt{n} \norm{\mathbf{y}-\mathbf{u}(0)}_2}{\lambda_0\sqrt{m}} \le cg_{c_x}(H)^{-1}$ for some small constant $c$ with $c_x=\max\{2\sqrt{c_{\sigma}}Lc_{w,0},1\}$ and $ Q'(s)= 4c_{x,0}a_{2,0}\left(c_x\right)^{H}\sqrt{n} \norm{\mathbf{y}-\mathbf{u}(s)}_2$ 453 | 454 | \end{lemma} 455 | 456 | \begin{lemma}\label{lemma:small_snd_order_term} 457 | Let 458 | \[ 459 | I_2^i(k) = \int_{s=0}^{\eta}\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))-u_{i}^{\prime}\left(\theta(k)-s L^{\prime}(\theta(k))\right)\right\rangle d s 460 | \] 461 | and $\mathbf{I}_2(k) = (I_2^1(k), \dots, I_2^n(k))^\top$. 462 | If Condition~\ref{cond:linear_converge} holds for $k'=1,\ldots,k$, suppose $\eta\le c\lambda_0\left(n^{2}H^2(c_x)^{3H}g_{2c_x}(H)\right)^{-1}$ for some small constant $c$, we have \begin{align*} 463 | \norm{\mathbf{I}_2(k)}_2 \le \frac{1}{8}\eta \lambda_0 \norm{\mathbf{y}-\mathbf{u}(k)}_2. 464 | \end{align*} 465 | \end{lemma} 466 | 467 | \begin{lemma}\label{lemma:small_snd_order_term_2} 468 | If Condition~\ref{cond:linear_converge} holds for $k'=1,\ldots,k$, suppose $\eta\le c\lambda_0\left(n^{2}H^2(c_x)^{2H}g_{2c_x}(H)\right)^{-1}$ for some small constant $c$, then we have 469 | $\norm{\mathbf{u}(k+1)-\mathbf{u}(k)}_2^2\le \frac{1}{8}\eta \lambda_0 \norm{\mathbf{y}-\mathbf{u}(k)}_2^2$. 470 | 471 | \end{lemma} 472 | 473 | \begin{proof}[Proof of Theorem \ref{thm:convergence}] 474 | We want to prove Condition \ref{cond:linear_converge} for all $k$. We proceed by induction. 
Note that
475 | \begin{equation} \label{eq:decomposition}
476 | \begin{aligned} &\|\mathbf{y}-\mathbf{u}(k+1)\|_{2}^{2} \\=&\|\mathbf{y}-\mathbf{u}(k)-(\mathbf{u}(k+1)-\mathbf{u}(k))\|_{2}^{2} \\=&\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}-2(\mathbf{y}-\mathbf{u}(k))^{\top}(\mathbf{u}(k+1)-\mathbf{u}(k))+\|\mathbf{u}(k+1)-\mathbf{u}(k)\|_{2}^{2} \end{aligned}
477 | \end{equation}
478 | 
479 | We need the second summand to be larger in absolute value than the third one for the loss to decrease. Intuitively this is true because, by a Taylor expansion of $\mathbf{u}(k+1)-\mathbf{u}(k)$ with respect to $\eta$, the second summand is of order $\eta$ plus second-order terms and the third summand is of order $\eta^2$, so for $\eta$ small enough we can prove that the loss decreases. Then we have to prove that the first-order term in $\eta$ is compatible with the rate in Condition~\ref{cond:linear_converge}. Expanding one coordinate of $\mathbf{u}(k+1)-\mathbf{u}(k)$ by Taylor we obtain
480 | \begin{align*}
481 | \begin{aligned}
482 | \mathbf{u}_i(k+1)-\mathbf{u}_i(k) = \left( -\eta\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))\right\rangle \right) + I_2^i(k)
483 | \end{aligned}
484 | \end{align*}
485 | where, following the notation of the paper, we denote by $I_2^i(k)$ the second-order term in $\eta$. It is equal to
486 | \[
487 | I_2^i(k) = \int_{s=0}^{\eta}\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))-u_{i}^{\prime}\left(\theta(k)-s L^{\prime}(\theta(k))\right)\right\rangle \mathrm{d}s.
488 | \]
489 | But let's focus on the first term, which we denote $I_1^i(k)$, and let $\mathbf{I}_1(k) =(I_1^1(k), \dots, I_1^n(k))^\top $ and $\mathbf{I}_2(k) =(I_2^1(k), \dots, I_2^n(k))^\top $. We have
490 | \begin{align*}
491 | \begin{aligned} I_{1}^{i} &=-\eta\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))\right\rangle \\ &=-\eta \sum_{j=1}^{n}\left(u_{j}-y_{j}\right)\left\langle u_{j}^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))\right\rangle \\ & \triangleq-\eta \sum_{j=1}^{n}\left(u_{j}-y_{j}\right) \sum_{h=1}^{H+1} \mathbf{G}_{i j}^{(h)}(k) \end{aligned}
492 | \end{align*}
493 | or, in matrix form,
494 | \[
495 | \mathbf{I}_{1}(k)=-\eta \mathbf{G}(k)(\mathbf{u}(k)-\mathbf{y}).
496 | \]
497 | Now observe that
498 | \begin{align} \label{ineq:bound_Gh}
499 | \begin{aligned}(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{I}_{1}(k) &=\eta(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{G}(k)(\mathbf{y}-\mathbf{u}(k)) \\ & \geq \eta\lambda_{\min }(\mathbf{G}(k))\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2} \\ & \geq \eta\lambda_{\min }\left(\mathbf{G}^{(H)}(k)\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}, \end{aligned}
500 | \end{align}
501 | 
502 | where the last inequality holds because each $\mathbf{G}^{(h)}(k)$ is PSD and $\mathbf{G}(k)=\sum_{h=1}^{H+1}\mathbf{G}^{(h)}(k)$. We will only need to look at $\mathbf{G}^{(H)}$, which has the following form
503 | \[
504 | \mathbf{G}_{i, j}^{(H)}(k)=\left(\mathbf{x}_{i}^{(H-1)}(k)\right)^{\top} \mathbf{x}_{j}^{(H-1)}(k) \cdot \frac{c_{\sigma}}{m} \sum_{r=1}^{m} a_{r}^{2} \sigma^{\prime}\left(\left(\theta_{r}^{(H)}(k)\right)^{\top} \mathbf{x}_{i}^{(H-1)}(k)\right) \sigma^{\prime}\left(\left(\theta_{r}^{(H)}(k)\right)^{\top} \mathbf{x}_{j}^{(H-1)}(k)\right)
505 | \]
506 | 
507 | In principle one could look at the whole $\mathbf{G}(k)$, but in the paper they do not do that. The analysis becomes simpler if only $\mathbf{G}^{(H)}$ is used.
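As a purely numerical illustration of the approximation that drives this proof (this is not part of the paper's argument; it assumes PyTorch is available and uses made-up toy sizes), the following sketch builds a small fully-connected network as defined above, forms $\mathbf{G}(k)$ from per-sample gradients with respect to all parameters, takes one gradient descent step, and checks that $\mathbf{y}-\mathbf{u}(k+1)\approx(\mathbf{I}-\eta\mathbf{G}(k))(\mathbf{y}-\mathbf{u}(k))$; the agreement improves as $m$ grows and $\eta$ shrinks.
\begin{lstlisting}[language=Python]
import torch

torch.manual_seed(0)
n, d, m, H, eta = 4, 3, 500, 2, 1e-3
X = torch.nn.functional.normalize(torch.randn(n, d), dim=1)  # unit-norm inputs
y = torch.randn(n)

sigma = torch.tanh
c_sigma = 1.0 / torch.tanh(torch.randn(100000)).pow(2).mean()  # Monte Carlo estimate of the scaling

Ws = [torch.randn(m, d, requires_grad=True)]
Ws += [torch.randn(m, m, requires_grad=True) for _ in range(H - 1)]
a = torch.randn(m, requires_grad=True)
params = Ws + [a]

def f(x):
    for W in Ws:
        x = torch.sqrt(c_sigma / m) * sigma(W @ x)
    return a @ x

def outputs():
    return torch.stack([f(X[i]) for i in range(n)])

def gram(u):
    rows = []
    for i in range(n):
        g = torch.autograd.grad(u[i], params, retain_graph=True)
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(rows)  # Jacobian of the outputs
    return J @ J.T         # Gram matrix summed over all parameter blocks

u0 = outputs()
G0 = gram(u0)
loss = 0.5 * (u0 - y).pow(2).sum()
grads = torch.autograd.grad(loss, params)
with torch.no_grad():  # one gradient descent step
    for p, g in zip(params, grads):
        p -= eta * g
u1 = outputs()
lhs = y - u1
rhs = (torch.eye(n) - eta * G0) @ (y - u0)
print((lhs - rhs).norm() / (y - u0).norm())  # small relative discrepancy
\end{lstlisting}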
508 | 509 | So putting all together we have 510 | \begin{align*} 511 | \begin{aligned} &\|\mathbf{y}-\mathbf{u}(k+1)\|_{2}^{2} \\ 512 | \circled{1}[\leq] &\left(1-\eta \lambda_{\min }\left(\mathbf{G}^{(H)}(k)\right)\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}-2(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{I}_{2}(k)+\|\mathbf{u}(k+1)-\mathbf{u}(k)\|_{2}^{2} \\ 513 | \circled{2}[\leq] & \left(1-\eta \lambda_{0}\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}-2(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{I}_{2}+\|\mathbf{u}(k+1)-\mathbf{u}(k)\|_{2}^{2} \\ 514 | \circled{3}[\leq] &\left(1-\frac{\eta \lambda_{0}}{2}\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}. 515 | \end{aligned} 516 | \begin{aligned}\end{aligned} 517 | \end{align*} 518 | 519 | $\circled{1}$ uses Equation \eqref{eq:decomposition} and inequality \eqref{ineq:bound_Gh}. \circled{3} uses Lemmas \ref{lemma:small_snd_order_term} and \ref{lemma:small_snd_order_term_2}. For $\circled{2}$, by induction hypothesis, using Lemma \ref{lemma:weights_stability} we obtain 520 | \[ 521 | \begin{aligned}\left\|\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)\right\|_{F} & \leq R^{\prime} \sqrt{m} \\ & \leq R \sqrt{m} \end{aligned} 522 | \] 523 | for the choice of $m$ in the theorem. By Lemma \ref{lemma:eigenvalue_stability_while_training} we get $\lambda_{\min }\left(\mathbf{G}^{(H)}(k)\right) \geq \frac{\lambda_{0}}{2}$. 524 | 525 | 526 | \end{proof} 527 | 528 | 529 | \nocite{*} % Include refs not cited 530 | \bibliography{refs} %use a bibtex bibliography file refs.bib 531 | \bibliographystyle{plain} %use the plain bibliography style 532 | 533 | \end{document} 534 | -------------------------------------------------------------------------------- /notes/low_rank_jac/.gitignore: -------------------------------------------------------------------------------- 1 | # -*- mode: gitignore; -*- 2 | *~ 3 | \#*\# 4 | /.emacs.desktop 5 | /.emacs.desktop.lock 6 | *.elc 7 | auto-save-list 8 | tramp 9 | .\#* 10 | 11 | # Org-mode 12 | .org-id-locations 13 | *_archive 14 | 15 | # flymake-mode 16 | *_flymake.* 17 | 18 | # eshell files 19 | /eshell/history 20 | /eshell/lastdir 21 | 22 | # elpa packages 23 | /elpa/ 24 | 25 | # reftex files 26 | *.rel 27 | 28 | # AUCTeX auto folder 29 | /auto/ 30 | 31 | # cask packages 32 | .cask/ 33 | dist/ 34 | 35 | # Flycheck 36 | flycheck_*.el 37 | 38 | # server auth directory 39 | /server/ 40 | 41 | # projectiles files 42 | .projectile 43 | 44 | # directory configuration 45 | .dir-locals.el 46 | 47 | # network security 48 | /network-security.data 49 | 50 | 51 | *.pdf 52 | *.pdf_tex 53 | *.synctex.gz 54 | 55 | ## Core latex/pdflatex auxiliary files: 56 | *.aux 57 | *.lof 58 | *.log 59 | *.lot 60 | *.fls 61 | *.out 62 | *.toc 63 | *.fmt 64 | *.fot 65 | *.cb 66 | *.cb2 67 | .*.lb 68 | 69 | ## Intermediate documents: 70 | *.dvi 71 | *.xdv 72 | *-converted-to.* 73 | # these rules might exclude image files for figures etc. 
74 | # *.ps 75 | # *.eps 76 | # *.pdf 77 | 78 | ## Generated if empty string is given at "Please type another file name for output:" 79 | .pdf 80 | 81 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 82 | *.bbl 83 | *.bcf 84 | *.blg 85 | *-blx.aux 86 | *-blx.bib 87 | *.run.xml 88 | 89 | ## Build tool auxiliary files: 90 | *.fdb_latexmk 91 | *.synctex 92 | *.synctex(busy) 93 | *.synctex.gz 94 | *.synctex.gz(busy) 95 | *.pdfsync 96 | 97 | ## Build tool directories for auxiliary files 98 | # latexrun 99 | latex.out/ 100 | 101 | ## Auxiliary and intermediate files from other packages: 102 | # algorithms 103 | *.alg 104 | *.loa 105 | 106 | # achemso 107 | acs-*.bib 108 | 109 | # amsthm 110 | *.thm 111 | 112 | # beamer 113 | *.nav 114 | *.pre 115 | *.snm 116 | *.vrb 117 | 118 | # changes 119 | *.soc 120 | 121 | # comment 122 | *.cut 123 | 124 | # cprotect 125 | *.cpt 126 | 127 | # elsarticle (documentclass of Elsevier journals) 128 | *.spl 129 | 130 | # endnotes 131 | *.ent 132 | 133 | # fixme 134 | *.lox 135 | 136 | # feynmf/feynmp 137 | *.mf 138 | *.mp 139 | *.t[1-9] 140 | *.t[1-9][0-9] 141 | *.tfm 142 | 143 | #(r)(e)ledmac/(r)(e)ledpar 144 | *.end 145 | *.?end 146 | *.[1-9] 147 | *.[1-9][0-9] 148 | *.[1-9][0-9][0-9] 149 | *.[1-9]R 150 | *.[1-9][0-9]R 151 | *.[1-9][0-9][0-9]R 152 | *.eledsec[1-9] 153 | *.eledsec[1-9]R 154 | *.eledsec[1-9][0-9] 155 | *.eledsec[1-9][0-9]R 156 | *.eledsec[1-9][0-9][0-9] 157 | *.eledsec[1-9][0-9][0-9]R 158 | 159 | # glossaries 160 | *.acn 161 | *.acr 162 | *.glg 163 | *.glo 164 | *.gls 165 | *.glsdefs 166 | *.lzo 167 | *.lzs 168 | 169 | # uncomment this for glossaries-extra (will ignore makeindex's style files!) 170 | # *.ist 171 | 172 | # gnuplottex 173 | *-gnuplottex-* 174 | 175 | # gregoriotex 176 | *.gaux 177 | *.gtex 178 | 179 | # htlatex 180 | *.4ct 181 | *.4tc 182 | *.idv 183 | *.lg 184 | *.trc 185 | *.xref 186 | 187 | # hyperref 188 | *.brf 189 | 190 | # knitr 191 | *-concordance.tex 192 | # TODO Comment the next line if you want to keep your tikz graphics files 193 | *.tikz 194 | *-tikzDictionary 195 | 196 | # listings 197 | *.lol 198 | 199 | # luatexja-ruby 200 | *.ltjruby 201 | 202 | # makeidx 203 | *.idx 204 | *.ilg 205 | *.ind 206 | 207 | # minitoc 208 | *.maf 209 | *.mlf 210 | *.mlt 211 | *.mtc[0-9]* 212 | *.slf[0-9]* 213 | *.slt[0-9]* 214 | *.stc[0-9]* 215 | 216 | # minted 217 | _minted* 218 | *.pyg 219 | 220 | # morewrites 221 | *.mw 222 | 223 | # nomencl 224 | *.nlg 225 | *.nlo 226 | *.nls 227 | 228 | # pax 229 | *.pax 230 | 231 | # pdfpcnotes 232 | *.pdfpc 233 | 234 | # sagetex 235 | *.sagetex.sage 236 | *.sagetex.py 237 | *.sagetex.scmd 238 | 239 | # scrwfile 240 | *.wrt 241 | 242 | # sympy 243 | *.sout 244 | *.sympy 245 | sympy-plots-for-*.tex/ 246 | 247 | # pdfcomment 248 | *.upa 249 | *.upb 250 | 251 | # pythontex 252 | *.pytxcode 253 | pythontex-files-*/ 254 | 255 | # tcolorbox 256 | *.listing 257 | 258 | # thmtools 259 | *.loe 260 | 261 | # TikZ & PGF 262 | *.dpth 263 | *.md5 264 | *.auxlock 265 | 266 | # todonotes 267 | *.tdo 268 | 269 | # vhistory 270 | *.hst 271 | *.ver 272 | 273 | # easy-todo 274 | *.lod 275 | 276 | # xcolor 277 | *.xcp 278 | 279 | # xmpincl 280 | *.xmpi 281 | 282 | # xindy 283 | *.xdy 284 | 285 | # xypic precompiled matrices and outlines 286 | *.xyc 287 | *.xyd 288 | 289 | # endfloat 290 | *.ttt 291 | *.fff 292 | 293 | # Latexian 294 | TSWLatexianTemp* 295 | 296 | ## Editors: 297 | # WinEdt 298 | *.bak 299 | *.sav 300 | 301 | # Texpad 302 | .texpadtmp 303 | 304 | # LyX 305 | *.lyx~ 306 | 307 | # Kile 308 | 
*.backup 309 | 310 | # gummi 311 | .*.swp 312 | 313 | # KBibTeX 314 | *~[0-9]* 315 | 316 | # auto folder when using emacs and auctex 317 | ./auto/* 318 | *.el 319 | 320 | # expex forward references with \gathertags 321 | *-tags.tex 322 | 323 | # standalone packages 324 | *.sta 325 | 326 | # Makeindex log files 327 | *.lpzreport.aux 328 | auto/ 329 | supp.zip 330 | -------------------------------------------------------------------------------- /notes/low_rank_jac/Makefile: -------------------------------------------------------------------------------- 1 | ALL=$(wildcard *.sty *.tex figs/*.svg) 2 | PAPER=low_rank_jac_thm 3 | SHELL=/bin/zsh 4 | 5 | #FIGS_SVG=$(wildcard figs/*.svg) 6 | #FIGS_PDF=$(FIGS_SVG:%.svg=%.pdf) 7 | 8 | #./figs/%.pdf: ./figs/%.svg ## Figures for the manuscript 9 | # inkscape -D -z --file=$< --export-pdf=$@ --export-latex 10 | 11 | #FIGS_SVG2=$(wildcard images_adv/*.svg) 12 | #FIGS_PDF2=$(FIGS_SVG2:%.svg=%.pdf) 13 | 14 | #./images_adv/%.pdf: ./images_adv/%.svg ## Figures for the manuscript 15 | # inkscape -D -z --file=$< --export-pdf=$@ --export-latex 16 | 17 | 18 | # all: $(FIGS_PDF2) $(FIGS_PDF) ## Build full thesis (LaTeX + figures) 19 | # pdflatex $(PAPER) 20 | # pdflatex $(PAPER) 21 | # bibtex $(PAPER) 22 | # pdflatex $(PAPER) 23 | # pdflatex $(PAPER) 24 | 25 | all: 26 | pdflatex $(PAPER) 27 | pdflatex $(PAPER) 28 | bibtex $(PAPER) 29 | pdflatex $(PAPER) 30 | pdflatex $(PAPER) 31 | 32 | clean: ## Clean LaTeX and output figure files 33 | rm -f *.out *.aux *.log *.blg *.bbl 34 | # rm -f $(FIGS_PDF) 35 | 36 | #watch: ## Recompile on any update of LaTeX or SVG sources 37 | # @while [ 1 ]; do; inotifywait $(ALL); sleep 0.01; make all; done 38 | -------------------------------------------------------------------------------- /notes/low_rank_jac/amartya_ltx.sty: -------------------------------------------------------------------------------- 1 | 2 | % Theorem Environments 3 | 4 | \newtheorem{thm}{Theorem} 5 | \newtheorem{lem}[thm]{Lemma} 6 | \newtheorem{corollary}[thm]{Corollary} 7 | \newtheorem{claim}[thm]{Claim} 8 | \newtheorem{proposition}[thm]{Proposition} 9 | \newtheorem{remark}{Remark} 10 | \newtheorem{defn}{Definition} 11 | \newtheorem{example}{Example} 12 | \newtheorem{assump}{Assumption} 13 | 14 | 15 | \def\LatinUpper{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z} 16 | \def\LatinLower{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} 17 | 18 | 19 | 20 | 21 | % Caligraphic fonts 22 | \newcommand{\genCal}[1]{\expandafter\newcommand\csname c#1\endcsname{{\mathcal #1}}} 23 | \@for\q:=\LatinUpper\do{% 24 | \expandafter\genCal\q 25 | } 26 | 27 | % Blackboard fonts 28 | \newcommand{\genBb}[1]{\expandafter\newcommand\csname b#1\endcsname{{\mathbb #1}}} 29 | \@for\q:=\LatinUpper\do{% 30 | \expandafter\genBb\q 31 | } 32 | 33 | % Fraktur fonts 34 | \newcommand{\genFk}[1]{\expandafter\newcommand\csname k#1\endcsname{{\mathfrak #1}}} 35 | \@for\q:=\LatinUpper\do{% 36 | \expandafter\genFk\q 37 | } 38 | 39 | \newcommand{\genFkl}[1]{\expandafter\newcommand\csname k#1\endcsname{{\mathfrak #1}}} 40 | \@for\q:=\LatinLower\do{% 41 | \expandafter\genFkl\q 42 | } 43 | 44 | 45 | % Vectors 46 | \renewcommand{\vec}[1]{{\mathbf{#1}}} 47 | \newcommand{\genLatinVec}[1]{\expandafter\newcommand\csname v#1\endcsname{{\vec #1}}} 48 | \@for\q:=\LatinLower\do{% 49 | \expandafter\genLatinVec\q 50 | } 51 | 52 | 53 | % Greek symbol vectors 54 | \def\mydefgreek#1{\expandafter\def\csname v#1\endcsname{\text{\boldmath$\mathbf{\csname #1\endcsname}$}}} 55 | 
\def\mydefallgreek#1{\ifx\mydefallgreek#1\else\mydefgreek{#1}% 56 | \lowercase{\mydefgreek{#1}}\expandafter\mydefallgreek\fi} 57 | \mydefallgreek {alpha}{beta}{gamma}{delta}{epsilon}{zeta}{eta}{theta}{iota}{kappa}{lambda}{mu}{nu}{xi}{omicron}{pi}{rho}{sigma}{tau}{upsilon}{phi}{chi}{psi}{omega}\mydefallgreek 58 | 59 | % Parentheses 60 | \newcommand{\bc}[1]{\left\{{#1}\right\}} 61 | \newcommand{\br}[1]{\left({#1}\right)} 62 | \newcommand{\bs}[1]{\left[{#1}\right]} 63 | \newcommand{\abs}[1]{\left| {#1} \right|} 64 | \newcommand{\ceil}[1]{\left\lceil #1 \right\rceil} 65 | \newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor} 66 | \newcommand{\bsd}[1]{\left\llbracket{#1}\right\rrbracket} 67 | \newcommand{\ip}[2]{\left\langle{#1},{#2}\right\rangle} 68 | 69 | % Vector notations 70 | \newcommand{\reals}{\mathbb{R}} 71 | 72 | %Important functions 73 | \newcommand{\sgn}[1]{\mathrm{sign}(#1)} 74 | \newcommand{\diag}[1]{\mathrm{diag}\left(#1\right)} 75 | \newcommand{\rank}[1]{\mathrm{rank}\left(#1\right)} 76 | \newcommand{\rad}[2]{\mathrm{RAD}_{#2}(#1)} 77 | \newcommand{\supp}{\mathop{\mathrm{sup}}} 78 | \newcommand{\inff}{\mathop{\mathrm{inf}}} 79 | \newcommand{\argmax}{\mathop{\mathrm{argmax}}} 80 | \newcommand{\argmin}{\mathop{\mathrm{argmin}}} 81 | \newcommand{\norm}[1]{\mathrm{\left\lVert#1\right\rVert}} 82 | 83 | % Complexity operators 84 | \newcommand{\bigO}[1]{O\left(#1\right)} 85 | \newcommand{\softO}[1]{\widetilde{\cO}\br{{#1}}} 86 | \newcommand{\Om}[1]{\Omega\br{{#1}}} 87 | \newcommand{\softOm}[1]{\tilde\Omega\br{{#1}}} 88 | -------------------------------------------------------------------------------- /notes/low_rank_jac/low_rank_jac_thm.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper]{article} 2 | \usepackage[utf8]{inputenc} % allow utf-8 input 3 | \usepackage[T1]{fontenc} % use 8-bit T1 fonts 4 | \usepackage{hyperref} % hyperlinks 5 | \usepackage{url} % simple URL typesetting 6 | \usepackage{booktabs} % professional-quality tables 7 | \usepackage{amsfonts} % blackboard math symbols 8 | \usepackage{nicefrac} % compact symbols for 1/2, etc. 
9 | \usepackage{microtype} % microtypography 10 | 11 | \setlength{\headheight}{14.0pt} % Removes fancy header warning (Not sure what it does) 12 | \usepackage[margin=3cm]{geometry} % To edit margins and their format 13 | \usepackage[english]{babel} % language 14 | \usepackage{indentfirst} % First paragraph of each section / subsection 15 | \usepackage{listings} % Show code 16 | \usepackage{fancyhdr} % Headers y footers 17 | \usepackage{multicol} % http://stackoverflow.com/questions/1491717/how-to-display-a-content-in-two-column-layout-in-latex 18 | \usepackage{blindtext} % For the cool paragraph (Enter after the paragraph section) 19 | \usepackage{textcomp} 20 | \usepackage{bussproofs} 21 | \usepackage{enumitem} % To enum with letters and other things 22 | \usepackage{leftidx} % left superindices 23 | \usepackage{euscript} % Fancy A and S for symmetry groups (among other things) 24 | \usepackage{dsfont} 25 | 26 | 27 | 28 | \usepackage{hyperref} 29 | \usepackage{enumerate} 30 | %\usepackage{enumitem} 31 | 32 | \usepackage{nicefrac} 33 | \usepackage{mathtools} 34 | \usepackage{amssymb} 35 | \usepackage{amsthm} 36 | \usepackage{bbm} 37 | 38 | 39 | \usepackage{algpseudocode} 40 | %\usepackage{algorithmic} 41 | \usepackage{algorithm} 42 | 43 | 44 | %%% DAVID%%%% 45 | 46 | % Totally necessary: always writes correctly epsilon 47 | \let\temp\epsilon 48 | \let\epsilon\varepsilon 49 | \let\varepsilon\temp 50 | \renewcommand{\star}{\ast} 51 | 52 | % My definitions 53 | \newcommand{\Ss}{{\EuScript S}} 54 | \newcommand{\Aa}{{\EuScript A}} 55 | \newcommand{\Ab}{\text{Ab}} 56 | 57 | 58 | \newcommand{\x}{{\tt x}} \newcommand{\y}{{\tt y}} 59 | \newcommand{\z}{{\tt z}} \renewcommand{\t}{{\tt t}} 60 | \newcommand{\s}{{\tt s}} \newcommand{\ww}{{\tt w}} 61 | \newcommand{\uu}{{\tt u}} 62 | \newcommand{\Var}[1]{\text{Var}\left[#1\right]} 63 | \newcommand{\Cov}[1]{\text{Cov}\left[#1\right]} 64 | \renewcommand{\P}[1]{\mathbb{P}\left[#1\right]} 65 | \newcommand{\Vart}{\text{Var}} 66 | \newcommand{\E}[1]{\mathbb{E}\left[ #1 \right]} 67 | \newcommand{\R}{\mathbb{R}} 68 | \newcommand{\Z}{\mathbb{Z}} 69 | \newcommand{\N}{\mathbb{N}} 70 | \newcommand{\pa}[1]{\left( #1\right)} 71 | %\newcommand{\norm}[1]{\left\| #1 \right\|} 72 | %\newcommand{\abs}[1]{\left| #1 \right|} 73 | %\renewcommand{\dot}[1]{\left\langle #1\right\rangle} 74 | \renewcommand{\L}{\mathscr{L}} 75 | \newcommand{\dirich}[1]{\mathcal{E}\left( #1 \right)} 76 | \newcommand{\grad}{\nabla} 77 | \renewcommand{\exp}[1]{\text{exp}\left(#1\right)} 78 | \newcommand{\Ent}[1]{\text{Ent}\left[#1\right]} 79 | \newcommand{\Entt}{\text{Ent}} 80 | \newcommand{\Lip}{\text{Lip}} 81 | \newcommand{\diam}[1]{\text{diam}\left(#1\right)} 82 | 83 | \newcommand{\one}[1]{\mathds{1}} 84 | %\newcommand{\ip}[2]{\left\langle{#1},{#2}\right\rangle} 85 | 86 | %%%%%% 87 | 88 | \usepackage{amartya_ltx} 89 | \title{Generelization Guarantees through Low Rank Jacobian} 90 | \author{} 91 | \date{} 92 | \begin{document} 93 | \maketitle 94 | 95 | 96 | \section{Generalization Guarantees For Neural Nets Via Harnessing the Low-Rankness of Jacobian} 97 | 98 | 99 | \subsection*{Definitions and notations.} 100 | \begin{itemize} 101 | \item $n$: number of samples. 102 | \item $d$: dimension of training data. 103 | \item $K$: Number of classes, dimension of the output. 104 | \item One hidden layer neural network with the form 105 | \[ 106 | x \mapsto f(x ; W):= V \phi(W x). 
107 | \] 108 | where $x\in\R^d$, $W\in \R^{k\times d}$, $V\in \R^{K\times k}$ and $\phi$ is an activation function that acts component-wise. Only $W$ is trained for simplicity in this work (but it is outlined how results can be generalized to the case in which $V$ is also trained). We use the shorthand 109 | \[ 110 | f(W) = [f(x_1;W)^\top, \dots, f(x_n;W)^\top]^\top \in \R^{nK}. 111 | \] 112 | \item $(x_i, y_i) \in \R^d\times\R^K, 1\leq i\leq n$: training data and corresponding labels (one-hot encodings). 113 | \item $\eta$: learning rate for gradient descent. 114 | \item $\theta \in \R^{kd}$: vectorized parameters of the neural network. We will denote $p=kd$. 115 | \item $\tilde{\theta} \in \R^{\max(Kn,p)}$: parameters of the linearized problem (more on this below). 116 | \item $\bar{\theta} \in \R^{\max(Kn,p)}$: $\theta$ (possibly) padded with $x$ zeroes so it has the same length as $\tilde{\theta}$. 117 | \item $y = (y_1^\top,\dots, y_n^\top)^\top \in \R^{nK}$: concatenation of labels. 118 | \item The loss function used in the optimization is the $\ell_2$ loss: 119 | \[ 120 | \mathcal{L}(W) = \frac{1}{2} \norm{f(W)-y}_2^2. 121 | \] 122 | \item The optimization algorithm is gradient descent, starting from an initialization $W_0$: 123 | \[ 124 | W_{\tau+1}=W_{\tau}-\eta \nabla \mathcal{L}\left(W_{\tau}\right). 125 | \] 126 | \item (Remember we use $\theta\in\R^{p}$ for the vectorization of $W$). We use 127 | \[ 128 | \mathcal{J}(\theta) = \frac{\partial f(\theta)}{\partial \theta} \in \R^{Kn\times p}\text{ so that } \theta_{\tau+1} = \theta_\tau - \eta \nabla \mathcal{L}(\theta_\tau) \text{ and } \nabla\mathcal{L}(\theta) = \mathcal{J}(\theta)^\top r(\theta). 129 | \] 130 | where we define the residual $r(\theta)$ as $f(\theta)- y$. 131 | \item \textbf{Information and Nuisance spaces}: For a matrix $J \in \R^{nK\times p}$ (that will typically be a Jacobian), consider its singular value decomposition 132 | \[ 133 | J=\sum_{s=1}^{n K} \lambda_{s} u_{s} v_{s}^{T}=U \operatorname{diag}\left(\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n K}\right) V^{T} 134 | \] 135 | with $\lambda_{1} \geq \lambda_{2} \geq \ldots \geq \lambda_{n K}$ and $u_s \in \R^{Kn}$, $v_s \in \R^p$ being the left and right singular vectors respectivelyi For a spectrum cutoff $0<\alpha<\lambda_1$ let $c = c(\alpha)$ denote the index of the smallest singular value above $\alpha$. Then the information and nuisance space associated with $J$ are defined as 136 | \[ 137 | \mathcal{I}:=\operatorname{span}\left(\left\{\boldsymbol{u}_{s}\right\}_{s=1}^{c}\right) \text { and } \mathcal{N}:=\operatorname{span}\left(\left\{\boldsymbol{u}_{s}\right\}_{s=c+1}^{K n}\right). 138 | \] 139 | \item Multiclass Neural Tangent Kernel (M-NTK). Let $w \sim \mathcal{N}(0, I_d)$. Consider $n$ input data points $x_1, \dots, x_n \in \R^d$ aggregated in $X \in \R^{n\times d}$ and activation $\phi$ (it is assumed to be Lipschitz and smooth but the authors argue that they assume it for simplicity and outline how the result could be extended to use relu as activation). We define the multiclass kernel 140 | \[ 141 | \Sigma(X):=I_{K} \otimes \mathbb{E}\left[\left(\phi^{\prime}(X w) \phi^{\prime}(X w)^{T}\right) \odot\left(X X^{T}\right)\right], 142 | \] 143 | where $\otimes$ is the Kronecker product and $\odot$ is the Hadamard product. 
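A minimal Monte Carlo sketch of this definition (illustrative only, not from the paper; it assumes NumPy, and \texttt{phi\_prime} stands for $\phi'$ while the concrete sizes are made up):
\begin{lstlisting}[language=Python]
import numpy as np

def mntk(X, K, phi_prime, num_samples=20000, seed=0):
    # Monte Carlo estimate of the multiclass kernel Sigma(X)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = np.zeros((n, n))
    G = X @ X.T
    for _ in range(num_samples):
        w = rng.standard_normal(d)
        a = phi_prime(X @ w)
        S += np.outer(a, a) * G   # Hadamard product with X X^T
    return np.kron(np.eye(K), S / num_samples)

X = np.random.default_rng(1).standard_normal((5, 3))
Sigma = mntk(X, K=2, phi_prime=lambda z: 1.0 / np.cosh(z) ** 2)  # phi = tanh
\end{lstlisting}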
This kernel is closely related to the Jacobian, it is known that $\mathbb{E}\left[\mathcal{J}\left(W_{0}\right) \mathcal{J}\left(W_{0}\right)^{T}\right]=\nu^{2} \Sigma(X)$ if $V$ has i.i.d zero-mean entries with $\frac{\nu^2}{K}$ variance and $W_0$ has i.i.d. $\mathcal{N}(0,1)$ entries. 144 | 145 | \end{itemize} 146 | 147 | \section{Overview} 148 | 149 | This work is along the lines of previous works that work with the NTK. In particular, using overparametrization, they will prove that the problem is close to its linearization $f(\theta) \approx f_{\operatorname{lin}}(\theta) = f(\theta_0) + \mathcal{J}(\theta_0)(\theta - \bar{\theta}_0)$ (since we will find solutions close to $\theta_0$) and that will allow them to state their optimization theorem for neural networks, to be explained later. The main trait of this work is that they remove the main assumption on the data (but the unit length assumption) made by other works and as a result their results incur some bias. In particular instead of assuming that two data points are not parallel, proving using this that the NTK is positive semi-definite and having a dependence (in terms of overparametrization and number of iterations needed) on the inverse minimum eigenvalue of the $NTK$ (e.g. \cite{du2018gradient}) or instead of assuming that any two data points satisfy $\norm{x_i-x_j} \geq \delta$ and having a dependence (in terms of overparametrization and number of iterations needed) on the inverse of $\delta$ (e.g. \cite{allen2018convergence}), they allow the NTK to have $0$ or very small eigenvalues and split the space into the information space (span of the first top left singular vectors of the jacobian at initialization or equivalently, first top eigenvectors of the NTK) and the nuisance space, proving that now the dependence (in terms of overparametrization and number of optimization time steps needed) is on the inverse of the lowest eigenvalue of the information space and the projection of the residual on the information space decreases exponentially while the projection of the residual on the nuisance space increases by a constant factor. They also work with the setting of arbitrary initialization in which under some assumptions, they can follow similar arguments to those made in the NTK, so they obtain an optimization guarantee in such a case. Also, the optimization guarantee translates to a generalization guarantee via the use of standard Rademacher complexity arguments. 150 | 151 | We outline now the main approach followed to prove their main (meta-)theorem: 152 | 153 | \begin{itemize} 154 | \item Provided that our network has enough overparametrization, we can \textbf{relate the training of the neural network with gradient descent with a linear method}. This is in the sense that both the trajectory and the residuals of the linear method and the residuals will be close. Given an initial point $\theta_0 \in \R^p$, define an $(\epsilon_0, \beta)$ reference Jacobian $J \in \R^{Kn\times \max(Kn,p)}$ a matrix satisfying: 155 | \[ 156 | \|J\| \leq \beta, \quad\left\|\mathcal{J}\left(\theta_{0}\right) \mathcal{J}^{T}\left(\theta_{0}\right)-J J^{T}\right\| \leq \epsilon_{0}^{2}, \quad \text { and } \quad\left\|\overline{\mathcal{J}}\left(\theta_{0}\right)-J\right\| \leq \epsilon_{0} 157 | \] 158 | where $\overline{\mathcal{J}}(\theta_0) \in \R^{Kn \times \max(Kn,p)}$ is a matrix obtained by appending $\max(0, Kn-p)$ zero columns to $\mathcal{J}(\theta_0)$ (note $\mathcal{J}(\theta_0) \in \R^{Kn\times p}$). 
159 | 160 | In the random initialization setting, the reference Jacobian will be the NTK. In the arbitrary initialization setting, the reference Jacobian will be the Jacobian at that initialization. 161 | 162 | The bounded spectra of $J$ will be an assumption in the arbitrary initialization and a consequence of the properties of the NTK in the other case. The reason why the other two conditions are true for the random initialization regime is that in the overparametrization regime the NTK of the finite net tends to the infinite width limit NTK. 163 | \item \textbf{Bounded perturbation.} Due to the overparametrization and the small choice of the learning rate we will have 164 | \[ 165 | \norm{\theta_0-\theta_\tau} < R, 166 | \] 167 | for a constant $R$ for all $t$ between $0$ and $T$, where $T$ is picked later. This will along with overparametrization imply 168 | \[ 169 | \norm{J(\theta_0)-J(\theta_\tau)} < \epsilon. 170 | \] 171 | for a constant $\epsilon$. 172 | \item Now if we followed other works on the NTK, we would \textbf{analyze the linear case} and would see that the residual of the linearized problem, $\tilde{r}_\tau$ evolves in a precise sense 173 | \[ 174 | \widetilde{r}_{\tau}=U\left(I-\eta \Lambda^{2}\right)^{\tau} a=\sum_{s=1}^{n K}\left(1-\eta \lambda_{s}^{2}\right)^{\tau} a_{s} u_{s} 175 | \] 176 | where we are using the matrices $U$ and $\Lambda$ that come from the singular value decomposition of the reference Jacobian $J = U\Lambda V$. Also, $\lambda_s$ are the diagonal entries of $\Lambda$, $u_s$ are the rows of $U$ and $a$ is a vector whose value is the projection of the initial residual via $U$, i.e. $a = U^\top\tilde{r}_0 = U^\top r_0$. Previous approaches used that $\lambda_{nK}^2$, (the smallest one) is positive, and set a good value of the overparametrization and learning rate parameters (high and low respectively) to show that the corresponding eigenvalue for the Jacobian at initialization is positive too and finally, they used the bounded perturbation property to conclude that the residual also decreases with time. In this work, we follow this approach only for the information space, and since there is no assumption on $\lambda_{nK}^2$ being $>0$, the approximation error incurred by the linearization could mean that the projection of the residual on the nuisance space is increasing. However, if it increases it does it at a slow pace, since the approximation error is low in the overparametrization regime. In particular, we have, for the linearized regime 177 | \[ 178 | \left\|\widetilde{r}_{\tau}\right\|_{\ell_{2}} \leq\left(1-\eta \alpha^{2}\right)^{\tau}\left\|\Pi_{\mathcal{I}}\left(r_{0}\right)\right\|_{\ell_{2}}+\left\|\Pi_{\mathcal{N}}\left(r_{0}\right)\right\|_{\ell_{2}}. 179 | \] 180 | and if we define $e_{\tau+1}=r_{\tau+1}-\widetilde{r}_{\tau+1}$ then it obeys (assuming small learning rate, in particular $\eta<\beta^2$): 181 | \[ 182 | \left\|e_{\tau+1}\right\|_{\ell_{2}} \leq \eta\left(\epsilon_{0}^{2}+\epsilon \beta\right)\left\|\widetilde{r}_{\tau}\right\|_{\ell_{2}}+\left(1+\eta \epsilon^{2}\right)\left\|e_{\tau}\right\|_{\ell_{2}} 183 | \] 184 | which intuitively means that the error increases by a summand that is of the order of the residual plus a multiplicative expansion with respect to the previous error, due to the nuisance space. However, the rate of increase is small enough so that after $T$ iterations the error will be controlled. Once we have this, we can proceed to the next step. 
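To fill in the step behind the displayed evolution of $\widetilde{r}_\tau$ (a short derivation, not copied verbatim from the paper): since $\nabla \mathcal{L}_{\operatorname{lin}}(\widetilde{\theta})=J^{\top}\left(f_{\operatorname{lin}}(\widetilde{\theta})-y\right)=J^{\top} \widetilde{r}$, the linearized gradient descent step gives
\[
\widetilde{r}_{\tau+1}=f\left(\theta_{0}\right)+J\left(\widetilde{\theta}_{\tau+1}-\bar{\theta}_{0}\right)-y=\widetilde{r}_{\tau}-\eta J J^{\top} \widetilde{r}_{\tau}=\left(I-\eta J J^{\top}\right) \widetilde{r}_{\tau},
\]
and writing $J=U \Lambda V^{\top}$, so that $J J^{\top}=U \Lambda^{2} U^{\top}$, yields $\widetilde{r}_{\tau}=U\left(I-\eta \Lambda^{2}\right)^{\tau} U^{\top} \widetilde{r}_{0}$, which is the formula above with $a=U^{\top} r_{0}$. Splitting the sum over singular values above and below the cutoff $\alpha$, and using $0 \leq 1-\eta \lambda_{s}^{2} \leq 1$ (valid when $\eta \leq 1/\beta^{2}$), gives exactly the stated bound
\[
\left\|\widetilde{r}_{\tau}\right\|_{\ell_{2}} \leq\left(1-\eta \alpha^{2}\right)^{\tau}\left\|\Pi_{\mathcal{I}}\left(r_{0}\right)\right\|_{\ell_{2}}+\left\|\Pi_{\mathcal{N}}\left(r_{0}\right)\right\|_{\ell_{2}}.
\]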
185 | \item Use overparametrization (and bounded perturbation) to prove that in particular one has 186 | \[ 187 | \left\|r_{\tau}-\widetilde{r}_{\tau}\right\|_{\ell_{2}} \leq \frac{3}{5} \frac{\delta \alpha}{\beta}\left\|r_{0}\right\|_{\ell_{2}} \quad \text { and } \quad\left\|\overline{\theta}_{\tau}-\widetilde{\theta}_{\tau}\right\|_{\ell_{2}} \leq \delta \frac{\Gamma}{\alpha}\left\|r_{0}\right\|_{\ell_{2}}, 188 | \] 189 | where $\delta$ is a hyperparameter and $\Gamma$ is another hyperparameter that modulates the total number of time steps (which is chosen to be $T = \frac{\Gamma}{\eta \alpha^2}$). Finally, $\bar{\theta}$ is equal to $\theta\in\R^{p}$ padded with zeros to size $\max(Kn,p)$. 190 | \item Prove that the initial residual is bounded. In the random initialization regime, this will be a property one can prove about the NTK. In the arbitrary initialization regime, it is an assumption. 191 | \item \textbf{Put it all together} to conclude 192 | \[ 193 | \left\|r_{T}\right\|_{\ell_{2}} \leq e^{-\Gamma}\left\|\Pi_{\mathcal{I}}\left(r_{0}\right)\right\|_{\ell_{2}}+\left\|\Pi_{\mathcal{N}}\left(r_{0}\right)\right\|_{\ell_{2}}+\frac{\delta \alpha}{\beta}\left\|r_{0}\right\|_{\ell_{2}}. 194 | \] 195 | 196 | 197 | \end{itemize} 198 | 199 | 200 | 201 | \subsection{Some Proofs} 202 | \label{sec:some-proofs} 203 | 204 | In this section, we give more precise statements and prove some of the 205 | things we talked about in the previous section. 206 | 207 | 208 | We will first make the following two assumptions about the Jacobians 209 | of our non-linear models. We will see that these assumptions hold when 210 | these non-linear models are two-layer neural networks with smooth 211 | activation functions. 212 | 213 | \begin{assump}[$\beta$-Bounded spectrum]\label{assump:assump1} 214 | The non-linear function $f:\reals^p\rightarrow\reals^n$ satisfies 215 | the $\beta$-bounded spectrum assumption when the Jacobian 216 | associated with $f$ satisfies the following for all 217 | $\theta\in\reals^p$: 218 | \begin{equation} 219 | \label{eq:bound-spec-assump} 220 | \norm{\cJ\br{\vec{\theta}}}\le\beta 221 | \end{equation} 222 | \end{assump} 223 | 224 | \begin{assump}[$\br{\epsilon,R,\theta_0}$-bounded Jacobian perturbation]\label{assump:assump2} 225 | The non-linear function $f:\reals^p\rightarrow\reals^n$ satisfies 226 | the $\br{\epsilon,R,\theta_0}$-bounded Jacobian perturbation assumption when 227 | the following is satisfied for all $\theta\in\reals^p$ such that 228 | $\norm{\theta-\theta_0}\le R$: 229 | \begin{equation} 230 | \label{eq:bound-pert-assump} 231 | \norm{\cJ\br{\theta} - \cJ\br{\theta_0}}\le\dfrac{\epsilon}{2} 232 | \end{equation} 233 | \end{assump} 234 | We will be looking at the following meta-theorem. 235 | 236 | \begin{thm}\label{thm:meta-thm1} 237 | Consider a non-linear least squares problem of the form 238 | $\cL\br{\theta} = \frac{1}{2}\norm{f\br{\theta} - y}_2^2$ with 239 | $f:\reals^p\rightarrow\reals^{nK}$ the multi-class non-linear 240 | mapping, $\theta\in\reals^p$ the parameters of the model, and 241 | $\vec{y}\in\reals^{nK}$ the concatenated labels.
Let $\bar{\theta}$ be $\theta$ zero-padded to 242 | size $\max\br{Kn,p}$. Also consider a point 243 | $\theta_0\in\reals^p$ with $\vec{J}$ an $\br{\epsilon_0,\beta}$ 244 | reference Jacobian associated with $\cJ\br{\theta_0}$, and fit the linearized problem 245 | $f_{\mathrm{lin}}\br{\widetilde{\theta}} = f\br{\theta_0} + 246 | \vec{J}\br{\widetilde{\theta} - \bar{\theta}_0}$ via the loss 247 | $\cL_{\mathrm{lin}}\br{\theta} = 248 | \frac{1}{2}\norm{f_{\mathrm{lin}}\br{\theta} - y}_2^2$.\\ 249 | % \\ \noindent\textbf{\textbullet Information and Nuisance 250 | % subspace:} 251 | 252 | Furthermore define the information $\cI$ and nuisance $\cN$ 253 | subspaces and the truncated Jacobian $\vec{J}_{\cI}$ associated with 254 | the reference Jacobian $\vec{J}$ based on a cut-off spectrum value 255 | of $\alpha$.\\ 256 | 257 | Furthermore fix a tolerance 258 | level $0<\delta\le 1$ and a stopping-time parameter $\Gamma\ge 1$, and assume the 259 | Jacobian mapping $\cJ\br{\theta}\in\reals^{nK\times p}$ associated 260 | with $f$ obeys the $\beta$-bounded spectrum assumption~(Assumption \ref{assump:assump1}) 261 | and the $\br{\epsilon,R,\theta_0}$-bounded Jacobian perturbation 262 | assumption~(Assumption \ref{assump:assump2}) for 263 | \begin{equation} 264 | \label{eq:theta_diam} 265 | %\norm{\theta - \theta_0}\le 266 | R := 2\br{\norm{\vec{J}_\cI^\dagger 267 | \vec{r}_0}_2 + 268 | \frac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}} + \delta\frac{\Gamma}{\alpha}\norm{\vec{r}_0}_2} 269 | \end{equation} 270 | and 271 | \begin{equation} 272 | \label{eq:epsilon_upper} 273 | \epsilon\le \dfrac{\delta\alpha^3}{5\Gamma\beta^2} 274 | \end{equation} 275 | %\\ \noindent\textbf{\textbullet Closeness of Reference Jacobian$\vec{J}~\br{\epsilon_0^2}$ and True 276 | % Jacobian~$\cJ\br{\theta}~\br{\epsilon}$ to inital 277 | % Jacobian~$\cJ\br{\theta_0}$}: 278 | Finally assume the following regarding the reference Jacobian: 279 | \begin{equation} 280 | \label{eq:epsilon_zero_upper} 281 | \epsilon_0\le \dfrac{\min\br{\delta\alpha, \sqrt{\frac{\delta\alpha^3}{\Gamma\beta}}}}{5} 282 | \end{equation} 283 | 284 | 285 | We run gradient descent iterations of the form a) original problem: $\theta_{\tau + 1} = 286 | \theta_\tau - \eta\nabla\cL\br{\theta_\tau}$ and 287 | b) linearized problem: $\widetilde{\theta}_{\tau+1} = \widetilde{\theta}_\tau - 288 | \eta\nabla\cL_{\mathrm{lin}}\br{\widetilde{\theta}_\tau}$, starting from $\theta_0$ with step 289 | size $\eta$ obeying $\eta\le \frac{1}{\beta^2}$. 290 | 291 | Then for all iterations $0\le \tau\le T:= \frac{\Gamma}{\eta\alpha^2}$, 292 | the iterates of the original $\br{\theta_\tau}$ and linearized 293 | $\br{\widetilde{\theta}_\tau}$ problems and the corresponding 294 | residuals $\vec{r}_\tau:=f\br{\theta_\tau} - \vec{y}$ and 295 | $\widetilde{\vec{r}}_\tau:=f_{\mathrm{lin}}\br{\widetilde{\theta}_\tau} 296 | - \vec{y}$ closely track each other.
297 | 298 | That is: 299 | \begin{itemize} 300 | \item \textbf{Original and linear residuals are close}: \begin{equation} 301 | \label{eq:residual_close} 302 | \norm{\vec{r}_\tau - \widetilde{\vec{r}}_\tau}\le \dfrac{3}{5}\dfrac{\delta\alpha}{\beta}\norm{\vec{r}_0} 303 | \end{equation} 304 | \item \textbf{Original and linearized parameters are close}: 305 | \begin{equation} 306 | \label{eq:param_close} 307 | \norm{\bar{\theta}_\tau - \widetilde{\theta}_\tau}\le \delta\dfrac{\Gamma}{\alpha}\norm{\vec{r}_0} 308 | \end{equation} 309 | \item \textbf{Original iterates are close to initialization}: Furthermore, for all iterates $0\le \tau\le 310 | T:=\dfrac{\Gamma}{\eta\alpha^2}$, we have that the original parameters 311 | $\theta_\tau$ are close to the initial parameters: 312 | \begin{equation} 313 | \label{eq:final_param_close} 314 | \norm{\theta_\tau - \theta_0}\le \dfrac{R}{2} = 315 | \norm{\vec{J}_{\cI}^\dagger\vec{r}_0}_2 + 316 | \dfrac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}}_2 + \delta\dfrac{\Gamma}{\alpha}\norm{\vec{r}_0}_2 317 | \end{equation} 318 | \item \textbf{Final non-linear residual is bounded}: at $\tau=T$, we have that 319 | \begin{equation} 320 | \label{eq:final_residual} 321 | \norm{\vec{r}_T}_2\le e^{-\Gamma}\norm{\Pi_\cI\br{\vec{r}_0}}_2 + 322 | \norm{\Pi_\cN\br{\vec{r}_0}} + \dfrac{\delta\alpha}{\beta}\norm{\vec{r}_0}_2 323 | \end{equation} 324 | \end{itemize} 325 | \end{thm} 326 | 327 | First, we will show that the difference between the non-linear and the 328 | linear residuals at the $\tau^{\it th}$ time step is of the order of 329 | the difference at the previous time step plus a term linear in the 330 | residual of the linearized problem. Precisely, it is stated as 331 | follows. 332 | 333 | \begin{lem}[Lemma 6.7]\label{lem:pert-one-step} 334 | Assume Assumption~\ref{assump:assump1}~(with $\beta$) and 335 | Assumption~\ref{assump:assump2}~(with $\br{\epsilon,R,\theta_0}$) hold and 336 | $\theta_\tau$ and $\theta_{\tau + 1}$ are within an $R$-neighbourhood 337 | of the initialization $\theta_0$, 338 | i.e. $\norm{\theta_\tau - \theta_0}\le R$ and 339 | $\norm{\theta_{\tau+1}- \theta_0}\le R$. 340 | 341 | Then, on running gradient 342 | descent with $\eta\le \frac{1}{\beta^2}$,
the difference between the 343 | non-linear and the linear residuals $\vec{e}_{\tau+1} = 344 | \vec{r}_{\tau+1} - \widetilde{\vec{r}}_{\tau+1}$ satisfies 345 | 346 | \begin{equation} 347 | \label{eq:growth_of_res_error} 348 | \norm{\vec{e}_{\tau+1}}_2 \le \eta\br{\epsilon_0^2 + 349 | \epsilon\beta}\norm{\widetilde{\vec{r}}_\tau}_2 + \br{1 + \eta\epsilon^2}\norm{\vec{e}_\tau}_2 350 | \end{equation} 351 | \end{lem} 352 | \begin{proof} 353 | Let $\vec{A} = \cJ\br{\theta_0},\vec{B}_2=\cJ\br{\theta_\tau}$ and 354 | \[\vec{B}_1=\cJ\br{\theta_\tau,\theta_{\tau+1}} = 355 | \int_0^1\cJ\br{t\theta_{\tau+1}+\br{1-t}\theta_\tau}dt\] 356 | By a $0^{\it th}$ order Taylor expansion with integral remainder, we can 357 | write 358 | \begin{align*} 359 | f\br{\theta_{\tau+1}} &= f\br{\theta_\tau - 360 | \eta\nabla\cL\br{\theta_\tau}} = f\br{\theta_\tau} - 361 | \eta\vec{B}_1\nabla\cL\br{\vec{\theta_\tau}}\\ 362 | &= f\br{\theta_\tau} - 363 | \eta\vec{B}_1\vec{B}_2^\top\br{f\br{\theta_\tau} 364 | - \vec{y}}\\ 365 | \vec{r}_{\tau+1} = f\br{\theta_{\tau+1}} - \vec{y} &= \br{\vec{I} 366 | - \eta\vec{B}_1\vec{B}_2^\top}\vec{r}_\tau 367 | \end{align*} 368 | For the linear problem, we have 369 | \[\widetilde{\vec{r}}_{\tau+1} = \br{\vec{I} - 370 | \eta\vec{J}\vec{J}^\top}\widetilde{\vec{r}}_\tau\] 371 | Thus 372 | \begin{align*} 373 | \norm{\vec{e}_{\tau+1}} = \norm{{\vec{r}}_{\tau+1} - 374 | \widetilde{\vec{r}}_{\tau+1}} &= \norm{\br{\vec{I} 375 | - 376 | \eta\vec{B}_1\vec{B}_2^\top}\vec{r}_\tau 377 | - \br{\vec{I} - 378 | \eta\vec{J}\vec{J}^\top}\widetilde{\vec{r}}_\tau}\\ 379 | &=\norm{\br{\vec{I} - 380 | \eta\vec{B}_1\vec{B}_2^\top}\vec{e}_\tau 381 | - \eta\br{\vec{B}_1\vec{B}_2^\top 382 | - \vec{J}\vec{J}^\top}\widetilde{\vec{r}}_\tau}\\ 383 | &\le\norm{\br{\vec{I} - 384 | \eta\vec{B}_1\vec{B}_2^\top}\vec{e}_\tau} 385 | + \eta\norm{\br{\vec{B}_1\vec{B}_2^\top 386 | - \vec{J}\vec{J}^\top}}\norm{\widetilde{\vec{r}}_\tau}\\ 387 | \end{align*} 388 | 389 | First, we bound $\norm{\br{\vec{I} - 390 | \eta\vec{B}_1\vec{B}_2^\top}\vec{e}_\tau}$ 391 | using the fact~(Lemma 6.3) that if 392 | $\vec{A},\vec{B}\in\reals^{n\times p}$ 393 | are matrices obeying 394 | $\norm{\vec{A}},\norm{\vec{B}}\le\beta$ and $\norm{\vec{B}-\vec{A}}\le\epsilon$, 395 | then 396 | for all $\vec{z}\in\reals^n$ and $\eta\le\frac{1}{\beta^2}$ 397 | we have that $\norm{\br{\vec{I} - 398 | \eta\vec{A}\vec{B}^\top}\vec{z}}\le\br{1+\eta\epsilon^2}\norm{\vec{z}}_2$; we apply this to the pair $\vec{B}_1,\vec{B}_2$, which satisfies $\norm{\vec{B}_1-\vec{B}_2}\le\epsilon$ by the perturbation assumption. 399 | Next, we bound $\norm{\br{\vec{B}_1\vec{B}_2^\top 400 | - 401 | \vec{J}\vec{J}^\top}}$ 402 | as follows. 403 | \begin{align*} 404 | \norm{\br{\vec{B}_1\vec{B}_2^\top 405 | - 406 | \vec{J}\vec{J}^\top}} &= \norm{\br{\vec{B}_1\vec{B}_2^\top 407 | -\vec{A}\vec{B}_2^\top + 408 | \vec{A}\vec{B}_2^\top 409 | - \vec{A}\vec{A}^\top 410 | + \vec{A}\vec{A}^\top 411 | - \vec{J}\vec{J}^\top}}\\ 412 | &\le \norm{\br{\vec{B}_1 413 | -\vec{A}}\vec{B}_2^\top} + 414 | \norm{\vec{A}\br{\vec{B}_2^\top 415 | - \vec{A}^\top}} 416 | + \norm{\vec{A}\vec{A}^\top 417 | - 418 | \vec{J}\vec{J}^\top}\\ 419 | &\le \beta\dfrac{\epsilon}{2} 420 | + 421 | \beta\dfrac{\epsilon}{2} 422 | + \epsilon_0^2 = \epsilon\beta + \epsilon_0^2 423 | \end{align*} 424 | Combining the two bounds yields the claimed inequality. \end{proof} 425 | 426 | Next, we will prove a lemma that will finally allow us to control the 427 | growth of the difference between the linear and the non-linear 428 | residuals. 429 | 430 | \begin{lem}[Lemma 6.8]\label{eq:growth-pert-lemma} 431 | Consider positive scalars $\Gamma,\alpha,\epsilon,\eta,\Theta>0$.
Also 432 | assume $\eta\le\frac{1}{\alpha^2}$ and 433 | $\alpha\ge\sqrt{2\Gamma}\epsilon$ and set 434 | $T=\frac{\Gamma}{\eta\alpha^2}$. For $0\le 435 | \tau\le T$ and non-negative scalars $\rho_-,\rho_+\ge 0$, assume that the scalar sequences 436 | $e_\tau$ and $\widetilde{r}_\tau$ obey the following: 437 | \begin{itemize} 438 | \item $e_0 = 0$ 439 | \item $\widetilde{r}_\tau\le\br{1 - \eta\alpha^2}^\tau \rho_+ + 440 | \rho_-$ 441 | \item $e_\tau\le \br{1 + \eta\epsilon^2}e_{\tau -1} + 442 | \eta\Theta\widetilde{r}_{\tau - 1}$ 443 | \end{itemize} 444 | 445 | Let $\Lambda = \dfrac{2\br{\Gamma\rho_- + \rho_+}}{\alpha^2}$. Then for all $0\le\tau\le T$, the following holds 446 | \[e_\tau\le \Theta\Lambda\] 447 | \end{lem} 448 | \begin{proof} 449 | We will prove this by induction. Note that $e_0=0$ satisfies the 450 | base case. Suppose $e_{t}\le\Theta\Lambda$ holds for all 451 | $t <\tau$. 452 | 453 | Then for all $0< t\le \tau$, 454 | \begin{align*} 455 | e_{t}&\le \br{1 + \eta\epsilon^2}e_{t - 456 | 1}+\eta\Theta\widetilde{r}_{t-1}\\ 457 | &\le e_{t - 1} + \eta\epsilon^2e_{t-1} + 458 | \eta\Theta\br{\br{1-\eta\alpha^2}^{t-1}\rho_+ + 459 | \rho_-}\\ 460 | &\le e_{t-1} + \eta\Theta\br{\epsilon^2\Lambda + \br{1-\eta\alpha^2}^{t-1}\rho_+ + 461 | \rho_- }\\ 462 | \dfrac{ e_{t} - 463 | e_{t-1}}{\Theta}&\le \eta\br{\epsilon^2\Lambda + \br{1 - 464 | \eta\alpha^2}^{t-1}\rho_++\rho_-}\\ 465 | \dfrac{ e_{\tau}}{\Theta} = \sum_{t=1}^\tau \dfrac{ e_{t} - 466 | e_{t-1}}{\Theta} &\le \eta\tau\br{\epsilon^2\Lambda + \rho_-}+ \eta\rho_+\sum_{t=1}^{\tau}{\br{1 - 467 | \eta\alpha^2}^{t-1}}\\ 468 | &=\eta\tau\br{\epsilon^2\Lambda + \rho_-}+ \eta\rho_+\dfrac{1 469 | - \br{1 - 470 | \eta\alpha^2}^{\tau}}{\eta\alpha^2}\\ 471 | &\le\eta T\br{\epsilon^2\Lambda + \rho_-}+ \dfrac{\rho_+}{\alpha^2}\\ 472 | &\le \dfrac{\Gamma\epsilon^2\Lambda + 473 | \Gamma\rho_-}{\alpha^2}+ \dfrac{\rho_+}{\alpha^2}\\ 474 | &= \dfrac{\Gamma\epsilon^2\Lambda}{\alpha^2}+ 475 | \dfrac{\Lambda}{2}\\ 476 | &\le \dfrac{\Lambda}{2} + \dfrac{\Lambda}{2} = \Lambda &&\because \alpha \ge \sqrt{2\Gamma}\epsilon 477 | \end{align*} 478 | \end{proof} 479 | 480 | We will combine this to provide a rough proof 481 | of~\eqref{eq:residual_close} using induction. 482 | 483 | \begin{proof}[Proof of Theorem~\ref{thm:meta-thm1}] We will prove this 484 | by induction. We will only provide a rough proof sketch to keep it 485 | simple and ignore the computations. We will assume that for all 486 | $0\le t\le \tau$ the induction hypothesis holds true, 487 | i.e. $\norm{\theta_0 - \theta_t}\le R$ with $R$ as in~\eqref{eq:theta_diam}, and that 488 | \eqref{eq:residual_close},~\eqref{eq:param_close},~\eqref{eq:final_param_close} 489 | and~\eqref{eq:final_residual} hold true. We will show that they all 490 | hold true for $t=\tau+1$. 491 | \begin{itemize} 492 | \item \textbf{Proving $\norm{\theta_0 - \theta_t}\le R$ for 493 | $t=\tau+1$:} 494 | We know by~\eqref{eq:final_param_close} that $\norm{\theta_0 - 495 | \theta_\tau}\le \frac{R}{2}$. We need to show that 496 | $\norm{\theta_\tau - \theta_{\tau+1}}\le\frac{R}{2}$.
497 | \begin{align*} 498 | \norm{\theta_\tau - \theta_{\tau+1}} &= 499 | \eta\norm{\nabla\cL\br{\theta_\tau}} 500 | = 501 | \eta\norm{\cJ^\top\br{\theta_\tau}\vec{r}_\tau}\\ 502 | &\le 503 | \eta\norm{\vec{J}^\top\widetilde{\vec{r}}_\tau} 504 | + 505 | \eta\norm{\br{\cJ\br{\theta_\tau} 506 | - 507 | \vec{J}}^\top}\norm{\widetilde{\vec{r}}_\tau} 508 | + 509 | \eta\norm{\cJ\br{\theta_\tau}}\norm{\widetilde{\vec{r}}_\tau 510 | - \vec{r}_\tau}\\ 511 | \end{align*} 512 | 513 | We can bound the first term (see Page 25) as \[ 514 | \eta\norm{\vec{J}^\top\widetilde{\vec{r}}_\tau}\le \norm{\vec{J}_\cI^\dagger 515 | \vec{r}_0}_2 + 516 | \frac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}}, \]the second 517 | term as 518 | \[ \eta\norm{\br{\cJ\br{\theta_\tau}- 519 | \vec{J}}^\top}\norm{\widetilde{\vec{r}}_\tau}\le 520 | \eta\br{\norm{\cJ\br{\theta_\tau}- \cJ\br{\theta_0}}+\norm{\overline{\cJ}\br{\theta_0}- 521 | \vec{J}}}\norm{\widetilde{\vec{r}}_0}\le\eta\br{\epsilon+\epsilon_0}\norm{\widetilde{\vec{r}}_0}\le\dfrac{2\delta\alpha}{5\beta^2}\norm{\widetilde{\vec{r}}_0},\] 522 | and the third term as~(using Eq.~\eqref{eq:residual_close}) 523 | 524 | \[\eta\norm{\cJ\br{\theta_\tau}}\norm{\widetilde{\vec{r}}_\tau 525 | - \vec{r}_\tau} \le 526 | \dfrac{3\delta\alpha}{5\beta^2}\norm{\widetilde{\vec{r}}_0}.\] 527 | 528 | Combining them we get 529 | \[ \norm{\theta_\tau - \theta_{\tau+1}} \le 530 | \norm{\vec{J}_\cI^\dagger 531 | \vec{r}_0}_2 + 532 | \frac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}} + 533 | \dfrac{\delta\alpha}{\beta^2}\norm{\widetilde{\vec{r}}_0} \le 534 | \dfrac{R}{2} \] 535 | \item \textbf{Proving that $\norm{\vec{e}_{\tau+1}}\le \dfrac{3}{5}\dfrac{\delta\alpha}{\beta}\norm{\vec{r}_0}$:} 536 | We have shown that $\norm{\theta_t - \theta_0}\le R$, for 537 | $t\le \tau+1$. Then we can 538 | use Lemma~\ref{lem:pert-one-step} to say that for all 539 | $0< t\le \tau+1$ the following holds 540 | \[\norm{\vec{e}_t}\le \eta\br{\epsilon_0^2 + 541 | \epsilon\beta}\norm{\widetilde{\vec{r}}_{t-1}} + 542 | \br{1+\eta\epsilon^2}\norm{\vec{e}_{t-1}}_2\] 543 | 544 | We already know that the linear residuals satisfy the following 545 | for all $00$, consider an i.i.d. dataset 602 | $\bc{\br{\vec{x}_i,y_i}}\in\reals^d\times\reals^K$ where $\vec{x}_i$ 603 | are unit-length data points and $\vec{y}_i$s are one-hot encoded 604 | labels. 605 | 606 | Consider the neural network to be initialized with $\vec{W}_0\sim 607 | \cN\br{0,\vec{I}}$ and let $\vec{V}$ have properly scaled Rademacher 608 | entries.
609 | 610 | Consider the reference Jacobian, with information and nuisance 611 | subspaces split according to the cut-off $\alpha$, to be 612 | $\vec{J}=\Sigma\br{\vec{X}}^{\nicefrac{1}{2}}$ 613 | where \[\Sigma\br{\vec{X}} = 614 | \vec{I}_K\otimes\bE\bs{\br{\phi^\prime\br{\vec{X}\vec{w}}\phi^\prime\br{\vec{X}\vec{w}}^\top}\odot\br{\vec{X}\vec{X}^\top}}\] 615 | 616 | Assume the overparameterization to be \[k\ge\dfrac{\Gamma^4\log 617 | n}{\alpha^8}\] 618 | 619 | Then after $T=\dfrac{\Gamma}{\eta\alpha^2}$ iterations, the 620 | generalization error obeys 621 | \[\mathrm{Err}\br{\vec{W}_T}\le 622 | \dfrac{\norm{\Pi_\cN\br{\vec{y}}}_2}{\sqrt{n}} + e^{-\Gamma} +\dfrac{\Gamma}{\alpha\sqrt{n}}\] 623 | \end{thm} 624 | \begin{lem} 625 | For a neural network as defined above, where the activation function 626 | $\phi$ is such that $\abs{\phi^\prime\br{\vec{z}}}\le B$ and 627 | $\abs{\phi^{\prime\prime}\br{\vec{z}}}\le B$ for all $\vec{z}$, and $K$ 628 | is the number of classes, for all $\vec{W}\in\reals^{k\times 629 | d}$ we have \[\norm{\cJ\br{\vec{W}}}\le 630 | B\sqrt{Kk}\norm{\vec{V}}_\infty\norm{\vec{X}}\] 631 | and, if all data points have unit norm, i.e. $\norm{\vec{x}_i} = 1$, 632 | then the Jacobian is Lipschitz with respect to the spectral norm: for 633 | all $\vec{W},\widetilde{\vec{W}}\in\reals^{k\times d}$, 634 | \[\norm{\cJ\br{\vec{W}} - \cJ\br{\widetilde{\vec{W}}}}\le B\sqrt{K}\norm{\vec{V}}_\infty\norm{\vec{X}}\norm{\vec{W} - \widetilde{\vec{W}}}\] 635 | \end{lem} 636 | 637 | \begin{proof} 638 | Given two matrices $\vec{A} = 639 | \bs{\vec{A}_1^\top,\cdots,\vec{A}_K^\top}$ and $\vec{B} = 640 | \bs{\vec{B}_1^\top,\cdots,\vec{B}_K^\top}$, the following holds: 641 | \[\norm{\vec{A}}\le\sqrt{K}\sup_{\ell=1,..,K}\norm{\vec{A}_\ell}\text{ 642 | \enskip and\enskip}\norm{\vec{A} - 643 | \vec{B}}\le\sqrt{K}\sup_{\ell=1,..,K}\norm{\vec{A}_\ell-\vec{B}_\ell}\] 644 | 645 | We will first show that for a single-output neural network, i.e. for 646 | $K=1$, we have $\norm{\cJ\br{\vec{W}}}\le 647 | B\sqrt{k}\norm{\vec{V}}_\infty\norm{\vec{X}}$: 648 | 649 | \begin{align*} 650 | \cJ\br{\vec{W}}\cJ^\top\br{\vec{W}} &= 651 | \br{\phi^\prime\br{\vec{X}\vec{W}^\top}\diag{\vec{v}}\diag{\vec{v}}\phi^\prime\br{\vec{W}\vec{X}^\top}}\odot\br{\vec{X}\vec{X}^\top}\\ 652 | \norm{ \cJ\br{\vec{W}}}^2 &\le 653 | \br{\max_i\norm{\diag{\vec{v}}\phi^\prime\br{\vec{W}\vec{x}_i}}^2}\norm{\vec{X}}_2^2\\ 654 | &\le kB^2\norm{\vec{v}}_\infty^2\norm{\vec{X}}_2^2 655 | \end{align*} 656 | Thus for multi-output neural networks, we have that 657 | \[\norm{ \cJ\br{\vec{W}}}\le 658 | B\sqrt{Kk}\norm{\vec{V}}_\infty\norm{\vec{X}}_2 \] 659 | 660 | We omit the proof of Lipschitzness but will cover it if time permits. 661 | \end{proof} 662 | With this, we can apply the meta-theorem directly to multi-output 663 | neural networks by taking (the square root of) the NTK to be the reference Jacobian. One 664 | can prove that it satisfies the conditions required of a reference 665 | Jacobian, but we omit the proof for simplicity and might discuss 666 | it if time permits. 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | \section{Experiments} 675 | 676 | The authors use very recent methods to approximate the spectrum of $\mathcal{J}(\theta_\tau)\mathcal{J}^\top(\theta_\tau)$. They perform experiments with ResNet20 on CIFAR-10 and MNIST. 677 | 678 | \begin{itemize} 679 | \item The values of the top eigenvalues increase significantly when comparing the Jacobian at initialization with the Jacobian after training.
In general, it is observed that the Jacobian is approximately low-rank, in the sense that it has a small set of large eigenvalues while the rest are fairly small. This fits naturally with their theory, making it possible to set a good cutoff for their bounds. 680 | \item They plot the norm of the projection of the residual onto the information and the nuisance spaces and observe that, as predicted by the theory, the projection onto the information space decreases much more rapidly than the other one. Note that if one trains with some corrupted labels, and the residual corresponding to the noisy labels falls mostly into the nuisance space while, analogously, the residual of the uncorrupted data falls mostly into the information space, then the neural network fits the data that conveys information much faster; this has implications for generalization under early stopping. 681 | \item They measure the norm of the projection of the labels and of the residual at initialization, but using the information and nuisance spaces of two Jacobians: the one given by the initialization and the one given by the trained network. For both the labels and the initial residual, the majority of the projection lies in the nuisance space for the Jacobian at initialization, but for the trained Jacobian the converse happens: the projected norm onto the information space is significantly bigger than the projection onto the nuisance space. The authors argue that this adaptation would in principle suggest better generalization according to their theory (possibly using the arbitrary initialization theorem and initializing to the value of the Jacobian a few iterations before stopping). It would also suggest that this adaptation of the Jacobian speeds up training. They also performed experiments with corrupted labels and saw that the projection onto the information space after training is not that large in that case. Moreover, the normalized projection of the labels onto the nuisance space correlates with the test error.
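As a rough illustration of the quantity being measured here (my own toy sketch with arbitrary sizes, not the authors' code, which relies on approximate spectral methods for large Jacobians): given a Jacobian matrix and a cut-off $\alpha$, the information/nuisance projections of a vector such as the labels or a residual can be read off from the left singular vectors.
\begin{lstlisting}[language=Python]
# Illustrative sketch: split a vector into its projections onto the information
# space (left singular directions of J with singular value >= alpha) and the
# nuisance space (the orthogonal complement), and report the two norms.
import numpy as np

def info_nuisance_norms(J, y, alpha):
    U, s, _ = np.linalg.svd(J, full_matrices=True)   # J = U diag(s) V^T
    info_dim = int(np.sum(s >= alpha))               # number of "large" directions
    coeffs = U.T @ y                                 # coordinates of y in basis U
    return np.linalg.norm(coeffs[:info_dim]), np.linalg.norm(coeffs[info_dim:])

rng = np.random.default_rng(2)
J = rng.normal(size=(10, 40))                        # stand-in for a Jacobian
y = rng.normal(size=10)                              # stand-in for labels/residual
print(info_nuisance_norms(J, y, alpha=5.0))
\end{lstlisting}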
682 | \end{itemize} 683 | 684 | \nocite{*} % Include refs not cited 685 | \bibliography{refs} %use a bibtex bibliography file refs.bib 686 | \bibliographystyle{plain} %use the plain bibliography style 687 | 688 | \end{document} 689 | 690 | 691 | %%% Local Variables: 692 | %%% mode: latex 693 | %%% TeX-master: t 694 | %%% End: 695 | -------------------------------------------------------------------------------- /notes/low_rank_jac/refs.bib: -------------------------------------------------------------------------------- 1 | @article{du2018gradient, 2 | title={Gradient descent finds global minima of deep neural networks}, 3 | author={Du, Simon S and Lee, Jason D and Li, Haochuan and Wang, Liwei and Zhai, Xiyu}, 4 | journal={arXiv preprint arXiv:1811.03804}, 5 | year={2018} 6 | } 7 | 8 | @article{allen2018convergence, 9 | title={A convergence theory for deep learning via over-parameterization}, 10 | author={Allen-Zhu, Zeyuan and Li, Yuanzhi and Song, Zhao}, 11 | journal={arXiv preprint arXiv:1811.03962}, 12 | year={2018} 13 | } 14 | -------------------------------------------------------------------------------- /notes/low_rank_jac_thm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/low_rank_jac_thm.pdf -------------------------------------------------------------------------------- /papers/1805.00915.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1805.00915.pdf -------------------------------------------------------------------------------- /papers/1806.07572.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1806.07572.pdf -------------------------------------------------------------------------------- /papers/1808.09372.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1808.09372.pdf -------------------------------------------------------------------------------- /papers/1810.02054.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1810.02054.pdf -------------------------------------------------------------------------------- /papers/1810.09665.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1810.09665.pdf -------------------------------------------------------------------------------- /papers/1810.12065.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1810.12065.pdf -------------------------------------------------------------------------------- /papers/1811.03804.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.03804.pdf -------------------------------------------------------------------------------- /papers/1811.03962.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.03962.pdf -------------------------------------------------------------------------------- /papers/1811.04918.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.04918.pdf -------------------------------------------------------------------------------- /papers/1811.08888.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.08888.pdf -------------------------------------------------------------------------------- /papers/1812.07956.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1812.07956.pdf -------------------------------------------------------------------------------- /papers/1812.10004.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1812.10004.pdf -------------------------------------------------------------------------------- /papers/1901.08572.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1901.08572.pdf -------------------------------------------------------------------------------- /papers/1901.08584.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1901.08584.pdf -------------------------------------------------------------------------------- /papers/1902.01384.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1902.01384.pdf -------------------------------------------------------------------------------- /papers/1902.04760.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1902.04760.pdf -------------------------------------------------------------------------------- /papers/1902.06720.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1902.06720.pdf -------------------------------------------------------------------------------- /papers/1904.11955.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1904.11955.pdf -------------------------------------------------------------------------------- /papers/1905.03684.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.03684.pdf -------------------------------------------------------------------------------- /papers/1905.05095.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.05095.pdf -------------------------------------------------------------------------------- /papers/1905.10337.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.10337.pdf -------------------------------------------------------------------------------- /papers/1905.10843.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.10843.pdf -------------------------------------------------------------------------------- /papers/1905.12173.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.12173.pdf -------------------------------------------------------------------------------- /papers/1905.13210.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.13210.pdf -------------------------------------------------------------------------------- /papers/1905.13654.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.13654.pdf -------------------------------------------------------------------------------- /papers/1906.01930.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.01930.pdf -------------------------------------------------------------------------------- /papers/1906.05392.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.05392.pdf -------------------------------------------------------------------------------- /papers/1906.05827.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.05827.pdf -------------------------------------------------------------------------------- /papers/1906.06247.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.06247.pdf -------------------------------------------------------------------------------- /papers/1906.06321.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.06321.pdf -------------------------------------------------------------------------------- /papers/1906.08034.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.08034.pdf -------------------------------------------------------------------------------- /papers/1911.00809.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1911.00809.pdf -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | This is a list of papers that use the Neural Tangent Kernel (NTK). In each category, papers are sorted chronologically. Some of these papers were presented in the NTK reading group during the summer of 2019 at the University of Oxford. 2 | 3 | We used [hypothes.is](https://web.hypothes.is/) to some extent, see [this](https://via.hypothes.is/https://arxiv.org/pdf/1806.07572.pdf) for instance. There are notes for a few of the papers, which you can find linked below the relevant papers. 4 | 5 | ## Schedule 6 | + 2/08/2019 [[notes](./notes/Neural_Tangent_kernels___Jacot_et_al.pdf)] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. 7 | + 9/08/2019 [[notes](./notes/du_et_al.pdf)] Gradient Descent Finds Global Minima of Deep Neural Networks. 8 | + 16/08/2019 Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks + insights from Gradient Descent Provably Optimizes Over-parameterized Neural Networks. 9 | + 23/08/2019 On Lazy Training in Differentiable Programming 10 | + 13/09/2019 Generalization bounds of stochastic gradient descent for wide and deep networks 11 | + 18/10/2019 [[notes](./notes/low_rank_jac_thm.pdf)] Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian 12 | 13 | # Neural tangent kernel 14 | 15 | [https://www.youtube.com/watch?v=NGon2JyjO6Y]: # 16 | + [Recent Developments in Over-parametrized Neural Networks, Part II](https://www.youtube.com/watch?v=NGon2JyjO6Y) 17 | + Interesting, nice overview of a few things, mostly related to optimization and the NTK. 18 | + YouTube, Simons Institute workshop. 19 | + Part I is interesting, but take into account that it is about other optimization topics for NNs, not about the NTK. 20 | 21 | ## Optimization 22 | 23 | ### Infinite limit 24 | 25 | [https://arxiv.org/pdf/1806.07572.pdf ]: # 26 | + [Neural Tangent Kernel: Convergence and Generalization in Neural Networks ](./papers/1806.07572.pdf) -- [link](https://arxiv.org/pdf/1806.07572.pdf) 27 | + [Notes](./notes/Neural_Tangent_kernels___Jacot_et_al.pdf) 28 | + 06/2018 29 | + Original NTK paper. 30 | + Exposes the idea of the NTK for the first time, although the proof that the kernel in the limit is deterministic is done by taking the number of neurons of each layer to infinity, layer by layer, sequentially. 31 | + It proves positive definiteness of the kernel for certain regimes, thus proving that you can optimize to reach a global minimum at a linear rate. 32 | 33 | [https://arxiv.org/pdf/1902.06720.pdf]: # 34 | + [Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent](./papers/1902.06720.pdf) -- [link](https://arxiv.org/pdf/1902.06720.pdf) 35 | + 02/2019 36 | + They apparently prove that a finite learning rate is enough for the model to follow NTK dynamics in the infinite-width limit. 37 | + Experiments 38 | 39 | 40 | [https://arxiv.org/pdf/1904.11955.pdf]: # 41 | + [On Exact Computation with an Infinitely Wide Neural Net](./papers/1904.11955.pdf) -- [link](https://arxiv.org/pdf/1904.11955.pdf) 42 | + 04/2019 43 | + Shows that NTKs work somewhat worse than NNs, but not as much worse as previous work suggested.
+ Claims to show a proof that sounds similar to those of Allen-Zhu, Du etc., but not sure what the difference is. 45 | 46 | 47 | ### Finite results 48 | 49 | [https://arxiv.org/abs/1810.02054]: # 50 | + [Gradient Descent Provably Optimizes Over-parameterized Neural Networks](./papers/1810.02054.pdf) -- [link](https://arxiv.org/abs/1810.02054) 51 | + 04/10/2018 52 | + A preliminary result of Gradient Descent Finds Global Minima of Deep Neural Networks (below), but only for two-layer neural networks. 53 | 54 | [https://arxiv.org/abs/1810.12065]: # 55 | + [On the Convergence Rate of Training Recurrent Neural Networks](./papers/1810.12065.pdf) -- [link](https://arxiv.org/abs/1810.12065) 56 | + 29/10/2018 57 | + See below 58 | 59 | [https://arxiv.org/pdf/1811.03962.pdf]: # 60 | + [A Convergence Theory for Deep Learning via Over-Parameterization](./papers/1811.03962.pdf) -- [link](https://arxiv.org/pdf/1811.03962.pdf) 61 | + 9/11/2018 62 | + Simplification of [On the Convergence Rate of Training Recurrent Neural Networks](./papers/1810.12065.pdf). 63 | + Convergence to global optima whp for GD and SGD. 64 | + Works for \ell_2, cross-entropy and other losses. 65 | + Works for fully connected networks, ResNets, ConvNets (and RNNs, in the paper above). 66 | 67 | 68 | [https://arxiv.org/pdf/1811.03804.pdf]: # 69 | + [Gradient Descent Finds Global Minima of Deep Neural Networks](./papers/1811.03804.pdf) -- [link](https://arxiv.org/pdf/1811.03804.pdf) 70 | + [Notes](./notes/du_et_al.pdf) 71 | + 9/11/2018 72 | + Du et al. 73 | + Convergence to global optima whp for GD for \ell_2. 74 | + Exponential width w.r.t. depth needed for fully connected networks. Polynomial for ResNets. 75 | 76 | [https://arxiv.org/pdf/1901.08572.pdf]: # 77 | + [Width Provably Matters in Optimization for Deep Linear Neural Networks](./papers/1901.08572.pdf) -- [link](https://arxiv.org/pdf/1901.08572.pdf) 78 | + 01/2019 79 | + Du et al. 80 | + Deep linear neural networks. 81 | + Convergence to global minima if low polynomial width is assumed. 82 | 83 | [https://arxiv.org/pdf/1811.08888.pdf]: # 84 | + [Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks](./papers/1811.08888.pdf) -- [link](https://arxiv.org/pdf/1811.08888.pdf) 85 | + 21/11/2018 86 | 87 | [https://arxiv.org/pdf/1812.10004.pdf]: # 88 | + [Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?](./papers/1812.10004.pdf) -- [link](https://arxiv.org/pdf/1812.10004.pdf) 89 | + 25/11/2018 90 | + Results for one-hidden-layer NNs, generalized linear models and low-rank matrix regression. 91 | 92 | [https://arxiv.org/abs/1905.13654.pdf]: # 93 | + [Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel](./papers/1905.13654.pdf) -- [link](https://arxiv.org/abs/1905.13654.pdf) 94 | + 06/2019 95 | + SGD analyzed from the point of view of stochastic differential equations. 96 | 97 | 98 | ### Lazy training 99 | 100 | [https://arxiv.org/pdf/1812.07956.pdf ]: # 101 | + [On Lazy Training in Differentiable Programming](./papers/1812.07956.pdf) -- [link](https://arxiv.org/pdf/1812.07956.pdf) 102 | + 12/2018 103 | + They show that the NTK regime can be controlled by rescaling the model, and show (experimentally) that neural nets in practice perform better than those in the lazy regime. 104 | + Also, this seems to be independent of width. So scaling the model is a much easier way to get to lazy training than the infinite width + infinitesimal learning rate route?
105 | 106 | [https://arxiv.org/pdf/1906.08034.pdf]: # 107 | + [Disentangling feature and lazy learning in deep neural networks: an empirical study](./papers/1906.08034.pdf) -- [link](https://arxiv.org/pdf/1906.08034.pdf) 108 | + 06/2019 109 | + Similar to the above (Chizat et al.), but more experimental. 110 | 111 | [https://arxiv.org/pdf/1906.05827.pdf]: # 112 | + [Kernel and deep regimes in overparametrized models](./papers/1906.05827.pdf) -- [link](https://arxiv.org/pdf/1906.05827.pdf) 113 | + 06/2019 114 | + Large initialization leads to the kernel/lazy regime. 115 | + Small initialization leads to the deep/active/adaptive regime, which can sometimes lead to better generalization. They claim this is the regime that allows one to "exploit the power of depth", and thus is key to understanding deep learning. 116 | + The systems they analyze in detail are rather simple (like matrix completion) or artificial (like a very ad-hoc type of neural network). 117 | 118 | ## Generalization 119 | 120 | [https://arxiv.org/pdf/1811.04918.pdf]: # 121 | + [Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers](./papers/1811.04918.pdf) -- [link](https://arxiv.org/pdf/1811.04918.pdf) 122 | + 11/2018 123 | + The theorems are not based on NTKs, but it has experiments showing that generalization for 3-layer NNs is better than for their corresponding NTK. 124 | 125 | [https://arxiv.org/pdf/1901.08584.pdf]: # 126 | + [Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks](./papers/1901.08584.pdf) -- [link](https://arxiv.org/pdf/1901.08584.pdf) 127 | + 01/2019 128 | + Arora et al. 129 | + "Our work is related to kernel methods, especially recent discoveries of the connection between deep 130 | learning and kernels (Jacot et al., 2018; Chizat & Bach, 2018b;...) Our analysis utilized several properties of a related kernel from the ReLU activation." 131 | 132 | [https://arxiv.org/pdf/1902.01384.pdf]: # 133 | + [Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks](./papers/1902.01384.pdf) -- [link](https://arxiv.org/pdf/1902.01384.pdf) 134 | + 02/2019 135 | + See below 136 | 137 | [https://arxiv.org/pdf/1905.13210.pdf]: # 138 | + [Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks](./papers/1905.13210.pdf) -- [link](https://arxiv.org/pdf/1905.13210.pdf) 139 | + 05/2019 140 | + Seems very similar to the one above. What are the differences? Just that this is SGD vs GD in the above paper? 141 | + Improves on the Arora 2019 paper showing generalization bounds for the NTK. 142 | + I’d be interested in understanding the connection of their bound to classical margin and PAC-Bayes bounds for kernel regression. 143 | + They don’t show any plots demonstrating how good their bounds are, which probably means they are vacuous though... 144 | 145 | 146 | [https://arxiv.org/pdf/1905.10337.pdf]: # 147 | + [What Can ResNet Learn Efficiently, Going Beyond Kernels?](./papers/1905.10337.pdf) -- [link](https://arxiv.org/pdf/1905.10337.pdf) 148 | + 05/2019 149 | + Shows in the PAC setting that there are ("simple") functions that ResNets learn efficiently such that any kernel gets much greater test error for the same sample complexity; in particular, this applies to NTKs too. 150 | 151 | [https://arxiv.org/pdf/1905.10843.pdf]: # 152 | + [Asymptotic learning curves of kernel methods: empirical data v.s.
Teacher-Student paradigm](./papers/1905.10843.pdf) -- [link](https://arxiv.org/pdf/1905.10843.pdf) 153 | + 05/2019 154 | + I think that getting learning curves for neural nets is a very interesting challenge. 155 | + Here they do it for kernels, but if the NN behaves like a kernel, it would be relevant. 156 | 157 | [https://arxiv.org/pdf/1906.05392.pdf]: # 158 | + [Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian](./papers/1906.05392.pdf) -- [link](https://arxiv.org/pdf/1906.05392.pdf) 159 | + 06/2019 160 | + [Notes](./notes/low_rank_jac_thm.pdf) 161 | + Mainly uses the NTK and splits the eigenspace into two (based on a cutoff value of the eigenvalues). The projection of the residuals onto the top eigenspace trains very fast, while the rest might not train at all and the loss could even increase. There is a trade-off based on the cutoff value. 162 | + Two layers. 163 | + \ell\_2 loss. 164 | 165 | ## Others 166 | 167 | [https://arxiv.org/pdf/1902.04760.pdf]: # 168 | + [Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation](./papers/1902.04760.pdf) -- [link](https://arxiv.org/pdf/1902.04760.pdf) 169 | + 02/2019 170 | + Although this paper is really cool in that it shows that most kinds of neural networks become GPs when infinitely wide, w.r.t. the NTK it just shows a proof where the layer widths can go to infinity at the same time, and generalizes it to more architectures, so it doesn’t necessarily feel like much new insight? 171 | 172 | [https://arxiv.org/pdf/1905.12173.pdf]: # 173 | + [On the Inductive Bias of Neural Tangent Kernels](./papers/1905.12173.pdf) -- [link](https://arxiv.org/pdf/1905.12173.pdf) 174 | + 05/2019 175 | + This is just about properties of the NTK (so not studying NNs directly). 176 | + They find that the NTK model has a different type of stability to deformations of the input than other NNGPs, and better approximation properties (whatever that means). 177 | 178 | [https://arxiv.org/pdf/1906.01930.pdf]: # 179 | + [Approximate Inference Turns Deep Networks into Gaussian Processes](./papers/1906.01930.pdf) -- [link](https://arxiv.org/pdf/1906.01930.pdf) 180 | + 06/2019 181 | + Shows Bayesian NNs (of any width) are equivalent to GPs, surprisingly with kernel given by the NTK. 182 | 183 | # ToClassify 184 | 185 | [https://arxiv.org/pdf/1905.05095.pdf]: # 186 | + [Spectral Analysis of Kernel and Neural Embeddings: Optimization and Generalization](./papers/1905.05095.pdf) -- [link](https://arxiv.org/pdf/1905.05095.pdf) 187 | + 05/2019 188 | + They just study what happens when you use a neural network or a kernel representation for data (fed as input to an NN, I guess). 189 | 190 | [https://arxiv.org/pdf/1808.09372.pdf]: # 191 | + [Mean Field Analysis of Neural Networks: A Central Limit Theorem](./papers/1808.09372.pdf) -- [link](https://arxiv.org/pdf/1808.09372.pdf) 192 | + 08/2018 193 | + They only look at one hidden layer and squared-error loss, so I’m not convinced of the novelty of the results? 194 | 195 | [https://arxiv.org/pdf/1906.06321.pdf]: # 196 | + [Provably Efficient $Q$-learning with Function Approximation via Distribution Shift Error Checking Oracle](./papers/1906.06321.pdf) -- [link](https://arxiv.org/pdf/1906.06321.pdf) 197 | + 06/2019 198 | + Not about the NTK, but the authors suggest it could be extended to use the NTK to analyze NN-based function approximation.
199 | 200 | [https://arxiv.org/pdf/1911.00809.pdf]: # 201 | 202 | + [Enhanced Convolutional Neural Tangent Kernels](./papers/1911.00809.pdf) -- [link](https://arxiv.org/pdf/1911.00809.pdf) 203 | + 11/2019 204 | + Enhances the NTK for convolutional networks of "On Exact Computation..." by adding some implicit data augmentation to the kernel that encodes some kind of local translation invariance and horizontal flipping. 205 | + They have experiments that show good empirical performance; in particular, they get 89% accuracy on CIFAR-10, matching AlexNet. This is the first time a kernel achieves such results. 206 | 207 | 208 | # Some notes 209 | 210 | + The NTK depends on the initialization. 211 | 212 | --------------------------------------------------------------------------------