├── notes ├── Neural Tangent kernels - Jacot et al.zip ├── Neural_Tangent_kernels___Jacot_et_al.pdf ├── du_et_al.pdf ├── du_et_al.tex ├── low_rank_jac │ ├── .gitignore │ ├── Makefile │ ├── amartya_ltx.sty │ ├── low_rank_jac_thm.tex │ └── refs.bib └── low_rank_jac_thm.pdf ├── papers ├── 1805.00915.pdf ├── 1806.07572.pdf ├── 1808.09372.pdf ├── 1810.02054.pdf ├── 1810.09665.pdf ├── 1810.12065.pdf ├── 1811.03804.pdf ├── 1811.03962.pdf ├── 1811.04918.pdf ├── 1811.08888.pdf ├── 1812.07956.pdf ├── 1812.10004.pdf ├── 1901.08572.pdf ├── 1901.08584.pdf ├── 1902.01384.pdf ├── 1902.04760.pdf ├── 1902.06720.pdf ├── 1904.11955.pdf ├── 1905.03684.pdf ├── 1905.05095.pdf ├── 1905.10337.pdf ├── 1905.10843.pdf ├── 1905.12173.pdf ├── 1905.13210.pdf ├── 1905.13654.pdf ├── 1906.01930.pdf ├── 1906.05392.pdf ├── 1906.05827.pdf ├── 1906.06247.pdf ├── 1906.06321.pdf ├── 1906.08034.pdf └── 1911.00809.pdf └── readme.md /notes/Neural Tangent kernels - Jacot et al.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/Neural Tangent kernels - Jacot et al.zip -------------------------------------------------------------------------------- /notes/Neural_Tangent_kernels___Jacot_et_al.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/Neural_Tangent_kernels___Jacot_et_al.pdf -------------------------------------------------------------------------------- /notes/du_et_al.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/du_et_al.pdf -------------------------------------------------------------------------------- /notes/du_et_al.tex: -------------------------------------------------------------------------------- 1 | \documentclass[11pt]{article} 2 | 3 | % Estilo del documento 4 | \usepackage[utf8]{inputenc} % Lets you write accents with áéíóú etc 5 | \usepackage[T1]{fontenc} % Lets you write UTF-8 chars in the code 6 | \setlength{\headheight}{14.0pt} % Removes fancy header warning (Not sure what it does) 7 | \usepackage{geometry} % To edit margins and their format 8 | \usepackage[english]{babel} % language 9 | \usepackage{indentfirst} % First paragraph of each section / subsection 10 | \usepackage[linktocpage]{hyperref} % References inside the document and hyperrefs out of it 11 | \usepackage{url} % url colors and so 12 | \hypersetup{colorlinks=true, urlcolor=blue} 13 | \usepackage{graphicx} % to include images, Gull page: http://en.wikibooks.org/wiki/LaTeX/Floats,_Figures_and_Captions 14 | \usepackage[export]{adjustbox} % Images layout (e.g. 
lets you put right, left in the includegraphix) 15 | \usepackage{listings} % Show code 16 | \usepackage{fancyhdr} % Headers y footers 17 | \usepackage{multicol} % http://stackoverflow.com/questions/1491717/how-to-display-a-content-in-two-column-layout-in-latex 18 | \usepackage{blindtext} % For the cool paragraph (Enter after the paragraph section) 19 | \usepackage{textcomp} 20 | \usepackage{bussproofs} 21 | \usepackage{enumitem} % To enum with letters and other things 22 | \usepackage{leftidx} % left superindices 23 | \usepackage{euscript} % Fancy A and S for symmetry groups (among other things) 24 | \usepackage{dsfont} 25 | 26 | % Math packages 27 | \usepackage{amsmath} % General maths 28 | \usepackage{amsthm} % theorems, propositions... 29 | \usepackage{amssymb} % symbols, arrows... 30 | \usepackage{amsrefs} % Automatically formatted bibliography 31 | \usepackage{mathrsfs} % Very flamboyant letters 32 | %\usepackage{stmaryrd} % Square brackets for semantics 33 | \usepackage{bussproofs} 34 | 35 | \usepackage{xparse} 36 | 37 | \usepackage{color} 38 | % Colors 39 | \definecolor{mygreen}{rgb}{0,0.6,0} 40 | \definecolor{mygray}{rgb}{0.8,0.8,0.8} 41 | \definecolor{mymauve}{rgb}{0.58,0,0.82} 42 | 43 | %Others 44 | \usepackage{nag} % Warning for deprecated methods 45 | 46 | % Document style 47 | \geometry{margin=3cm} % 48 | \geometry{a4paper} % 49 | %\setlength{\parindent}{1.5em} % First line indentation 50 | \setlength{\parskip}{0.5\baselineskip} % Paragraph separation 51 | \setcounter{tocdepth}{2} % Table of contents until subsection 52 | 53 | % amsthm style definitions 54 | \theoremstyle{plain} 55 | \newtheorem{thm}{Theorem}[section] 56 | \newtheorem{prop}[thm]{Proposition} 57 | \newtheorem{lemma}[thm]{Lemma} 58 | \newtheorem{condition}[thm]{Condition} 59 | \newtheorem{corol}[thm]{Corollary} 60 | 61 | \newtheorem{tma}{Teorema}[section] 62 | \newtheorem{prob}[thm]{Problem} 63 | \newtheorem{lema}[tma]{Lema} 64 | \newtheorem{corolario}[tma]{Corolario} 65 | 66 | \theoremstyle{definition} 67 | \newtheorem{example}{Example} 68 | \newtheorem{remark}[thm]{Remark} 69 | \newtheorem*{exer}{Exercise} 70 | \newtheorem{pr}{Proof} 71 | \newtheorem{defi}[thm]{Definition} 72 | 73 | \newtheorem{ejem}{Ejemplo} 74 | \newtheorem{obs}{Observación} 75 | \newtheorem*{ejer}{Ejercicio} 76 | \newtheorem{demo}{Demostración} 77 | \newtheorem{definicion}[thm]{Definición} 78 | 79 | % Tikz's shit 80 | \usepackage{tikz} % To draw cats automatas etc etc 81 | \usetikzlibrary{automata} % 82 | \usetikzlibrary{arrows} % Different types of arrows (e.g. 
inclusion) 83 | 84 | \usetikzlibrary[shapes.arrows] 85 | \usetikzlibrary{shapes.geometric} 86 | \usetikzlibrary{backgrounds} 87 | \usetikzlibrary{positioning} 88 | \usetikzlibrary{calc} 89 | \usetikzlibrary{intersections} 90 | \usetikzlibrary{fadings} 91 | \usetikzlibrary{decorations.footprints} 92 | \usetikzlibrary{patterns} 93 | \usetikzlibrary{shapes.callouts} 94 | \usetikzlibrary{fit} 95 | 96 | % Tikz Settings 97 | \tikzset{->, >=stealth', shorten >=1pt, auto, node distance=1cm, semithick, baseline=(current bounding box.center)} 98 | 99 | % Listing 100 | \lstset{ 101 | columns=fullflexible, 102 | backgroundcolor=\color{white}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor} 103 | basicstyle=\ttfamily, % the size of the fonts that are used for the code 104 | breakatwhitespace=false, % sets if automatic breaks should only happen at whitespace 105 | breaklines=true, % sets automatic line breaking 106 | captionpos=b, % sets the caption-position to bottom 107 | commentstyle=\color{mygreen}, % comment style 108 | %deletekeywords={...}, % if you want to delete keywords from the given language 109 | inputencoding=utf8, 110 | %escapeinside={\%*}{*)}, % if you want to add LaTeX within your code 111 | extendedchars=true, % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8 112 | literate= {á}{{\'a}}1 {é}{{\'e}}1 {í}{{\'i}}1 {ó}{{\'o}}1 {ú}{{\'u}}1 {ñ}{{\~n}}1 113 | {Á}{{\'A}}1 {É}{{\'E}}1 {Í}{{\'I}}1 {Ó}{{\'O}}1 {Ú}{{\'U}}1 {Ñ}{{\~N}}1 114 | {_}{{\_}}1 {^}{{\textasciicircum}}1, 115 | frame=single, % adds a frame around the code 116 | keepspaces=true, % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible) 117 | keywordstyle=\color{blue}, % keyword style 118 | language=C++, % the language of the code 119 | morekeywords={ll,ii,vi,vii,vvi,vll,mii,ld,point,vect,line,circle,polygon, each}, 120 | % if you want to add more keywords to the set 121 | numbers=left, % where to put the line-numbers; possible values are (none, left, right) 122 | numbersep=5pt, % how far the line-numbers are from the code 123 | numberstyle=\tiny\color{mygray}, % the style that is used for the line-numbers 124 | rulecolor=\color{black}, % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here)) 125 | showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces' 126 | showstringspaces=false, % underline spaces within strings only 127 | showtabs=false, % show tabs within strings adding particular underscores 128 | stepnumber=1, % the step between two line-numbers. 
If it's 1, each line will be numbered 129 | stringstyle=\color{mymauve}, % string literal style 130 | tabsize=4, % sets default tabsize to 4 spaces 131 | %title=\lstname, % show the filename of files included with \lstinputlisting; also try caption instead of title 132 | texcl=true, 133 | morecomment=[l][basicstyle]{http://} 134 | } 135 | 136 | % Config Headers y footers 137 | %\pagestyle{fancy} 138 | %\fancyhf{} 139 | %\renewcommand{\sectionmark}[1]{\markright{#1}{}} % Stop showing section numbers in the header 140 | %\renewcommand{\subsectionmark}[1]{\markright{#1}{}} % Stop showing subsection numberless in the header 141 | %\renewcommand{\subsubsectionmark}[1]{\markright{#1}{}} % Stop showing subsubsection numberless in the header 142 | 143 | % Cool Paragraph 144 | \makeatletter 145 | \renewcommand{\paragraph}{\@startsection{paragraph}{4}{0ex}% 146 | {-3.25ex plus -1ex minus -0.2ex}% 147 | {1ex plus 0.2ex}% 148 | {\normalfont\normalsize\bfseries}} 149 | \makeatother 150 | 151 | \renewcommand{\baselinestretch}{1.3} 152 | 153 | % Config caption names: 154 | \renewcommand{\lstlistingname}{Algorithm} 155 | 156 | % Usage: \circled{1}[\leq] 157 | \newcommand*\circledaux[1]{\tikz[baseline=(char.base)]{ 158 | \node[shape=circle,draw,inner sep=0.8pt] (char) {#1};}} 159 | 160 | \NewDocumentCommand{\circled}{ m o }{% 161 | \IfNoValueTF{#2}{ \circledaux{#1} }{ \stackrel{\circledaux{#1}}{#2} }% 162 | } 163 | 164 | 165 | %\rhead{\fancyplain{}{}} % predefined () 166 | %\lhead{\fancyplain{}{\rightmark }} % 1. sectionname, 1.1 subsection name etc 167 | %\cfoot{\fancyplain{}{\thepage}} 168 | 169 | % Totally necessary: always writes correctly epsilon and phi 170 | \let\temp\phi 171 | \let\phi\varphi 172 | \let\varphi\temp 173 | \let\temp\epsilon 174 | \let\epsilon\varepsilon 175 | \let\varepsilon\temp 176 | \renewcommand{\star}{\ast} 177 | 178 | % My definitions 179 | \newcommand{\Ss}{{\EuScript S}} 180 | \newcommand{\Aa}{{\EuScript A}} 181 | \newcommand{\Ab}{\text{Ab}} 182 | 183 | 184 | \newcommand{\x}{{\tt x}} \newcommand{\y}{{\tt y}} 185 | \newcommand{\z}{{\tt z}} \renewcommand{\t}{{\tt t}} 186 | \newcommand{\s}{{\tt s}} \newcommand{\ww}{{\tt w}} 187 | \newcommand{\uu}{{\tt u}} 188 | \newcommand{\Var}[1]{\text{Var}\left[#1\right]} 189 | \newcommand{\Cov}[1]{\text{Cov}\left[#1\right]} 190 | \renewcommand{\P}[1]{\mathbb{P}\left[#1\right]} 191 | \newcommand{\Vart}{\text{Var}} 192 | \newcommand{\E}[1]{\mathbb{E}\left[ #1 \right]} 193 | \newcommand{\R}{\mathbb{R}} 194 | \newcommand{\Z}{\mathbb{Z}} 195 | \newcommand{\N}{\mathbb{N}} 196 | \newcommand{\pa}[1]{\left( #1\right)} 197 | \newcommand{\norm}[1]{\left\| #1 \right\|} 198 | \newcommand{\abs}[1]{\left| #1 \right|} 199 | %\renewcommand{\dot}[1]{\left\langle #1\right\rangle} 200 | \renewcommand{\L}{\mathscr{L}} 201 | \newcommand{\dirich}[1]{\mathcal{E}\left( #1 \right)} 202 | \newcommand{\grad}{\nabla} 203 | \renewcommand{\exp}[1]{\text{exp}\left(#1\right)} 204 | \newcommand{\Ent}[1]{\text{Ent}\left[#1\right]} 205 | \newcommand{\Entt}{\text{Ent}} 206 | \newcommand{\Lip}{\text{Lip}} 207 | \newcommand{\diam}[1]{\text{diam}\left(#1\right)} 208 | 209 | \newcommand{\one}[1]{\mathds{1}} 210 | \newcommand{\ip}[2]{\left\langle{#1},{#2}\right\rangle} 211 | 212 | \DeclareMathOperator*{\argmax}{arg\,max} 213 | \DeclareMathOperator*{\argmin}{arg\,min} 214 | 215 | % Rules 216 | \newcommand{\HRule}{\rule{\linewidth}{0.5mm}} % Title's rule 217 | 218 | \renewcommand{\arraystretch}{1.5} % Space between rows in tabular 219 | \usepackage{multirow} 220 | 221 | 
\usepackage{xcolor} 222 | \usepackage[framemethod=tikz]{mdframed} 223 | 224 | \definecolor{cccolor}{rgb}{.67,.7,.67} 225 | 226 | 227 | \usepackage{mdframed} 228 | \usetikzlibrary{shadows} 229 | \newmdtheoremenv[shadow=true, shadowsize=5pt]{boxedthm}{Theorem} %TODO shared counter + italic font 230 | 231 | 232 | 233 | % Wrapper for pseudocode 234 | \usepackage{algorithm} 235 | % Pseudocode 236 | \usepackage[noend]{algpseudocode}% https://tex.stackexchange.com/questions/177025/hyperref-cleveref-and-algpseudocode-same-identifier-warning 237 | 238 | % PseudoCode 239 | \newcommand*\var{\mathit} % Variables in pseudocode 240 | \newcommand*\fn{\operatorname} % Functions in pseudocode 241 | \newcommand{\code}{\texttt} % Inline Code 242 | 243 | \makeatletter 244 | \newcounter{algorithmicH}% New algorithmic-like hyperref counter 245 | \let\oldalgorithmic\algorithmic 246 | \renewcommand{\algorithmic}{% 247 | \stepcounter{algorithmicH}% Step counter 248 | \oldalgorithmic}% Do what was always done with algorithmic environment 249 | \renewcommand{\theHALG@line}{ALG@line.\thealgorithmicH.\arabic{ALG@line}} 250 | \makeatother 251 | 252 | \iffalse 253 | \begin{algorithm}[!htp] 254 | \caption{Rejection Sampling}\label{lst:rej_samp} 255 | \begin{algorithmic}[1] 256 | \Procedure{$\operatorname{rejection\_sampling}$}{$f, g, M$} 257 | \While{\code{true}} 258 | \State $x \gets $ \code{sample}$\pa{g}$ 259 | \State $\var{accept} \gets \frac{f(x)}{Mg(x)}$ 260 | \If{\code{sample}$\pa{\mathcal{U}(0,1)} < \var{accept}$} 261 | \State \Return $x$ \Comment{Accept $x$} 262 | \EndIf 263 | \EndWhile 264 | \EndProcedure 265 | \end{algorithmic} 266 | \end{algorithm} 267 | \fi 268 | 269 | \usepackage{epigraph} 270 | \setlength{\epigraphwidth}{0.5\linewidth} 271 | \setlength{\epigraphrule}{0pt} 272 | \renewcommand*{\textflush}{flushright} 273 | \renewcommand*{\epigraphsize}{\normalsize\itshape} 274 | 275 | \usepackage[capitalise,nameinlink,noabbrev]{cleveref} % Cite with \cref or \Cref so the name of the object (Theorem, Proposition, etc.) is written automatically 276 | 277 | % Customized sections: http://tex.stackexchange.com/questions/136527/section-numbering-without-numbers/136541#136541 278 | 279 | %\usepackage{titlesec} 280 | %\titlelabel{\thetitle.\enspace} 281 | %\titleformat{\section} 282 | % {\normalsize\bfseries\centering} % The style of the section title 283 | % {} % a prefix 284 | % {0pt} % How much space exists between the prefix and the title 285 | % {Question \thesection} % How the section is represented 286 | % %{Section \thesection:\quad} % How the section is represented 287 | % 288 | %% Starred variant 289 | %\titleformat{name=\section,numberless} 290 | % {\normalfont\Large\bfseries} 291 | % {} 292 | % {0pt} 293 | % {} 294 | 295 | % Graphics 296 | 297 | 298 | %================================================================================ 299 | % Comments 300 | %================================================================================ 301 | \iffalse 302 | 303 | % Align 304 | \begin{align*} 305 | \begin{aligned} 306 | i &= i \\ 307 | &= i \\ 308 | \end{aligned} 309 | \end{align*} 310 | 311 | % Stack things 312 | \stackrel{?}{<} 313 | 314 | % Graphics 315 | \begin{figure}[h!] 
316 | \centering 317 | \includegraphics[scale=0.1]{1} 318 | \caption{SGD adaptation} 319 | \end{figure} 320 | 321 | \fi 322 | 323 | \title{} 324 | \date{} 325 | \author{} 326 | 327 | 328 | 329 | \begin{document} 330 | 331 | \section{Gradient Descent Finds Global Minima of Deep Neural Networks} 332 | \subsection*{Definitions} 333 | \begin{itemize} 334 | \item $m$: Width of each layer of the neural network. 335 | \item $n$: number of samples. 336 | \item $d$: dimension of training data. 337 | \item $H$: number of layers of the neural network. 338 | \item $\eta$: learning rate for gradient descent. 339 | \item $\theta$: parameters of the neural network. 340 | \item $\theta(k)$: parameters of the neural network after $k$ iterations of training with gradient descent. $\theta(0)$ are the parameters at initialization (iid $N(0,1)$). 341 | \item $\sigma$: Activation function. It is Lipschitz, smooth, analytical and not a polynomial. 342 | \item $(\mathbf{x}_i, y_i) \in \R^d\times\R, 1\leq i\leq n$: training data and corresponding labels. In this work, it is assumed that no two input points are parallel, i.e. $x_i \nparallel x_j$ for $i\neq j$. 343 | \item $\mathbf{y} = (y_1,\dots, y_n) \in \R^n$: vector of labels. 344 | \item $\mathbf{W}^{(1)} \in \R^{m\times d}, \mathbf{W}^{(h)}\in \R^{m\times m} 2\leq h\leq H, \mathbf{a}\in \R^m$ are, respectively, the first layer, the $h$ layer and the output layer of the neural network respectively. We also use $\mathbf{W}^{(h)}(k)$, $\mathbf{a}(k)$ to denote the layers after $k$ iterations of training with GD. 345 | \item $c_{\sigma}=\left(\mathbb{E}_{x \sim N(0,1)}\left[\sigma(x)^{2}\right]\right)^{-1}$ is a scaling factor to normalize the input in the initialization phase of the neural network. 346 | \item \textbf{Fully-connected neural network (NN)}. Let $\mathbf{x}^{(0)}$ be an input of the NN. Then the fully-connected neural network function $f$ is defined recursively in the following way: 347 | \begin{align*} 348 | \begin{aligned} 349 | \mathbf{x}^{(h)} &= \sqrt{\frac{c_\sigma}{m}} \sigma\left(\mathbf{W}^{(h)} \mathbf{x}^{(h-1)}\right), 1 \leq h \leq H \\ 350 | f(\mathbf{x}, \theta) &= \mathbf{a}^\top \mathbf{x}^{(H)}. 351 | \end{aligned} 352 | \end{align*} 353 | where $c_{\sigma}=\left(\mathbb{E}_{x \sim N(0,1)}\left[\sigma(x)^{2}\right]\right)^{-1}$ is the scaling defined above. 354 | 355 | \item \textbf{Loss function ($\ell_2$)}. $L(\theta) = \frac{1}{2}\sum_{i=1}^n (f(\theta,\mathbf{x}_i)-y_i)^2$. 356 | \item $u_i(k) = f(\theta(k), \mathbf{x}_i)$. Output of the NN for sample $i$ after $k$ iterations of GD. 357 | \item $\mathbf{u}(k) = (u_1(k), \dots, u_n(k))^\top \in \R^n$. 358 | \item $\mathbf{G}^{(h)}(k) \in \R^{n\times n}$, $1\leq h \leq H+1$ defined as $\mathbf{G}_{ij}^{(h)}(k) = \left\langle\frac{\partial u_{i}(k)}{\partial \mathbf{W}^{(h)}(k)}, \frac{\partial u_{j}(k)}{\partial \mathbf{W}^{(h)}(k)}\right\rangle$ for $h=1, \ldots, H$ and $\mathbf{G}_{i j}^{(H+1)}(k)=\left\langle\frac{\partial u_{i}(k)}{\partial \mathbf{a}(k)}, \frac{\partial u_{j}(k)}{\partial \mathbf{a}(k)}\right\rangle$. So that the following definition can be used to express the dynamics of the NN. 359 | \item $\mathbf{G}(k)$ defined as $\mathbf{G}_{ij}(k) = \sum_{h=1}^{H+1} \mathbf{G}_{ij}^{(h)}(k)$. 
Note that in the infinite-width NTK limit the network behaves as its linearization, so that
360 | \[
361 | \mathbf{y}-\mathbf{u}(k+1)=(\mathbf{I}-\eta K)(\mathbf{y}-\mathbf{u}(k)),
362 | \]
363 | We want to argue that
364 | \[
365 | \mathbf{y}-\mathbf{u}(k+1)\approx(\mathbf{I}-\eta \mathbf{G}(k))(\mathbf{y}-\mathbf{u}(k)),
366 | \]
367 | in a precise way. Note the gradient descent update is
368 | \begin{align*}
369 | \begin{aligned}
370 | \mathbf{W}^{(h)}(k) &=\mathbf{W}^{(h)}(k-1)-\eta \frac{\partial L(\theta(k-1))}{\partial \mathbf{W}^{(h)}(k-1)}, \\
371 | \mathbf{a}(k) &=\mathbf{a}(k-1)-\eta \frac{\partial L(\theta(k-1))}{\partial \mathbf{a}(k-1)}.
372 | \end{aligned}
373 | \end{align*}
374 | 
375 | \begin{remark}
376 | Each entry of $\mathbf{G}^{(h)}(k)$ is an inner product, i.e.\ $\mathbf{G}^{(h)}(k)$ is a Gram matrix, and thus it is a PSD matrix. Furthermore, if there exists one $h\in[H]$ such that $\mathbf{G}^{(h)}(k)$ is strictly positive definite, then choosing the step size $\eta$ sufficiently small makes the loss decrease at the $k$-th iteration, by an argument analogous to the analysis of the power method, which gives a linear convergence rate. In the paper they focus on $\mathbf{G}^{(H)}(k)$ only.
377 | \end{remark}
378 | 
379 | \item $\mathbf{K}^{(h)}$ is a fixed matrix which depends on the input data and the neural network architecture (including the activation function), but does not depend on the parameters $\theta$. It will be shown that $\mathbf{G}^{(H)}(0)$ at initialization is close to $\mathbf{K}^{(H)}$, that $\mathbf{G}^{(H)}(k)$ is close to $\mathbf{G}^{(H)}(0)$ and that $\mathbf{K}^{(H)}$ is strictly positive definite (its least eigenvalue $\lambda_0$ is positive). These three things imply linear convergence of gradient descent, by proving that the minimum eigenvalue of $\mathbf{G}^{(H)}(k)$ is bounded below by a positive constant independent of $k$. The definition of these matrices for the fully-connected neural network is the following:
380 | \begin{align}
381 | \begin{aligned}
382 | \mathbf{K}_{i j}^{(0)} &=\left\langle\mathbf{x}_{i}, \mathbf{x}_{j}\right\rangle \\ \mathbf{A}_{i j}^{(h)} &=\left(\begin{array}{cc}{\mathbf{K}_{i i}^{(h-1)}} & {\mathbf{K}_{i j}^{(h-1)}} \\ {\mathbf{K}_{j i}^{(h-1)}} & {\mathbf{K}_{j j}^{(h-1)}}\end{array}\right) \\ \mathbf{K}_{i j}^{(h)} &=c_{\sigma} \mathbb{E}_{(u, v)^{\top} \sim N\left(\mathbf{0}, \mathbf{A}_{i j}^{(h)}\right)}[\sigma(u) \sigma(v)] \\ \mathbf{K}_{i j}^{(H)} &=c_{\sigma} \mathbf{K}_{i j}^{(H-1)} \mathbb{E}_{(u, v)^{\top} \sim N\left(\mathbf{0}, \mathbf{A}_{i j}^{(H-1)}\right)}\left[\sigma^{\prime}(u) \sigma^{\prime}(v)\right]
383 | \end{aligned}
384 | \end{align}
385 | 
386 | \item $u_{i}^{\prime}(\theta) = \frac{\partial u_{i}}{\partial \theta}, u_{i}^{(h)}(\theta) = \frac{\partial u_{i}}{\partial \mathbf{W}^{(h)}}, u_{i}^{(a)}(\theta) = \frac{\partial u_{i}}{\partial \mathbf{a}}, L^{\prime}(\theta)=\frac{\partial L(\theta)}{\partial \theta}, L^{(h)}(\mathbf{W}^{(h)})=\frac{\partial L(\theta)}{\partial \mathbf{W}^{(h)}}, L^{(a)}(\theta) = \frac{\partial L}{\partial \mathbf{a}}$.
387 | 
388 | \end{itemize}
389 | 
390 | \subsection*{Results}
391 | 
392 | The paper proves global linear convergence, i.e.\ convergence to zero training error, for some deep network architectures, with high probability with respect to the initialization, assuming the networks are sufficiently overparametrized and that the $\ell_2$ loss is used. Note the learning rate has to be quite small, much smaller than what would be used in practice.
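For intuition about the linear rates proved below (a heuristic reading, not an argument taken from the paper): if the idealized recursion $\mathbf{y}-\mathbf{u}(k+1)=(\mathbf{I}-\eta K)(\mathbf{y}-\mathbf{u}(k))$ displayed above held exactly for a fixed positive definite kernel $K$ and $\eta \le 1/\lambda_{\max}(K)$, then
\[
\norm{\mathbf{y}-\mathbf{u}(k)}_2 \le \norm{\mathbf{I}-\eta K}_2^{\,k}\,\norm{\mathbf{y}-\mathbf{u}(0)}_2 = \left(1-\eta\lambda_{\min}(K)\right)^{k}\norm{\mathbf{y}-\mathbf{u}(0)}_2,
\]
so the proofs below can be read as controlling how far $\mathbf{G}(k)$ drifts from such a fixed positive definite kernel; the factor $1/2$ that will appear in the rate absorbs the resulting approximation errors.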
Another caveat is that the overparametrization depends on $\lambda_0$, the minimum eigenvalue of $\mathbf{K}^{(H)}$, which is proved to be positive, but no guarantee is provided that $\lambda_0$ is not arbitrarily small in some cases.
393 | 
394 | The results of the paper are for fully-connected NNs, which need overparametrization exponential in the depth, for ResNets, for which this dependence on the depth drops to a polynomial, and for convolutional ResNets. In these notes we focus on the fully-connected architecture for simplicity. The arguments are quite similar across architectures.
395 | 
396 | \begin{thm}[Convergence Rate of Gradient Descent for Deep Fully-connected Neural Networks]\label{thm:convergence}
397 | Assume for all $i \in [n]$, $\norm{\mathbf{x}_i}_2 = 1$, $\abs{y_i} = O(1)$ and the number of hidden nodes per layer satisfies
398 | \begin{align*}
399 | m=\Omega\left(2^{O(H)}\max\left\{
400 | \frac{n^4}{\lambda_{\min}^4\left(\mathbf{K}^{(H)}\right)},\frac{n}{\delta}, \frac{n^2\log(\frac{Hn}{\delta})}{\lambda_{\min}^2\left(\mathbf{K}^{(H)}\right)}
401 | \right\}\right).
402 | \end{align*}
403 | If we set the step size
404 | \[\eta = O\left(\frac{\lambda_{\min}\left(\mathbf{K}^{(H)}\right)}{n^22^{O(H)}}\right),\]
405 | then with probability at least $1-\delta$ over the random initialization, for $k=1,2,\ldots$, the loss at each iteration satisfies
406 | \begin{align*}
407 | L(\theta(k))\le \left(1-\frac{\eta \lambda _{\min}\left(\mathbf{K}^{(H)}\right)}{2}\right)^{k}L(\theta(0)).
408 | \end{align*}
409 | \end{thm}
410 | 
411 | In order to prove the theorem, we introduce a few lemmas. First, we state the condition of the theorem that we want to prove for all $k$ with high probability, where $\lambda_0$ is the minimum eigenvalue of $\mathbf{K}^{(H)}$.
412 | 
413 | \begin{condition}\label{cond:linear_converge}
414 | At the $k$-th iteration, we have \begin{align*}
415 | \norm{\mathbf{y}-\mathbf{u}(k)}_2^2 \le (1-\frac{\eta \lambda_0}{2})^{k} \norm{\mathbf{y}-\mathbf{u}(0)}_2^2.
416 | \end{align*}
417 | \end{condition}
418 | 
419 | 
420 | \begin{lemma}[Initialization norm] If $\sigma(\cdot)$ is $L$-Lipschitz and $m= \Omega\left(\frac{nHg_c(H)^2}{\delta}\right)$ with $c = c_\sigma L(2\abs{\sigma(0)} \sqrt{\frac{2}{\pi}}+2L)$, then with probability at least $1-\delta$ over random initialization, for every $h \in [H]$ and $i \in [n]$ we have
421 | \[
422 | \frac{1}{c_{x, 0}} \leq\left\|\mathbf{x}_{i}^{(h)}(0)\right\|_{2} \leq c_{x, 0},
423 | \]
424 | where $c_{x,0}=2$.
425 | \end{lemma}
426 | A similar lemma can be proven for different architectures with a different value of $c_{x,0}$. This lemma is needed in the proofs of Lemmas \ref{lemma:activations_stability} and \ref{lemma:eigenvalue_stability_while_training}.
427 | 
428 | \begin{lemma}[Least Eigenvalue at the Initialization] If $m= \Omega\left(\frac{n^2\log(Hn/\delta)2^{O(H)}}{\lambda_0^2}\right)$ we have
429 | \[
430 | \lambda_{\textup{min}}(\mathbf{G}^{(H)}(0)) \geq \frac{3}{4}\lambda_0.
431 | \]
432 | \end{lemma}
433 | 
434 | \begin{lemma}[Stability of the activations during training]\label{lemma:activations_stability}
435 | Suppose for every $h\in[H]$, $\norm{\mathbf{W}^{(h)}(0)}_2 \le c_{w,0}\sqrt{m}$, $\norm{\mathbf{x}^{(h)}(0)}_2 \le c_{x,0}$ and $\norm{\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)}_F \le \sqrt{m} R$ for some constants $c_{w,0},c_{x,0} > 0$ and $R \le c_{w,0}$.
436 | If $\sigma(\cdot)$ is $L-$Lipschitz, we have \begin{align*} 437 | \norm{\mathbf{x}^{(h)}(k)-\mathbf{x}^{(h)}(0)}_2 \le \sqrt{c_{\sigma}}Lc_{x,0}g_{c_x}(h)R 438 | \end{align*} where $c_x=2\sqrt{c_{\sigma}}Lc_{w,0}$. 439 | \end{lemma} 440 | 441 | \begin{lemma} \label{lemma:eigenvalue_stability_while_training} Suppose $\sigma(\cdot)$ is $L-$Lipschitz and $\beta-$smooth. Suppose for $h\in[H]$, $\norm{\mathbf{W}^{(h)}(0)}_2\le c_{w,0}\sqrt{m}$, $\norm{\mathbf{a}(0)}_2\le a_{2,0}\sqrt{m}$, $\norm{\mathbf{a}(0)}_4\le a_{4,0}m^{1/4}$ , $\frac{1}{c_{x,0}}\le\norm{\mathbf{x}^{(h)}(0)}_2 \le c_{x,0}$, if $\norm{\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)}_F$, $\norm{\mathbf{a}(k)-\mathbf{a}(0)}_2 \le \sqrt{m}R$ where $R \le c g_{c_x}(H)^{-1}\lambda_0n^{-1}$ and $R\le c g_{c_x}(H)^{-1}$ for some small constant $c$ and $c_x = 2\sqrt{c_{\sigma}}Lc_{w,0}$, we have \begin{align*} 442 | \norm{\mathbf{G}^{(H)}(k) - \mathbf{G}^{(H)}(0)}_2 \le \frac{\lambda_0}{4}. 443 | \end{align*} 444 | \end{lemma} 445 | The assumption $\norm{W^{(h)}(0)}_2 \leq c_{w,0}\sqrt{m}$ is a well know fact of gaussian initialized matrices and the bounds on $\norm{a(0)}_2$ and $\norm{a(0)}_4$ can be proved using standard concentration inequalities. $a_{2,0}$ and $a_{4,0}$ are universal constants. 446 | 447 | \begin{lemma} \label{lemma:weights_stability} 448 | If Condition~\ref{cond:linear_converge} holds for $k'=1,\ldots,k$, we have for any $s =1,\ldots,k+1$ 449 | \begin{align*} 450 | &\norm{\mathbf{W}^{(h)}(s)-\mathbf{W}^{(h)}(0)}_F, \norm{\mathbf{a}(s)-\mathbf{a}(0)}_2 \le R'\sqrt{m}\\ 451 | &\norm{\mathbf{W}^{(h)}(s)-\mathbf{W}^{(h)}(s-1)}_F, \norm{\mathbf{a}(s)-\mathbf{a}(s-1)}_2\le \eta Q'(s-1) 452 | \end{align*}where $R'=\frac{16c_{x,0}a_{2,0}\left(c_x\right)^H \sqrt{n} \norm{\mathbf{y}-\mathbf{u}(0)}_2}{\lambda_0\sqrt{m}} \le cg_{c_x}(H)^{-1}$ for some small constant $c$ with $c_x=\max\{2\sqrt{c_{\sigma}}Lc_{w,0},1\}$ and $ Q'(s)= 4c_{x,0}a_{2,0}\left(c_x\right)^{H}\sqrt{n} \norm{\mathbf{y}-\mathbf{u}(s)}_2$ 453 | 454 | \end{lemma} 455 | 456 | \begin{lemma}\label{lemma:small_snd_order_term} 457 | Let 458 | \[ 459 | I_2^i(k) = \int_{s=0}^{\eta}\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))-u_{i}^{\prime}\left(\theta(k)-s L^{\prime}(\theta(k))\right)\right\rangle d s 460 | \] 461 | and $\mathbf{I}_2(k) = (I_2^1(k), \dots, I_2^n(k))^\top$. 462 | If Condition~\ref{cond:linear_converge} holds for $k'=1,\ldots,k$, suppose $\eta\le c\lambda_0\left(n^{2}H^2(c_x)^{3H}g_{2c_x}(H)\right)^{-1}$ for some small constant $c$, we have \begin{align*} 463 | \norm{\mathbf{I}_2(k)}_2 \le \frac{1}{8}\eta \lambda_0 \norm{\mathbf{y}-\mathbf{u}(k)}_2. 464 | \end{align*} 465 | \end{lemma} 466 | 467 | \begin{lemma}\label{lemma:small_snd_order_term_2} 468 | If Condition~\ref{cond:linear_converge} holds for $k'=1,\ldots,k$, suppose $\eta\le c\lambda_0\left(n^{2}H^2(c_x)^{2H}g_{2c_x}(H)\right)^{-1}$ for some small constant $c$, then we have 469 | $\norm{\mathbf{u}(k+1)-\mathbf{u}(k)}_2^2\le \frac{1}{8}\eta \lambda_0 \norm{\mathbf{y}-\mathbf{u}(k)}_2^2$. 470 | 471 | \end{lemma} 472 | 473 | \begin{proof}[Proof of Theorem \ref{thm:convergence}] 474 | We want to prove Condition \ref{cond:linear_converge} for all $k$. We proceed by induction. 
Note that
475 | \begin{equation} \label{eq:decomposition}
476 | \begin{aligned} &\|\mathbf{y}-\mathbf{u}(k+1)\|_{2}^{2} \\=&\|\mathbf{y}-\mathbf{u}(k)-(\mathbf{u}(k+1)-\mathbf{u}(k))\|_{2}^{2} \\=&\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}-2(\mathbf{y}-\mathbf{u}(k))^{\top}(\mathbf{u}(k+1)-\mathbf{u}(k))+\|\mathbf{u}(k+1)-\mathbf{u}(k)\|_{2}^{2} \end{aligned}
477 | \end{equation}
478 | 
479 | We need the second summand to be larger in absolute value than the third one for the loss to decrease. Intuitively this is true because, by a Taylor expansion of $\mathbf{u}(k+1)-\mathbf{u}(k)$ with respect to $\eta$, the second summand is of order $\eta$ plus second-order terms and the third summand is of order $\eta^2$, so for $\eta$ small enough we can prove that the loss decreases. Then we have to prove that the first-order term in $\eta$ is compatible with the rate in Condition~\ref{cond:linear_converge}. Expanding one coordinate of $\mathbf{u}(k+1)-\mathbf{u}(k)$ by Taylor we obtain
480 | \begin{align*}
481 | \begin{aligned}
482 | \mathbf{u}_i(k+1)-\mathbf{u}_i(k) = \left( -\eta\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))\right\rangle \right) + I_2^i(k)
483 | \end{aligned}
484 | \end{align*}
485 | where, following the notation of the paper, we denote by $I_2^i(k)$ the second-order term in $\eta$. It is equal to
486 | \[
487 | I_2^i(k) = \int_{s=0}^{\eta}\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))-u_{i}^{\prime}\left(\theta(k)-s L^{\prime}(\theta(k))\right)\right\rangle \mathrm{d}s.
488 | \]
489 | But let's focus on the first term, which we denote $I_1^i(k)$, and let $\mathbf{I}_1(k) =(I_1^1(k), \dots, I_1^n(k))^\top $ and $\mathbf{I}_2(k) =(I_2^1(k), \dots, I_2^n(k))^\top $. We have
490 | \begin{align*}
491 | \begin{aligned} I_{1}^{i} &=-\eta\left\langle L^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))\right\rangle \\ &=-\eta \sum_{j=1}^{n}\left(u_{j}-y_{j}\right)\left\langle u_{j}^{\prime}(\theta(k)), u_{i}^{\prime}(\theta(k))\right\rangle \\ & \triangleq-\eta \sum_{j=1}^{n}\left(u_{j}-y_{j}\right) \sum_{h=1}^{H+1} \mathbf{G}_{i j}^{(h)}(k) \end{aligned}
492 | \end{align*}
493 | or, in matrix form,
494 | \[
495 | \mathbf{I}_{1}(k)=-\eta \mathbf{G}(k)(\mathbf{u}(k)-\mathbf{y}).
496 | \]
497 | Now observe that
498 | \begin{align} \label{ineq:bound_Gh}
499 | \begin{aligned}(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{I}_{1}(k) &=\eta(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{G}(k)(\mathbf{y}-\mathbf{u}(k)) \\ & \geq \eta\lambda_{\min }(\mathbf{G}(k))\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2} \\ & \geq \eta\lambda_{\min }\left(\mathbf{G}^{(H)}(k)\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}, \end{aligned}
500 | \end{align}
501 | 
502 | where the last inequality holds because each $\mathbf{G}^{(h)}(k)$ is PSD and $\mathbf{G}(k)=\sum_{h=1}^{H+1}\mathbf{G}^{(h)}(k)$. We will only need to look at $\mathbf{G}^{(H)}$, which has the following form
503 | \[
504 | \mathbf{G}_{i, j}^{(H)}(k)=\left(\mathbf{x}_{i}^{(H-1)}(k)\right)^{\top} \mathbf{x}_{j}^{(H-1)}(k) \cdot \frac{c_{\sigma}}{m} \sum_{r=1}^{m} a_{r}^{2} \sigma^{\prime}\left(\left(\theta_{r}^{(H)}(k)\right)^{\top} \mathbf{x}_{i}^{(H-1)}(k)\right) \sigma^{\prime}\left(\left(\theta_{r}^{(H)}(k)\right)^{\top} \mathbf{x}_{j}^{(H-1)}(k)\right)
505 | \]
506 | 
507 | In principle one could look at the whole $\mathbf{G}(k)$, but in the paper they do not do that. The analysis becomes simpler if only $\mathbf{G}^{(H)}$ is used.
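As a purely numerical illustration of the approximation that drives this proof (this is not part of the paper's argument; it assumes PyTorch is available and uses made-up toy sizes), the following sketch builds a small fully-connected network as defined above, forms $\mathbf{G}(k)$ from per-sample gradients with respect to all parameters, takes one gradient descent step, and checks that $\mathbf{y}-\mathbf{u}(k+1)\approx(\mathbf{I}-\eta\mathbf{G}(k))(\mathbf{y}-\mathbf{u}(k))$; the agreement improves as $m$ grows and $\eta$ shrinks.
\begin{lstlisting}[language=Python]
import torch

torch.manual_seed(0)
n, d, m, H, eta = 4, 3, 500, 2, 1e-3
X = torch.nn.functional.normalize(torch.randn(n, d), dim=1)  # unit-norm inputs
y = torch.randn(n)

sigma = torch.tanh
c_sigma = 1.0 / torch.tanh(torch.randn(100000)).pow(2).mean()  # Monte Carlo estimate of the scaling

Ws = [torch.randn(m, d, requires_grad=True)]
Ws += [torch.randn(m, m, requires_grad=True) for _ in range(H - 1)]
a = torch.randn(m, requires_grad=True)
params = Ws + [a]

def f(x):
    for W in Ws:
        x = torch.sqrt(c_sigma / m) * sigma(W @ x)
    return a @ x

def outputs():
    return torch.stack([f(X[i]) for i in range(n)])

def gram(u):
    rows = []
    for i in range(n):
        g = torch.autograd.grad(u[i], params, retain_graph=True)
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(rows)  # Jacobian of the outputs
    return J @ J.T         # Gram matrix summed over all parameter blocks

u0 = outputs()
G0 = gram(u0)
loss = 0.5 * (u0 - y).pow(2).sum()
grads = torch.autograd.grad(loss, params)
with torch.no_grad():  # one gradient descent step
    for p, g in zip(params, grads):
        p -= eta * g
u1 = outputs()
lhs = y - u1
rhs = (torch.eye(n) - eta * G0) @ (y - u0)
print((lhs - rhs).norm() / (y - u0).norm())  # small relative discrepancy
\end{lstlisting}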
508 | 509 | So putting all together we have 510 | \begin{align*} 511 | \begin{aligned} &\|\mathbf{y}-\mathbf{u}(k+1)\|_{2}^{2} \\ 512 | \circled{1}[\leq] &\left(1-\eta \lambda_{\min }\left(\mathbf{G}^{(H)}(k)\right)\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}-2(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{I}_{2}(k)+\|\mathbf{u}(k+1)-\mathbf{u}(k)\|_{2}^{2} \\ 513 | \circled{2}[\leq] & \left(1-\eta \lambda_{0}\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}-2(\mathbf{y}-\mathbf{u}(k))^{\top} \mathbf{I}_{2}+\|\mathbf{u}(k+1)-\mathbf{u}(k)\|_{2}^{2} \\ 514 | \circled{3}[\leq] &\left(1-\frac{\eta \lambda_{0}}{2}\right)\|\mathbf{y}-\mathbf{u}(k)\|_{2}^{2}. 515 | \end{aligned} 516 | \begin{aligned}\end{aligned} 517 | \end{align*} 518 | 519 | $\circled{1}$ uses Equation \eqref{eq:decomposition} and inequality \eqref{ineq:bound_Gh}. \circled{3} uses Lemmas \ref{lemma:small_snd_order_term} and \ref{lemma:small_snd_order_term_2}. For $\circled{2}$, by induction hypothesis, using Lemma \ref{lemma:weights_stability} we obtain 520 | \[ 521 | \begin{aligned}\left\|\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)\right\|_{F} & \leq R^{\prime} \sqrt{m} \\ & \leq R \sqrt{m} \end{aligned} 522 | \] 523 | for the choice of $m$ in the theorem. By Lemma \ref{lemma:eigenvalue_stability_while_training} we get $\lambda_{\min }\left(\mathbf{G}^{(H)}(k)\right) \geq \frac{\lambda_{0}}{2}$. 524 | 525 | 526 | \end{proof} 527 | 528 | 529 | \nocite{*} % Include refs not cited 530 | \bibliography{refs} %use a bibtex bibliography file refs.bib 531 | \bibliographystyle{plain} %use the plain bibliography style 532 | 533 | \end{document} 534 | -------------------------------------------------------------------------------- /notes/low_rank_jac/.gitignore: -------------------------------------------------------------------------------- 1 | # -*- mode: gitignore; -*- 2 | *~ 3 | \#*\# 4 | /.emacs.desktop 5 | /.emacs.desktop.lock 6 | *.elc 7 | auto-save-list 8 | tramp 9 | .\#* 10 | 11 | # Org-mode 12 | .org-id-locations 13 | *_archive 14 | 15 | # flymake-mode 16 | *_flymake.* 17 | 18 | # eshell files 19 | /eshell/history 20 | /eshell/lastdir 21 | 22 | # elpa packages 23 | /elpa/ 24 | 25 | # reftex files 26 | *.rel 27 | 28 | # AUCTeX auto folder 29 | /auto/ 30 | 31 | # cask packages 32 | .cask/ 33 | dist/ 34 | 35 | # Flycheck 36 | flycheck_*.el 37 | 38 | # server auth directory 39 | /server/ 40 | 41 | # projectiles files 42 | .projectile 43 | 44 | # directory configuration 45 | .dir-locals.el 46 | 47 | # network security 48 | /network-security.data 49 | 50 | 51 | *.pdf 52 | *.pdf_tex 53 | *.synctex.gz 54 | 55 | ## Core latex/pdflatex auxiliary files: 56 | *.aux 57 | *.lof 58 | *.log 59 | *.lot 60 | *.fls 61 | *.out 62 | *.toc 63 | *.fmt 64 | *.fot 65 | *.cb 66 | *.cb2 67 | .*.lb 68 | 69 | ## Intermediate documents: 70 | *.dvi 71 | *.xdv 72 | *-converted-to.* 73 | # these rules might exclude image files for figures etc. 
74 | # *.ps 75 | # *.eps 76 | # *.pdf 77 | 78 | ## Generated if empty string is given at "Please type another file name for output:" 79 | .pdf 80 | 81 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 82 | *.bbl 83 | *.bcf 84 | *.blg 85 | *-blx.aux 86 | *-blx.bib 87 | *.run.xml 88 | 89 | ## Build tool auxiliary files: 90 | *.fdb_latexmk 91 | *.synctex 92 | *.synctex(busy) 93 | *.synctex.gz 94 | *.synctex.gz(busy) 95 | *.pdfsync 96 | 97 | ## Build tool directories for auxiliary files 98 | # latexrun 99 | latex.out/ 100 | 101 | ## Auxiliary and intermediate files from other packages: 102 | # algorithms 103 | *.alg 104 | *.loa 105 | 106 | # achemso 107 | acs-*.bib 108 | 109 | # amsthm 110 | *.thm 111 | 112 | # beamer 113 | *.nav 114 | *.pre 115 | *.snm 116 | *.vrb 117 | 118 | # changes 119 | *.soc 120 | 121 | # comment 122 | *.cut 123 | 124 | # cprotect 125 | *.cpt 126 | 127 | # elsarticle (documentclass of Elsevier journals) 128 | *.spl 129 | 130 | # endnotes 131 | *.ent 132 | 133 | # fixme 134 | *.lox 135 | 136 | # feynmf/feynmp 137 | *.mf 138 | *.mp 139 | *.t[1-9] 140 | *.t[1-9][0-9] 141 | *.tfm 142 | 143 | #(r)(e)ledmac/(r)(e)ledpar 144 | *.end 145 | *.?end 146 | *.[1-9] 147 | *.[1-9][0-9] 148 | *.[1-9][0-9][0-9] 149 | *.[1-9]R 150 | *.[1-9][0-9]R 151 | *.[1-9][0-9][0-9]R 152 | *.eledsec[1-9] 153 | *.eledsec[1-9]R 154 | *.eledsec[1-9][0-9] 155 | *.eledsec[1-9][0-9]R 156 | *.eledsec[1-9][0-9][0-9] 157 | *.eledsec[1-9][0-9][0-9]R 158 | 159 | # glossaries 160 | *.acn 161 | *.acr 162 | *.glg 163 | *.glo 164 | *.gls 165 | *.glsdefs 166 | *.lzo 167 | *.lzs 168 | 169 | # uncomment this for glossaries-extra (will ignore makeindex's style files!) 170 | # *.ist 171 | 172 | # gnuplottex 173 | *-gnuplottex-* 174 | 175 | # gregoriotex 176 | *.gaux 177 | *.gtex 178 | 179 | # htlatex 180 | *.4ct 181 | *.4tc 182 | *.idv 183 | *.lg 184 | *.trc 185 | *.xref 186 | 187 | # hyperref 188 | *.brf 189 | 190 | # knitr 191 | *-concordance.tex 192 | # TODO Comment the next line if you want to keep your tikz graphics files 193 | *.tikz 194 | *-tikzDictionary 195 | 196 | # listings 197 | *.lol 198 | 199 | # luatexja-ruby 200 | *.ltjruby 201 | 202 | # makeidx 203 | *.idx 204 | *.ilg 205 | *.ind 206 | 207 | # minitoc 208 | *.maf 209 | *.mlf 210 | *.mlt 211 | *.mtc[0-9]* 212 | *.slf[0-9]* 213 | *.slt[0-9]* 214 | *.stc[0-9]* 215 | 216 | # minted 217 | _minted* 218 | *.pyg 219 | 220 | # morewrites 221 | *.mw 222 | 223 | # nomencl 224 | *.nlg 225 | *.nlo 226 | *.nls 227 | 228 | # pax 229 | *.pax 230 | 231 | # pdfpcnotes 232 | *.pdfpc 233 | 234 | # sagetex 235 | *.sagetex.sage 236 | *.sagetex.py 237 | *.sagetex.scmd 238 | 239 | # scrwfile 240 | *.wrt 241 | 242 | # sympy 243 | *.sout 244 | *.sympy 245 | sympy-plots-for-*.tex/ 246 | 247 | # pdfcomment 248 | *.upa 249 | *.upb 250 | 251 | # pythontex 252 | *.pytxcode 253 | pythontex-files-*/ 254 | 255 | # tcolorbox 256 | *.listing 257 | 258 | # thmtools 259 | *.loe 260 | 261 | # TikZ & PGF 262 | *.dpth 263 | *.md5 264 | *.auxlock 265 | 266 | # todonotes 267 | *.tdo 268 | 269 | # vhistory 270 | *.hst 271 | *.ver 272 | 273 | # easy-todo 274 | *.lod 275 | 276 | # xcolor 277 | *.xcp 278 | 279 | # xmpincl 280 | *.xmpi 281 | 282 | # xindy 283 | *.xdy 284 | 285 | # xypic precompiled matrices and outlines 286 | *.xyc 287 | *.xyd 288 | 289 | # endfloat 290 | *.ttt 291 | *.fff 292 | 293 | # Latexian 294 | TSWLatexianTemp* 295 | 296 | ## Editors: 297 | # WinEdt 298 | *.bak 299 | *.sav 300 | 301 | # Texpad 302 | .texpadtmp 303 | 304 | # LyX 305 | *.lyx~ 306 | 307 | # Kile 308 | 
*.backup 309 | 310 | # gummi 311 | .*.swp 312 | 313 | # KBibTeX 314 | *~[0-9]* 315 | 316 | # auto folder when using emacs and auctex 317 | ./auto/* 318 | *.el 319 | 320 | # expex forward references with \gathertags 321 | *-tags.tex 322 | 323 | # standalone packages 324 | *.sta 325 | 326 | # Makeindex log files 327 | *.lpzreport.aux 328 | auto/ 329 | supp.zip 330 | -------------------------------------------------------------------------------- /notes/low_rank_jac/Makefile: -------------------------------------------------------------------------------- 1 | ALL=$(wildcard *.sty *.tex figs/*.svg) 2 | PAPER=low_rank_jac_thm 3 | SHELL=/bin/zsh 4 | 5 | #FIGS_SVG=$(wildcard figs/*.svg) 6 | #FIGS_PDF=$(FIGS_SVG:%.svg=%.pdf) 7 | 8 | #./figs/%.pdf: ./figs/%.svg ## Figures for the manuscript 9 | # inkscape -D -z --file=$< --export-pdf=$@ --export-latex 10 | 11 | #FIGS_SVG2=$(wildcard images_adv/*.svg) 12 | #FIGS_PDF2=$(FIGS_SVG2:%.svg=%.pdf) 13 | 14 | #./images_adv/%.pdf: ./images_adv/%.svg ## Figures for the manuscript 15 | # inkscape -D -z --file=$< --export-pdf=$@ --export-latex 16 | 17 | 18 | # all: $(FIGS_PDF2) $(FIGS_PDF) ## Build full thesis (LaTeX + figures) 19 | # pdflatex $(PAPER) 20 | # pdflatex $(PAPER) 21 | # bibtex $(PAPER) 22 | # pdflatex $(PAPER) 23 | # pdflatex $(PAPER) 24 | 25 | all: 26 | pdflatex $(PAPER) 27 | pdflatex $(PAPER) 28 | bibtex $(PAPER) 29 | pdflatex $(PAPER) 30 | pdflatex $(PAPER) 31 | 32 | clean: ## Clean LaTeX and output figure files 33 | rm -f *.out *.aux *.log *.blg *.bbl 34 | # rm -f $(FIGS_PDF) 35 | 36 | #watch: ## Recompile on any update of LaTeX or SVG sources 37 | # @while [ 1 ]; do; inotifywait $(ALL); sleep 0.01; make all; done 38 | -------------------------------------------------------------------------------- /notes/low_rank_jac/amartya_ltx.sty: -------------------------------------------------------------------------------- 1 | 2 | % Theorem Environments 3 | 4 | \newtheorem{thm}{Theorem} 5 | \newtheorem{lem}[thm]{Lemma} 6 | \newtheorem{corollary}[thm]{Corollary} 7 | \newtheorem{claim}[thm]{Claim} 8 | \newtheorem{proposition}[thm]{Proposition} 9 | \newtheorem{remark}{Remark} 10 | \newtheorem{defn}{Definition} 11 | \newtheorem{example}{Example} 12 | \newtheorem{assump}{Assumption} 13 | 14 | 15 | \def\LatinUpper{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z} 16 | \def\LatinLower{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} 17 | 18 | 19 | 20 | 21 | % Caligraphic fonts 22 | \newcommand{\genCal}[1]{\expandafter\newcommand\csname c#1\endcsname{{\mathcal #1}}} 23 | \@for\q:=\LatinUpper\do{% 24 | \expandafter\genCal\q 25 | } 26 | 27 | % Blackboard fonts 28 | \newcommand{\genBb}[1]{\expandafter\newcommand\csname b#1\endcsname{{\mathbb #1}}} 29 | \@for\q:=\LatinUpper\do{% 30 | \expandafter\genBb\q 31 | } 32 | 33 | % Fraktur fonts 34 | \newcommand{\genFk}[1]{\expandafter\newcommand\csname k#1\endcsname{{\mathfrak #1}}} 35 | \@for\q:=\LatinUpper\do{% 36 | \expandafter\genFk\q 37 | } 38 | 39 | \newcommand{\genFkl}[1]{\expandafter\newcommand\csname k#1\endcsname{{\mathfrak #1}}} 40 | \@for\q:=\LatinLower\do{% 41 | \expandafter\genFkl\q 42 | } 43 | 44 | 45 | % Vectors 46 | \renewcommand{\vec}[1]{{\mathbf{#1}}} 47 | \newcommand{\genLatinVec}[1]{\expandafter\newcommand\csname v#1\endcsname{{\vec #1}}} 48 | \@for\q:=\LatinLower\do{% 49 | \expandafter\genLatinVec\q 50 | } 51 | 52 | 53 | % Greek symbol vectors 54 | \def\mydefgreek#1{\expandafter\def\csname v#1\endcsname{\text{\boldmath$\mathbf{\csname #1\endcsname}$}}} 55 | 
\def\mydefallgreek#1{\ifx\mydefallgreek#1\else\mydefgreek{#1}% 56 | \lowercase{\mydefgreek{#1}}\expandafter\mydefallgreek\fi} 57 | \mydefallgreek {alpha}{beta}{gamma}{delta}{epsilon}{zeta}{eta}{theta}{iota}{kappa}{lambda}{mu}{nu}{xi}{omicron}{pi}{rho}{sigma}{tau}{upsilon}{phi}{chi}{psi}{omega}\mydefallgreek 58 | 59 | % Parentheses 60 | \newcommand{\bc}[1]{\left\{{#1}\right\}} 61 | \newcommand{\br}[1]{\left({#1}\right)} 62 | \newcommand{\bs}[1]{\left[{#1}\right]} 63 | \newcommand{\abs}[1]{\left| {#1} \right|} 64 | \newcommand{\ceil}[1]{\left\lceil #1 \right\rceil} 65 | \newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor} 66 | \newcommand{\bsd}[1]{\left\llbracket{#1}\right\rrbracket} 67 | \newcommand{\ip}[2]{\left\langle{#1},{#2}\right\rangle} 68 | 69 | % Vector notations 70 | \newcommand{\reals}{\mathbb{R}} 71 | 72 | %Important functions 73 | \newcommand{\sgn}[1]{\mathrm{sign}(#1)} 74 | \newcommand{\diag}[1]{\mathrm{diag}\left(#1\right)} 75 | \newcommand{\rank}[1]{\mathrm{rank}\left(#1\right)} 76 | \newcommand{\rad}[2]{\mathrm{RAD}_{#2}(#1)} 77 | \newcommand{\supp}{\mathop{\mathrm{sup}}} 78 | \newcommand{\inff}{\mathop{\mathrm{inf}}} 79 | \newcommand{\argmax}{\mathop{\mathrm{argmax}}} 80 | \newcommand{\argmin}{\mathop{\mathrm{argmin}}} 81 | \newcommand{\norm}[1]{\mathrm{\left\lVert#1\right\rVert}} 82 | 83 | % Complexity operators 84 | \newcommand{\bigO}[1]{O\left(#1\right)} 85 | \newcommand{\softO}[1]{\widetilde{\cO}\br{{#1}}} 86 | \newcommand{\Om}[1]{\Omega\br{{#1}}} 87 | \newcommand{\softOm}[1]{\tilde\Omega\br{{#1}}} 88 | -------------------------------------------------------------------------------- /notes/low_rank_jac/low_rank_jac_thm.tex: -------------------------------------------------------------------------------- 1 | \documentclass[a4paper]{article} 2 | \usepackage[utf8]{inputenc} % allow utf-8 input 3 | \usepackage[T1]{fontenc} % use 8-bit T1 fonts 4 | \usepackage{hyperref} % hyperlinks 5 | \usepackage{url} % simple URL typesetting 6 | \usepackage{booktabs} % professional-quality tables 7 | \usepackage{amsfonts} % blackboard math symbols 8 | \usepackage{nicefrac} % compact symbols for 1/2, etc. 
9 | \usepackage{microtype} % microtypography 10 | 11 | \setlength{\headheight}{14.0pt} % Removes fancy header warning (Not sure what it does) 12 | \usepackage[margin=3cm]{geometry} % To edit margins and their format 13 | \usepackage[english]{babel} % language 14 | \usepackage{indentfirst} % First paragraph of each section / subsection 15 | \usepackage{listings} % Show code 16 | \usepackage{fancyhdr} % Headers y footers 17 | \usepackage{multicol} % http://stackoverflow.com/questions/1491717/how-to-display-a-content-in-two-column-layout-in-latex 18 | \usepackage{blindtext} % For the cool paragraph (Enter after the paragraph section) 19 | \usepackage{textcomp} 20 | \usepackage{bussproofs} 21 | \usepackage{enumitem} % To enum with letters and other things 22 | \usepackage{leftidx} % left superindices 23 | \usepackage{euscript} % Fancy A and S for symmetry groups (among other things) 24 | \usepackage{dsfont} 25 | 26 | 27 | 28 | \usepackage{hyperref} 29 | \usepackage{enumerate} 30 | %\usepackage{enumitem} 31 | 32 | \usepackage{nicefrac} 33 | \usepackage{mathtools} 34 | \usepackage{amssymb} 35 | \usepackage{amsthm} 36 | \usepackage{bbm} 37 | 38 | 39 | \usepackage{algpseudocode} 40 | %\usepackage{algorithmic} 41 | \usepackage{algorithm} 42 | 43 | 44 | %%% DAVID%%%% 45 | 46 | % Totally necessary: always writes correctly epsilon 47 | \let\temp\epsilon 48 | \let\epsilon\varepsilon 49 | \let\varepsilon\temp 50 | \renewcommand{\star}{\ast} 51 | 52 | % My definitions 53 | \newcommand{\Ss}{{\EuScript S}} 54 | \newcommand{\Aa}{{\EuScript A}} 55 | \newcommand{\Ab}{\text{Ab}} 56 | 57 | 58 | \newcommand{\x}{{\tt x}} \newcommand{\y}{{\tt y}} 59 | \newcommand{\z}{{\tt z}} \renewcommand{\t}{{\tt t}} 60 | \newcommand{\s}{{\tt s}} \newcommand{\ww}{{\tt w}} 61 | \newcommand{\uu}{{\tt u}} 62 | \newcommand{\Var}[1]{\text{Var}\left[#1\right]} 63 | \newcommand{\Cov}[1]{\text{Cov}\left[#1\right]} 64 | \renewcommand{\P}[1]{\mathbb{P}\left[#1\right]} 65 | \newcommand{\Vart}{\text{Var}} 66 | \newcommand{\E}[1]{\mathbb{E}\left[ #1 \right]} 67 | \newcommand{\R}{\mathbb{R}} 68 | \newcommand{\Z}{\mathbb{Z}} 69 | \newcommand{\N}{\mathbb{N}} 70 | \newcommand{\pa}[1]{\left( #1\right)} 71 | %\newcommand{\norm}[1]{\left\| #1 \right\|} 72 | %\newcommand{\abs}[1]{\left| #1 \right|} 73 | %\renewcommand{\dot}[1]{\left\langle #1\right\rangle} 74 | \renewcommand{\L}{\mathscr{L}} 75 | \newcommand{\dirich}[1]{\mathcal{E}\left( #1 \right)} 76 | \newcommand{\grad}{\nabla} 77 | \renewcommand{\exp}[1]{\text{exp}\left(#1\right)} 78 | \newcommand{\Ent}[1]{\text{Ent}\left[#1\right]} 79 | \newcommand{\Entt}{\text{Ent}} 80 | \newcommand{\Lip}{\text{Lip}} 81 | \newcommand{\diam}[1]{\text{diam}\left(#1\right)} 82 | 83 | \newcommand{\one}[1]{\mathds{1}} 84 | %\newcommand{\ip}[2]{\left\langle{#1},{#2}\right\rangle} 85 | 86 | %%%%%% 87 | 88 | \usepackage{amartya_ltx} 89 | \title{Generelization Guarantees through Low Rank Jacobian} 90 | \author{} 91 | \date{} 92 | \begin{document} 93 | \maketitle 94 | 95 | 96 | \section{Generalization Guarantees For Neural Nets Via Harnessing the Low-Rankness of Jacobian} 97 | 98 | 99 | \subsection*{Definitions and notations.} 100 | \begin{itemize} 101 | \item $n$: number of samples. 102 | \item $d$: dimension of training data. 103 | \item $K$: Number of classes, dimension of the output. 104 | \item One hidden layer neural network with the form 105 | \[ 106 | x \mapsto f(x ; W):= V \phi(W x). 
107 | \] 108 | where $x\in\R^d$, $W\in \R^{k\times d}$, $V\in \R^{K\times k}$ and $\phi$ is an activation function that acts component-wise. Only $W$ is trained for simplicity in this work (but it is outlined how results can be generalized to the case in which $V$ is also trained). We use the shorthand 109 | \[ 110 | f(W) = [f(x_1;W)^\top, \dots, f(x_n;W)^\top]^\top \in \R^{nK}. 111 | \] 112 | \item $(x_i, y_i) \in \R^d\times\R^K, 1\leq i\leq n$: training data and corresponding labels (one-hot encodings). 113 | \item $\eta$: learning rate for gradient descent. 114 | \item $\theta \in \R^{kd}$: vectorized parameters of the neural network. We will denote $p=kd$. 115 | \item $\tilde{\theta} \in \R^{\max(Kn,p)}$: parameters of the linearized problem (more on this below). 116 | \item $\bar{\theta} \in \R^{\max(Kn,p)}$: $\theta$ (possibly) padded with $x$ zeroes so it has the same length as $\tilde{\theta}$. 117 | \item $y = (y_1^\top,\dots, y_n^\top)^\top \in \R^{nK}$: concatenation of labels. 118 | \item The loss function used in the optimization is the $\ell_2$ loss: 119 | \[ 120 | \mathcal{L}(W) = \frac{1}{2} \norm{f(W)-y}_2^2. 121 | \] 122 | \item The optimization algorithm is gradient descent, starting from an initialization $W_0$: 123 | \[ 124 | W_{\tau+1}=W_{\tau}-\eta \nabla \mathcal{L}\left(W_{\tau}\right). 125 | \] 126 | \item (Remember we use $\theta\in\R^{p}$ for the vectorization of $W$). We use 127 | \[ 128 | \mathcal{J}(\theta) = \frac{\partial f(\theta)}{\partial \theta} \in \R^{Kn\times p}\text{ so that } \theta_{\tau+1} = \theta_\tau - \eta \nabla \mathcal{L}(\theta_\tau) \text{ and } \nabla\mathcal{L}(\theta) = \mathcal{J}(\theta)^\top r(\theta). 129 | \] 130 | where we define the residual $r(\theta)$ as $f(\theta)- y$. 131 | \item \textbf{Information and Nuisance spaces}: For a matrix $J \in \R^{nK\times p}$ (that will typically be a Jacobian), consider its singular value decomposition 132 | \[ 133 | J=\sum_{s=1}^{n K} \lambda_{s} u_{s} v_{s}^{T}=U \operatorname{diag}\left(\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n K}\right) V^{T} 134 | \] 135 | with $\lambda_{1} \geq \lambda_{2} \geq \ldots \geq \lambda_{n K}$ and $u_s \in \R^{Kn}$, $v_s \in \R^p$ being the left and right singular vectors respectivelyi For a spectrum cutoff $0<\alpha<\lambda_1$ let $c = c(\alpha)$ denote the index of the smallest singular value above $\alpha$. Then the information and nuisance space associated with $J$ are defined as 136 | \[ 137 | \mathcal{I}:=\operatorname{span}\left(\left\{\boldsymbol{u}_{s}\right\}_{s=1}^{c}\right) \text { and } \mathcal{N}:=\operatorname{span}\left(\left\{\boldsymbol{u}_{s}\right\}_{s=c+1}^{K n}\right). 138 | \] 139 | \item Multiclass Neural Tangent Kernel (M-NTK). Let $w \sim \mathcal{N}(0, I_d)$. Consider $n$ input data points $x_1, \dots, x_n \in \R^d$ aggregated in $X \in \R^{n\times d}$ and activation $\phi$ (it is assumed to be Lipschitz and smooth but the authors argue that they assume it for simplicity and outline how the result could be extended to use relu as activation). We define the multiclass kernel 140 | \[ 141 | \Sigma(X):=I_{K} \otimes \mathbb{E}\left[\left(\phi^{\prime}(X w) \phi^{\prime}(X w)^{T}\right) \odot\left(X X^{T}\right)\right], 142 | \] 143 | where $\otimes$ is the Kronecker product and $\odot$ is the Hadamard product. 
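A minimal Monte Carlo sketch of this definition (illustrative only, not from the paper; it assumes NumPy, and \texttt{phi\_prime} stands for $\phi'$ while the concrete sizes are made up):
\begin{lstlisting}[language=Python]
import numpy as np

def mntk(X, K, phi_prime, num_samples=20000, seed=0):
    # Monte Carlo estimate of the multiclass kernel Sigma(X)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = np.zeros((n, n))
    G = X @ X.T
    for _ in range(num_samples):
        w = rng.standard_normal(d)
        a = phi_prime(X @ w)
        S += np.outer(a, a) * G   # Hadamard product with X X^T
    return np.kron(np.eye(K), S / num_samples)

X = np.random.default_rng(1).standard_normal((5, 3))
Sigma = mntk(X, K=2, phi_prime=lambda z: 1.0 / np.cosh(z) ** 2)  # phi = tanh
\end{lstlisting}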
This kernel is closely related to the Jacobian, it is known that $\mathbb{E}\left[\mathcal{J}\left(W_{0}\right) \mathcal{J}\left(W_{0}\right)^{T}\right]=\nu^{2} \Sigma(X)$ if $V$ has i.i.d zero-mean entries with $\frac{\nu^2}{K}$ variance and $W_0$ has i.i.d. $\mathcal{N}(0,1)$ entries. 144 | 145 | \end{itemize} 146 | 147 | \section{Overview} 148 | 149 | This work is along the lines of previous works that work with the NTK. In particular, using overparametrization, they will prove that the problem is close to its linearization $f(\theta) \approx f_{\operatorname{lin}}(\theta) = f(\theta_0) + \mathcal{J}(\theta_0)(\theta - \bar{\theta}_0)$ (since we will find solutions close to $\theta_0$) and that will allow them to state their optimization theorem for neural networks, to be explained later. The main trait of this work is that they remove the main assumption on the data (but the unit length assumption) made by other works and as a result their results incur some bias. In particular instead of assuming that two data points are not parallel, proving using this that the NTK is positive semi-definite and having a dependence (in terms of overparametrization and number of iterations needed) on the inverse minimum eigenvalue of the $NTK$ (e.g. \cite{du2018gradient}) or instead of assuming that any two data points satisfy $\norm{x_i-x_j} \geq \delta$ and having a dependence (in terms of overparametrization and number of iterations needed) on the inverse of $\delta$ (e.g. \cite{allen2018convergence}), they allow the NTK to have $0$ or very small eigenvalues and split the space into the information space (span of the first top left singular vectors of the jacobian at initialization or equivalently, first top eigenvectors of the NTK) and the nuisance space, proving that now the dependence (in terms of overparametrization and number of optimization time steps needed) is on the inverse of the lowest eigenvalue of the information space and the projection of the residual on the information space decreases exponentially while the projection of the residual on the nuisance space increases by a constant factor. They also work with the setting of arbitrary initialization in which under some assumptions, they can follow similar arguments to those made in the NTK, so they obtain an optimization guarantee in such a case. Also, the optimization guarantee translates to a generalization guarantee via the use of standard Rademacher complexity arguments. 150 | 151 | We outline now the main approach followed to prove their main (meta-)theorem: 152 | 153 | \begin{itemize} 154 | \item Provided that our network has enough overparametrization, we can \textbf{relate the training of the neural network with gradient descent with a linear method}. This is in the sense that both the trajectory and the residuals of the linear method and the residuals will be close. Given an initial point $\theta_0 \in \R^p$, define an $(\epsilon_0, \beta)$ reference Jacobian $J \in \R^{Kn\times \max(Kn,p)}$ a matrix satisfying: 155 | \[ 156 | \|J\| \leq \beta, \quad\left\|\mathcal{J}\left(\theta_{0}\right) \mathcal{J}^{T}\left(\theta_{0}\right)-J J^{T}\right\| \leq \epsilon_{0}^{2}, \quad \text { and } \quad\left\|\overline{\mathcal{J}}\left(\theta_{0}\right)-J\right\| \leq \epsilon_{0} 157 | \] 158 | where $\overline{\mathcal{J}}(\theta_0) \in \R^{Kn \times \max(Kn,p)}$ is a matrix obtained by appending $\max(0, Kn-p)$ zero columns to $\mathcal{J}(\theta_0)$ (note $\mathcal{J}(\theta_0) \in \R^{Kn\times p}$). 
159 | 160 | In the random initialization setting, the reference Jacobian will be the NTK. In the arbitrary initialization setting, the reference Jacobian will be the Jacobian at that initialization. 161 | 162 | The bounded spectra of $J$ will be an assumption in the arbitrary initialization and a consequence of the properties of the NTK in the other case. The reason why the other two conditions are true for the random initialization regime is that in the overparametrization regime the NTK of the finite net tends to the infinite width limit NTK. 163 | \item \textbf{Bounded perturbation.} Due to the overparametrization and the small choice of the learning rate we will have 164 | \[ 165 | \norm{\theta_0-\theta_\tau} < R, 166 | \] 167 | for a constant $R$ for all $t$ between $0$ and $T$, where $T$ is picked later. This will along with overparametrization imply 168 | \[ 169 | \norm{J(\theta_0)-J(\theta_\tau)} < \epsilon. 170 | \] 171 | for a constant $\epsilon$. 172 | \item Now if we followed other works on the NTK, we would \textbf{analyze the linear case} and would see that the residual of the linearized problem, $\tilde{r}_\tau$ evolves in a precise sense 173 | \[ 174 | \widetilde{r}_{\tau}=U\left(I-\eta \Lambda^{2}\right)^{\tau} a=\sum_{s=1}^{n K}\left(1-\eta \lambda_{s}^{2}\right)^{\tau} a_{s} u_{s} 175 | \] 176 | where we are using the matrices $U$ and $\Lambda$ that come from the singular value decomposition of the reference Jacobian $J = U\Lambda V$. Also, $\lambda_s$ are the diagonal entries of $\Lambda$, $u_s$ are the rows of $U$ and $a$ is a vector whose value is the projection of the initial residual via $U$, i.e. $a = U^\top\tilde{r}_0 = U^\top r_0$. Previous approaches used that $\lambda_{nK}^2$, (the smallest one) is positive, and set a good value of the overparametrization and learning rate parameters (high and low respectively) to show that the corresponding eigenvalue for the Jacobian at initialization is positive too and finally, they used the bounded perturbation property to conclude that the residual also decreases with time. In this work, we follow this approach only for the information space, and since there is no assumption on $\lambda_{nK}^2$ being $>0$, the approximation error incurred by the linearization could mean that the projection of the residual on the nuisance space is increasing. However, if it increases it does it at a slow pace, since the approximation error is low in the overparametrization regime. In particular, we have, for the linearized regime 177 | \[ 178 | \left\|\widetilde{r}_{\tau}\right\|_{\ell_{2}} \leq\left(1-\eta \alpha^{2}\right)^{\tau}\left\|\Pi_{\mathcal{I}}\left(r_{0}\right)\right\|_{\ell_{2}}+\left\|\Pi_{\mathcal{N}}\left(r_{0}\right)\right\|_{\ell_{2}}. 179 | \] 180 | and if we define $e_{\tau+1}=r_{\tau+1}-\widetilde{r}_{\tau+1}$ then it obeys (assuming small learning rate, in particular $\eta<\beta^2$): 181 | \[ 182 | \left\|e_{\tau+1}\right\|_{\ell_{2}} \leq \eta\left(\epsilon_{0}^{2}+\epsilon \beta\right)\left\|\widetilde{r}_{\tau}\right\|_{\ell_{2}}+\left(1+\eta \epsilon^{2}\right)\left\|e_{\tau}\right\|_{\ell_{2}} 183 | \] 184 | which intuitively means that the error increases by a summand that is of the order of the residual plus a multiplicative expansion with respect to the previous error, due to the nuisance space. However, the rate of increase is small enough so that after $T$ iterations the error will be controlled. Once we have this, we can proceed to the next step. 
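To fill in the step behind the displayed evolution of $\widetilde{r}_\tau$ (a short derivation, not copied verbatim from the paper): since $\nabla \mathcal{L}_{\operatorname{lin}}(\widetilde{\theta})=J^{\top}\left(f_{\operatorname{lin}}(\widetilde{\theta})-y\right)=J^{\top} \widetilde{r}$, the linearized gradient descent step gives
\[
\widetilde{r}_{\tau+1}=f\left(\theta_{0}\right)+J\left(\widetilde{\theta}_{\tau+1}-\bar{\theta}_{0}\right)-y=\widetilde{r}_{\tau}-\eta J J^{\top} \widetilde{r}_{\tau}=\left(I-\eta J J^{\top}\right) \widetilde{r}_{\tau},
\]
and writing $J=U \Lambda V^{\top}$, so that $J J^{\top}=U \Lambda^{2} U^{\top}$, yields $\widetilde{r}_{\tau}=U\left(I-\eta \Lambda^{2}\right)^{\tau} U^{\top} \widetilde{r}_{0}$, which is the formula above with $a=U^{\top} r_{0}$. Splitting the sum over singular values above and below the cutoff $\alpha$, and using $0 \leq 1-\eta \lambda_{s}^{2} \leq 1$ (valid when $\eta \leq 1/\beta^{2}$), gives exactly the stated bound
\[
\left\|\widetilde{r}_{\tau}\right\|_{\ell_{2}} \leq\left(1-\eta \alpha^{2}\right)^{\tau}\left\|\Pi_{\mathcal{I}}\left(r_{0}\right)\right\|_{\ell_{2}}+\left\|\Pi_{\mathcal{N}}\left(r_{0}\right)\right\|_{\ell_{2}}.
\]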
185 | \item Use overparametrization (and bounded perturbation) to prove that in particular one has 186 | \[ 187 | \left\|r_{\tau}-\widetilde{r}_{\tau}\right\|_{\ell_{2}} \leq \frac{3}{5} \frac{\delta \alpha}{\beta}\left\|r_{0}\right\|_{\ell_{2}} \quad \text { and } \quad\left\|\overline{\theta}_{\tau}-\widetilde{\theta}_{\tau}\right\|_{\ell_{2}} \leq \delta \frac{\Gamma}{\alpha}\left\|r_{0}\right\|_{\ell_{2}}, 188 | \] 189 | where $\delta$ is a hyperparameter and $\Gamma$ is another hyperparameter that modulates the total number of time steps (which is chosen to be $T = \frac{\Gamma}{\eta \alpha^2}$). Finally, $\bar{\theta}$ is equal to $\theta\in\R^{p}$ padded with zeros to size $\max(Kn,p)$. 190 | \item Prove that the initial residual is bounded. In the random initialization regime, this will be a property one can prove about the NTK. In the arbitrary initialization regime, it is an assumption. 191 | \item \textbf{Put it all together} to conclude 192 | \[ 193 | \left\|r_{T}\right\|_{\ell_{2}} \leq e^{-\Gamma}\left\|\Pi_{\mathcal{I}}\left(r_{0}\right)\right\|_{\ell_{2}}+\left\|\Pi_{\mathcal{N}}\left(r_{0}\right)\right\|_{\ell_{2}}+\frac{\delta \alpha}{\beta}\left\|r_{0}\right\|_{\ell_{2}}. 194 | \] 195 | 196 | 197 | \end{itemize} 198 | 199 | 200 | 201 | \subsection{Some Proofs} 202 | \label{sec:some-proofs} 203 | 204 | In this section, we give more precise statements and prove some of the 205 | things we talked about in the previous section. 206 | 207 | 208 | We will first make the following two assumptions about the Jacobians 209 | of our non-linear models. We will see that these assumptions hold when 210 | these non-linear models are two-layer neural networks with smooth 211 | activation functions. 212 | 213 | \begin{assump}[$\beta$-Bounded spectrum]\label{assump:assump1} 214 | The non-linear function $f:\reals^p\rightarrow\reals^n$ satisfies 215 | the $\beta$-bounded spectrum assumption when the Jacobian 216 | associated with $f$ satisfies the following for all 217 | $\theta\in\reals^p$: 218 | \begin{equation} 219 | \label{eq:bound-spec-assump} 220 | \norm{\cJ\br{\vec{\theta}}}\le\beta 221 | \end{equation} 222 | \end{assump} 223 | 224 | \begin{assump}[$\br{\epsilon,R,\theta_0}$-bounded Jacobian perturbation]\label{assump:assump2} 225 | The non-linear function $f:\reals^p\rightarrow\reals^n$ satisfies 226 | the $\br{\epsilon,R,\theta_0}$-bounded Jacobian perturbation assumption when 227 | the following is satisfied for all $\theta\in\reals^p$ such that 228 | $\norm{\theta-\theta_0}\le R$: 229 | \begin{equation} 230 | \label{eq:bound-pert-assump} 231 | \norm{\cJ\br{\theta} - \cJ\br{\theta_0}}\le\dfrac{\epsilon}{2} 232 | \end{equation} 233 | \end{assump} 234 | We will be looking at the following meta-theorem. 235 | 236 | \begin{thm}\label{thm:meta-thm1} 237 | Consider a non-linear least squares problem of the form 238 | $\cL\br{\theta} = \frac{1}{2}\norm{f\br{\theta} - y}_2^2$ with 239 | $f:\reals^p\rightarrow\reals^{nK}$ the multi-class non-linear 240 | mapping, $\theta\in\reals^p$ the parameters of the model, and 241 | $\vec{y}\in\reals^{nK}$ the concatenated labels.
Let $\bar{\theta}$ be $\theta$ zero-padded to 242 | size $\max\br{Kn,p}$. Also consider a point 243 | $\theta_0\in\reals^p$ with $\vec{J}$ an $\br{\epsilon_0,\beta}$ 244 | reference Jacobian associated with $\cJ\br{\theta_0}$, and fit the linearized problem 245 | $f_{\mathrm{lin}}\br{\widetilde{\theta}} = f\br{\theta_0} + 246 | \vec{J}\br{\widetilde{\theta} - \bar{\theta}_0}$ via the loss 247 | $\cL_{\mathrm{lin}}\br{\theta} = 248 | \frac{1}{2}\norm{f_{\mathrm{lin}}\br{\theta} - y}_2^2$.\\ 249 | % \\ \noindent\textbf{\textbullet Information and Nuisance 250 | % subspace:} 251 | 252 | Furthermore define the information $\cI$ and nuisance $\cN$ 253 | subspaces and the truncated Jacobian $\vec{J}_{\cI}$ associated with 254 | the reference Jacobian $\vec{J}$ based on a cut-off spectrum value 255 | of $\alpha$.\\ 256 | 257 | Furthermore fix a tolerance 258 | level $0<\delta\le 1$ and a stopping-time parameter $\Gamma\ge 1$, and assume the 259 | Jacobian mapping $\cJ\br{\theta}\in\reals^{nK\times p}$ associated 260 | with $f$ obeys the $\beta$-bounded spectrum assumption~(Assumption \ref{assump:assump1}) 261 | and the $\br{\epsilon,R,\theta_0}$-bounded Jacobian perturbation 262 | assumption~(Assumption \ref{assump:assump2}) for 263 | \begin{equation} 264 | \label{eq:theta_diam} 265 | %\norm{\theta - \theta_0}\le 266 | R := 2\br{\norm{\vec{J}_\cI^\dagger 267 | \vec{r}_0}_2 + 268 | \frac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}} + \delta\frac{\Gamma}{\alpha}\norm{\vec{r}_0}_2} 269 | \end{equation} 270 | and 271 | \begin{equation} 272 | \label{eq:epsilon_upper} 273 | \epsilon\le \dfrac{\delta\alpha^3}{5\Gamma\beta^2} 274 | \end{equation} 275 | %\\ \noindent\textbf{\textbullet Closeness of Reference Jacobian$\vec{J}~\br{\epsilon_0^2}$ and True 276 | % Jacobian~$\cJ\br{\theta}~\br{\epsilon}$ to inital 277 | % Jacobian~$\cJ\br{\theta_0}$}: 278 | Finally assume the following regarding the reference Jacobian: 279 | \begin{equation} 280 | \label{eq:epsilon_zero_upper} 281 | \epsilon_0\le \dfrac{\min\br{\delta\alpha, \sqrt{\frac{\delta\alpha^3}{\Gamma\beta}}}}{5} 282 | \end{equation} 283 | 284 | 285 | We run gradient descent iterations of the form a) original problem: $\theta_{\tau + 1} = 286 | \theta_\tau - \eta\nabla\cL\br{\theta_\tau}$ and 287 | b) linearized problem: $\widetilde{\theta}_{\tau+1} = \widetilde{\theta}_\tau - 288 | \eta\nabla\cL_{\mathrm{lin}}\br{\widetilde{\theta}_\tau}$, starting from $\theta_0$ with step 289 | size $\eta$ obeying $\eta\le \frac{1}{\beta^2}$. 290 | 291 | Then for all iterations $0\le \tau\le T:= \frac{\Gamma}{\eta\alpha^2}$, 292 | the iterates of the original $\br{\theta_\tau}$ and linearized 293 | $\br{\widetilde{\theta}_\tau}$ problems and the corresponding 294 | residuals $\vec{r}_\tau:=f\br{\theta_\tau} - \vec{y}$ and 295 | $\widetilde{\vec{r}}_\tau:=f_{\mathrm{lin}}\br{\widetilde{\theta}_\tau} 296 | - \vec{y}$ closely track each other.
297 | 298 | That is: 299 | \begin{itemize} 300 | \item \textbf{Original and linear residuals are close}: \begin{equation} 301 | \label{eq:residual_close} 302 | \norm{\vec{r}_\tau - \widetilde{\vec{r}}_\tau}\le \dfrac{3}{5}\dfrac{\delta\alpha}{\beta}\norm{\vec{r}_0} 303 | \end{equation} 304 | \item \textbf{Original and linearized parameters are close}: 305 | \begin{equation} 306 | \label{eq:param_close} 307 | \norm{\bar{\theta}_\tau - \widetilde{\theta}_\tau}\le \delta\dfrac{\Gamma}{\alpha}\norm{\vec{r}_0} 308 | \end{equation} 309 | \item \textbf{Original iterates are close to initialization}: Furthermore, for all iterates $0\le \tau\le 310 | T:=\dfrac{\Gamma}{\eta\alpha^2}$, we have that the original parameters 311 | $\theta_\tau$ are close to the initial parameters: 312 | \begin{equation} 313 | \label{eq:final_param_close} 314 | \norm{\theta_\tau - \theta_0}\le \dfrac{R}{2} = 315 | \norm{\vec{J}_{\cI}^\dagger\vec{r}_0}_2 + 316 | \dfrac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}}_2 + \delta\dfrac{\Gamma}{\alpha}\norm{\vec{r}_0}_2 317 | \end{equation} 318 | \item \textbf{Final non-linear residual is bounded}: at $\tau=T$, we have that 319 | \begin{equation} 320 | \label{eq:final_residual} 321 | \norm{\vec{r}_T}_2\le e^{-\Gamma}\norm{\Pi_\cI\br{\vec{r}_0}}_2 + 322 | \norm{\Pi_\cN\br{\vec{r}_0}} + \dfrac{\delta\alpha}{\beta}\norm{\vec{r}_0}_2 323 | \end{equation} 324 | \end{itemize} 325 | \end{thm} 326 | 327 | First, we will show that the difference between the non-linear and the 328 | linear residuals at the $\tau^{\it th}$ time step is of the order of 329 | the difference at the previous time step plus a term linear in the 330 | residual of the linearized problem. Precisely, it is stated as 331 | follows. 332 | 333 | \begin{lem}[Lemma 6.7]\label{lem:pert-one-step} 334 | Assume Assumption~\ref{assump:assump1}~(with $\beta$) and 335 | Assumption~\ref{assump:assump2}~(with $\br{\epsilon,R,\theta_0}$) hold and 336 | $\theta_\tau$ and $\theta_{\tau + 1}$ are within an $R$-neighbourhood 337 | of the initialization $\theta_0$, 338 | i.e. $\norm{\theta_\tau - \theta_0}\le R$ and 339 | $\norm{\theta_{\tau+1}- \theta_0}\le R$. 340 | 341 | Then, on running gradient 342 | descent with $\eta\le \frac{1}{\beta^2}$,
the difference between the 343 | non-linear and the linear residuals $\vec{e}_{\tau+1} = 344 | \vec{r}_{\tau+1} - \widetilde{\vec{r}}_{\tau+1}$ satisfies 345 | 346 | \begin{equation} 347 | \label{eq:growth_of_res_error} 348 | \norm{\vec{e}_{\tau+1}}_2 \le \eta\br{\epsilon_0^2 + 349 | \epsilon\beta}\norm{\widetilde{\vec{r}}_\tau}_2 + \br{1 + \eta\epsilon^2}\norm{\vec{e}_\tau}_2 350 | \end{equation} 351 | \end{lem} 352 | \begin{proof} 353 | Let $\vec{A} = \cJ\br{\theta_0},\vec{B}_2=\cJ\br{\theta_\tau}$ and 354 | \[\vec{B}_1=\cJ\br{\theta_\tau,\theta_{\tau+1}} = 355 | \int_0^1\cJ\br{t\theta_{\tau+1}+\br{1-t}\theta_\tau}dt\] 356 | By a $0^{\it th}$ order Taylor expansion with integral remainder, we can 357 | write 358 | \begin{align*} 359 | f\br{\theta_{\tau+1}} &= f\br{\theta_\tau - 360 | \eta\nabla\cL\br{\theta_\tau}} = f\br{\theta_\tau} - 361 | \eta\vec{B}_1\nabla\cL\br{\vec{\theta_\tau}}\\ 362 | &= f\br{\theta_\tau} - 363 | \eta\vec{B}_1\vec{B}_2^\top\br{f\br{\theta_\tau} 364 | - \vec{y}}\\ 365 | \vec{r}_{\tau+1} = f\br{\theta_{\tau+1}} - \vec{y} &= \br{\vec{I} 366 | - \eta\vec{B}_1\vec{B}_2^\top}\vec{r}_\tau 367 | \end{align*} 368 | For the linear problem, we have 369 | \[\widetilde{\vec{r}}_{\tau+1} = \br{\vec{I} - 370 | \eta\vec{J}\vec{J}^\top}\widetilde{\vec{r}}_\tau\] 371 | Thus 372 | \begin{align*} 373 | \norm{\vec{e}_{\tau+1}} = \norm{{\vec{r}}_{\tau+1} - 374 | \widetilde{\vec{r}}_{\tau+1}} &= \norm{\br{\vec{I} 375 | - 376 | \eta\vec{B}_1\vec{B}_2^\top}\vec{r}_\tau 377 | - \br{\vec{I} - 378 | \eta\vec{J}\vec{J}^\top}\widetilde{\vec{r}}_\tau}\\ 379 | &=\norm{\br{\vec{I} - 380 | \eta\vec{B}_1\vec{B}_2^\top}\vec{e}_\tau 381 | - \eta\br{\vec{B}_1\vec{B}_2^\top 382 | - \vec{J}\vec{J}^\top}\widetilde{\vec{r}}_\tau}\\ 383 | &\le\norm{\br{\vec{I} - 384 | \eta\vec{B}_1\vec{B}_2^\top}\vec{e}_\tau} 385 | + \eta\norm{\br{\vec{B}_1\vec{B}_2^\top 386 | - \vec{J}\vec{J}^\top}}\norm{\widetilde{\vec{r}}_\tau}\\ 387 | \end{align*} 388 | 389 | First, we bound $\norm{\br{\vec{I} - 390 | \eta\vec{B}_1\vec{B}_2^\top}\vec{e}_\tau}$ 391 | using the fact~(Lemma 6.3) that if 392 | $\vec{A},\vec{B}\in\reals^{n\times p}$ 393 | are matrices obeying 394 | $\norm{\vec{A}},\norm{\vec{B}}\le\beta$ and $\norm{\vec{B}-\vec{A}}\le\epsilon$, 395 | then 396 | for all $\vec{z}\in\reals^n$ and $\eta\le\frac{1}{\beta^2}$ 397 | we have that $\norm{\br{\vec{I} - 398 | \eta\vec{A}\vec{B}^\top}\vec{z}}\le\br{1+\eta\epsilon^2}\norm{\vec{z}}_2$; we apply this to the pair $\vec{B}_1,\vec{B}_2$, which satisfies $\norm{\vec{B}_1-\vec{B}_2}\le\epsilon$ by the perturbation assumption. 399 | Next, we bound $\norm{\br{\vec{B}_1\vec{B}_2^\top 400 | - 401 | \vec{J}\vec{J}^\top}}$ 402 | as follows. 403 | \begin{align*} 404 | \norm{\br{\vec{B}_1\vec{B}_2^\top 405 | - 406 | \vec{J}\vec{J}^\top}} &= \norm{\br{\vec{B}_1\vec{B}_2^\top 407 | -\vec{A}\vec{B}_2^\top + 408 | \vec{A}\vec{B}_2^\top 409 | - \vec{A}\vec{A}^\top 410 | + \vec{A}\vec{A}^\top 411 | - \vec{J}\vec{J}^\top}}\\ 412 | &\le \norm{\br{\vec{B}_1 413 | -\vec{A}}\vec{B}_2^\top} + 414 | \norm{\vec{A}\br{\vec{B}_2^\top 415 | - \vec{A}^\top}} 416 | + \norm{\vec{A}\vec{A}^\top 417 | - 418 | \vec{J}\vec{J}^\top}\\ 419 | &\le \beta\dfrac{\epsilon}{2} 420 | + 421 | \beta\dfrac{\epsilon}{2} 422 | + \epsilon_0^2 = \epsilon\beta + \epsilon_0^2 423 | \end{align*} 424 | Combining the two bounds yields the claimed inequality. \end{proof} 425 | 426 | Next, we will prove a lemma that will finally allow us to control the 427 | growth of the difference between the linear and the non-linear 428 | residuals. 429 | 430 | \begin{lem}[Lemma 6.8]\label{eq:growth-pert-lemma} 431 | Consider positive scalars $\Gamma,\alpha,\epsilon,\eta,\Theta>0$.
Also 432 | assume $\eta\le\frac{1}{\alpha^2}$ and 433 | $\alpha\ge\sqrt{2\Gamma}\epsilon$ and set 434 | $T=\frac{\Gamma}{\eta\alpha^2}$. For $0\le 435 | \tau\le T$ and non-negative scalars $\rho_-,\rho_+\ge 0$, assume that the scalar sequences 436 | $e_\tau$ and $\widetilde{r}_\tau$ obey the following: 437 | \begin{itemize} 438 | \item $e_0 = 0$ 439 | \item $\widetilde{r}_\tau\le\br{1 - \eta\alpha^2}^\tau \rho_+ + 440 | \rho_-$ 441 | \item $e_\tau\le \br{1 + \eta\epsilon^2}e_{\tau -1} + 442 | \eta\Theta\widetilde{r}_{\tau - 1}$ 443 | \end{itemize} 444 | 445 | Let $\Lambda = \dfrac{2\br{\Gamma\rho_- + \rho_+}}{\alpha^2}$. Then for all $0\le\tau\le T$, the following holds 446 | \[e_\tau\le \Theta\Lambda\] 447 | \end{lem} 448 | \begin{proof} 449 | We will prove this by induction. Note that $e_0=0$ satisfies the 450 | base case. Suppose $e_{t}\le\Theta\Lambda$ holds for all 451 | $t <\tau$. 452 | 453 | Then for all $0< t\le \tau$, 454 | \begin{align*} 455 | e_{t}&\le \br{1 + \eta\epsilon^2}e_{t - 456 | 1}+\eta\Theta\widetilde{r}_{t-1}\\ 457 | &\le e_{t - 1} + \eta\epsilon^2e_{t-1} + 458 | \eta\Theta\br{\br{1-\eta\alpha^2}^{t-1}\rho_+ + 459 | \rho_-}\\ 460 | &\le e_{t-1} + \eta\Theta\br{\epsilon^2\Lambda + \br{1-\eta\alpha^2}^{t-1}\rho_+ + 461 | \rho_- }\\ 462 | \dfrac{ e_{t} - 463 | e_{t-1}}{\Theta}&\le \eta\br{\epsilon^2\Lambda + \br{1 - 464 | \eta\alpha^2}^{t-1}\rho_++\rho_-}\\ 465 | \dfrac{ e_{\tau}}{\Theta} = \sum_{t=1}^\tau \dfrac{ e_{t} - 466 | e_{t-1}}{\Theta} &\le \eta\tau\br{\epsilon^2\Lambda + \rho_-}+ \eta\rho_+\sum_{t=1}^{\tau}{\br{1 - 467 | \eta\alpha^2}^{t-1}}\\ 468 | &=\eta\tau\br{\epsilon^2\Lambda + \rho_-}+ \eta\rho_+\dfrac{1 469 | - \br{1 - 470 | \eta\alpha^2}^{\tau}}{\eta\alpha^2}\\ 471 | &\le\eta T\br{\epsilon^2\Lambda + \rho_-}+ \dfrac{\rho_+}{\alpha^2}\\ 472 | &\le \dfrac{\Gamma\epsilon^2\Lambda + 473 | \Gamma\rho_-}{\alpha^2}+ \dfrac{\rho_+}{\alpha^2}\\ 474 | &= \dfrac{\Gamma\epsilon^2\Lambda}{\alpha^2}+ 475 | \dfrac{\Lambda}{2}\\ 476 | &\le \dfrac{\Lambda}{2} + \dfrac{\Lambda}{2} = \Lambda &&\because \alpha \ge \sqrt{2\Gamma}\epsilon 477 | \end{align*} 478 | \end{proof} 479 | 480 | We will combine this to provide a rough proof 481 | of~\eqref{eq:residual_close} using induction. 482 | 483 | \begin{proof}[Proof of Theorem~\ref{thm:meta-thm1}] We will prove this 484 | by induction. We will only provide a rough proof sketch to keep it 485 | simple and ignore the computations. We will assume that for all 486 | $0\le t\le \tau$ the induction hypothesis holds true, 487 | i.e. $\norm{\theta_0 - \theta_t}\le R$ with $R$ as in~\eqref{eq:theta_diam}, and that 488 | \eqref{eq:residual_close},~\eqref{eq:param_close},~\eqref{eq:final_param_close} 489 | and~\eqref{eq:final_residual} hold true. We will show that they all 490 | hold true for $t=\tau+1$. 491 | \begin{itemize} 492 | \item \textbf{Proving $\norm{\theta_0 - \theta_t}\le R$ for 493 | $t=\tau+1$:} 494 | We know by~\eqref{eq:final_param_close} that $\norm{\theta_0 - 495 | \theta_\tau}\le \frac{R}{2}$. We need to show that 496 | $\norm{\theta_\tau - \theta_{\tau+1}}\le\frac{R}{2}$.
497 | \begin{align*} 498 | \norm{\theta_\tau - \theta_{\tau+1}} &= 499 | \eta\norm{\nabla\cL\br{\theta_\tau}} 500 | = 501 | \eta\norm{\cJ^\top\br{\theta_\tau}\vec{r}_\tau}\\ 502 | &\le 503 | \eta\norm{\vec{J}^\top\widetilde{\vec{r}}_\tau} 504 | + 505 | \eta\norm{\br{\cJ\br{\theta_\tau} 506 | - 507 | \vec{J}}^\top}\norm{\widetilde{\vec{r}}_\tau} 508 | + 509 | \eta\norm{\cJ\br{\theta_\tau}}\norm{\widetilde{\vec{r}}_\tau 510 | - \vec{r}_\tau}\\ 511 | \end{align*} 512 | 513 | We can bound the first term (see Page 25) as \[ 514 | \eta\norm{\vec{J}^\top\widetilde{\vec{r}}_\tau}\le \norm{\vec{J}_\cI^\dagger 515 | \vec{r}_0}_2 + 516 | \frac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}}, \]the second 517 | term as 518 | \[ \eta\norm{\br{\cJ\br{\theta_\tau}- 519 | \vec{J}}^\top}\norm{\widetilde{\vec{r}}_\tau}\le 520 | \eta\br{\norm{\cJ\br{\theta_\tau}- \cJ\br{\theta_0}}+\norm{\overline{\cJ}\br{\theta_0}- 521 | \vec{J}}}\norm{\widetilde{\vec{r}}_0}\le\eta\br{\epsilon+\epsilon_0}\norm{\widetilde{\vec{r}}_0}\le\dfrac{2\delta\alpha}{5\beta^2}\norm{\widetilde{\vec{r}}_0},\] 522 | and the third term as~(using Eq.~\eqref{eq:residual_close}) 523 | 524 | \[\eta\norm{\cJ\br{\theta_\tau}}\norm{\widetilde{\vec{r}}_\tau 525 | - \vec{r}_\tau} \le 526 | \dfrac{3\delta\alpha}{5\beta^2}\norm{\widetilde{\vec{r}}_0}.\] 527 | 528 | Combining them we get 529 | \[ \norm{\theta_\tau - \theta_{\tau+1}} \le 530 | \norm{\vec{J}_\cI^\dagger 531 | \vec{r}_0}_2 + 532 | \frac{\Gamma}{\alpha}\norm{\Pi_\cN\br{\vec{r}_0}} + 533 | \dfrac{\delta\alpha}{\beta^2}\norm{\widetilde{\vec{r}}_0} \le 534 | \dfrac{R}{2} \] 535 | \item \textbf{Proving that $\norm{\vec{e}_{\tau+1}}\le \dfrac{3}{5}\dfrac{\delta\alpha}{\beta}\norm{\vec{r}_0}$:} 536 | We have shown that $\norm{\theta_t - \theta_0}\le R$, for 537 | $t\le \tau+1$. Then we can 538 | use Lemma~\ref{lem:pert-one-step} to say that for all 539 | $0< t\le \tau+1$ the following holds 540 | \[\norm{\vec{e}_t}\le \eta\br{\epsilon_0^2 + 541 | \epsilon\beta}\norm{\widetilde{\vec{r}}_{t-1}} + 542 | \br{1+\eta\epsilon^2}\norm{\vec{e}_{t-1}}_2\] 543 | 544 | We already know that the linear residuals satisfy the following 545 | for all $00$, consider an i.i.d. dataset 602 | $\bc{\br{\vec{x}_i,y_i}}\in\reals^d\times\reals^K$ where $\vec{x}_i$ 603 | are unit-length data points and $\vec{y}_i$s are one-hot encoded 604 | labels. 605 | 606 | Consider the neural network to be initialized with $\vec{W}_0\sim 607 | \cN\br{0,\vec{I}}$ and let $\vec{V}$ have properly scaled Rademacher 608 | entries.
609 | 610 | Consider the reference Jacobian, with information and nuisance 611 | subspaces split according to the cut-off $\alpha$, to be 612 | $\vec{J}=\Sigma\br{\vec{X}}^{\nicefrac{1}{2}}$ 613 | where \[\Sigma\br{\vec{X}} = 614 | \vec{I}_K\otimes\bE\bs{\br{\phi^\prime\br{\vec{X}\vec{w}}\phi^\prime\br{\vec{X}\vec{w}}^\top}\odot\br{\vec{X}\vec{X}^\top}}\] 615 | 616 | Assume the overparameterization to be \[k\ge\dfrac{\Gamma^4\log 617 | n}{\alpha^8}\] 618 | 619 | Then after $T=\dfrac{\Gamma}{\eta\alpha^2}$ iterations, the 620 | generalization error obeys 621 | \[\mathrm{Err}\br{\vec{W}_T}\le 622 | \dfrac{\norm{\Pi_\cN\br{\vec{y}}}_2}{\sqrt{n}} + e^{-\Gamma} +\dfrac{\Gamma}{\alpha\sqrt{n}}\] 623 | \end{thm} 624 | \begin{lem} 625 | For a neural network as defined above, where the activation function 626 | $\phi$ is such that $\abs{\phi^\prime\br{\vec{z}}}\le B$ and 627 | $\abs{\phi^{\prime\prime}\br{\vec{z}}}\le B$ for all $\vec{z}$, and $K$ 628 | is the number of classes, for all $\vec{W}\in\reals^{k\times 629 | d}$ we have \[\norm{\cJ\br{\vec{W}}}\le 630 | B\sqrt{Kk}\norm{\vec{V}}_\infty\norm{\vec{X}}\] 631 | and, if all data points have unit norm, i.e. $\norm{\vec{x}_i} = 1$, 632 | then the Jacobian is Lipschitz with respect to the spectral norm: for 633 | all $\vec{W},\widetilde{\vec{W}}\in\reals^{k\times d}$, 634 | \[\norm{\cJ\br{\vec{W}} - \cJ\br{\widetilde{\vec{W}}}}\le B\sqrt{K}\norm{\vec{V}}_\infty\norm{\vec{X}}\norm{\vec{W} - \widetilde{\vec{W}}}\] 635 | \end{lem} 636 | 637 | \begin{proof} 638 | Given two matrices $\vec{A} = 639 | \bs{\vec{A}_1^\top,\cdots,\vec{A}_K^\top}$ and $\vec{B} = 640 | \bs{\vec{B}_1^\top,\cdots,\vec{B}_K^\top}$, the following holds: 641 | \[\norm{\vec{A}}\le\sqrt{K}\sup_{\ell=1,..,K}\norm{\vec{A}_\ell}\text{ 642 | \enskip and\enskip}\norm{\vec{A} - 643 | \vec{B}}\le\sqrt{K}\sup_{\ell=1,..,K}\norm{\vec{A}_\ell-\vec{B}_\ell}\] 644 | 645 | We will first show that for a single-output neural network, i.e. for 646 | $K=1$, we have $\norm{\cJ\br{\vec{W}}}\le 647 | B\sqrt{k}\norm{\vec{V}}_\infty\norm{\vec{X}}$: 648 | 649 | \begin{align*} 650 | \cJ\br{\vec{W}}\cJ^\top\br{\vec{W}} &= 651 | \br{\phi^\prime\br{\vec{X}\vec{W}^\top}\diag{\vec{v}}\diag{\vec{v}}\phi^\prime\br{\vec{W}\vec{X}^\top}}\odot\br{\vec{X}\vec{X}^\top}\\ 652 | \norm{ \cJ\br{\vec{W}}}^2 &\le 653 | \br{\max_i\norm{\diag{\vec{v}}\phi^\prime\br{\vec{W}\vec{x}_i}}^2}\norm{\vec{X}}_2^2\\ 654 | &\le kB^2\norm{\vec{v}}_\infty^2\norm{\vec{X}}_2^2 655 | \end{align*} 656 | Thus for multi-output neural networks, we have that 657 | \[\norm{ \cJ\br{\vec{W}}}\le 658 | B\sqrt{Kk}\norm{\vec{V}}_\infty\norm{\vec{X}}_2 \] 659 | 660 | We omit the proof of Lipschitzness but will cover it if time permits. 661 | \end{proof} 662 | With this, we can apply the meta-theorem directly to multi-output 663 | neural networks by taking (the square root of) the NTK to be the reference Jacobian. One 664 | can prove that it satisfies the conditions required of a reference 665 | Jacobian, but we omit the proof for simplicity and might discuss 666 | it if time permits. 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | \section{Experiments} 675 | 676 | The authors use very recent methods to approximate the spectrum of $\mathcal{J}(\theta_\tau)\mathcal{J}^\top(\theta_\tau)$. They perform experiments with ResNet20 on CIFAR-10 and MNIST. 677 | 678 | \begin{itemize} 679 | \item The values of the top eigenvalues increase significantly when comparing the Jacobian at initialization with the Jacobian after training.
In general, it is observed that the Jacobian is approximately low-rank, in the sense that it has a small set of large eigenvalues while the rest are fairly small. This fits naturally with their theory, making it possible to set a good cutoff for their bounds. 680 | \item They plot the norm of the projection of the residual onto the information and the nuisance spaces and observe that, as predicted by the theory, the projection onto the information space decreases much more rapidly than the other one. Note that if one trains with some corrupted labels, and the residual corresponding to the noisy labels falls mostly into the nuisance space while, analogously, the residual of the uncorrupted data falls mostly into the information space, then the neural network fits the data that conveys information much faster; this has implications for generalization under early stopping. 681 | \item They measure the norm of the projection of the labels and of the residual at initialization, but using the information and nuisance spaces of two Jacobians: the one given by the initialization and the one given by the trained network. For both the labels and the initial residual, the majority of the projection lies in the nuisance space for the Jacobian at initialization, but for the trained Jacobian the converse happens: the projected norm onto the information space is significantly bigger than the projection onto the nuisance space. The authors argue that this adaptation would in principle suggest better generalization according to their theory (possibly using the arbitrary initialization theorem and initializing to the value of the Jacobian a few iterations before stopping). It would also suggest that this adaptation of the Jacobian speeds up training. They also performed experiments with corrupted labels and saw that the projection onto the information space after training is not that large in that case. Moreover, the normalized projection of the labels onto the nuisance space correlates with the test error.
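As a rough illustration of the quantity being measured here (my own toy sketch with arbitrary sizes, not the authors' code, which relies on approximate spectral methods for large Jacobians): given a Jacobian matrix and a cut-off $\alpha$, the information/nuisance projections of a vector such as the labels or a residual can be read off from the left singular vectors.
\begin{lstlisting}[language=Python]
# Illustrative sketch: split a vector into its projections onto the information
# space (left singular directions of J with singular value >= alpha) and the
# nuisance space (the orthogonal complement), and report the two norms.
import numpy as np

def info_nuisance_norms(J, y, alpha):
    U, s, _ = np.linalg.svd(J, full_matrices=True)   # J = U diag(s) V^T
    info_dim = int(np.sum(s >= alpha))               # number of "large" directions
    coeffs = U.T @ y                                 # coordinates of y in basis U
    return np.linalg.norm(coeffs[:info_dim]), np.linalg.norm(coeffs[info_dim:])

rng = np.random.default_rng(2)
J = rng.normal(size=(10, 40))                        # stand-in for a Jacobian
y = rng.normal(size=10)                              # stand-in for labels/residual
print(info_nuisance_norms(J, y, alpha=5.0))
\end{lstlisting}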
682 | \end{itemize} 683 | 684 | \nocite{*} % Include refs not cited 685 | \bibliography{refs} %use a bibtex bibliography file refs.bib 686 | \bibliographystyle{plain} %use the plain bibliography style 687 | 688 | \end{document} 689 | 690 | 691 | %%% Local Variables: 692 | %%% mode: latex 693 | %%% TeX-master: t 694 | %%% End: 695 | -------------------------------------------------------------------------------- /notes/low_rank_jac/refs.bib: -------------------------------------------------------------------------------- 1 | @article{du2018gradient, 2 | title={Gradient descent finds global minima of deep neural networks}, 3 | author={Du, Simon S and Lee, Jason D and Li, Haochuan and Wang, Liwei and Zhai, Xiyu}, 4 | journal={arXiv preprint arXiv:1811.03804}, 5 | year={2018} 6 | } 7 | 8 | @article{allen2018convergence, 9 | title={A convergence theory for deep learning via over-parameterization}, 10 | author={Allen-Zhu, Zeyuan and Li, Yuanzhi and Song, Zhao}, 11 | journal={arXiv preprint arXiv:1811.03962}, 12 | year={2018} 13 | } 14 | -------------------------------------------------------------------------------- /notes/low_rank_jac_thm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/notes/low_rank_jac_thm.pdf -------------------------------------------------------------------------------- /papers/1805.00915.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1805.00915.pdf -------------------------------------------------------------------------------- /papers/1806.07572.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1806.07572.pdf -------------------------------------------------------------------------------- /papers/1808.09372.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1808.09372.pdf -------------------------------------------------------------------------------- /papers/1810.02054.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1810.02054.pdf -------------------------------------------------------------------------------- /papers/1810.09665.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1810.09665.pdf -------------------------------------------------------------------------------- /papers/1810.12065.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1810.12065.pdf -------------------------------------------------------------------------------- /papers/1811.03804.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.03804.pdf -------------------------------------------------------------------------------- /papers/1811.03962.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.03962.pdf -------------------------------------------------------------------------------- /papers/1811.04918.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.04918.pdf -------------------------------------------------------------------------------- /papers/1811.08888.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1811.08888.pdf -------------------------------------------------------------------------------- /papers/1812.07956.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1812.07956.pdf -------------------------------------------------------------------------------- /papers/1812.10004.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1812.10004.pdf -------------------------------------------------------------------------------- /papers/1901.08572.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1901.08572.pdf -------------------------------------------------------------------------------- /papers/1901.08584.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1901.08584.pdf -------------------------------------------------------------------------------- /papers/1902.01384.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1902.01384.pdf -------------------------------------------------------------------------------- /papers/1902.04760.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1902.04760.pdf -------------------------------------------------------------------------------- /papers/1902.06720.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1902.06720.pdf -------------------------------------------------------------------------------- /papers/1904.11955.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1904.11955.pdf -------------------------------------------------------------------------------- /papers/1905.03684.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.03684.pdf -------------------------------------------------------------------------------- /papers/1905.05095.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.05095.pdf -------------------------------------------------------------------------------- /papers/1905.10337.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.10337.pdf -------------------------------------------------------------------------------- /papers/1905.10843.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.10843.pdf -------------------------------------------------------------------------------- /papers/1905.12173.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.12173.pdf -------------------------------------------------------------------------------- /papers/1905.13210.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.13210.pdf -------------------------------------------------------------------------------- /papers/1905.13654.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1905.13654.pdf -------------------------------------------------------------------------------- /papers/1906.01930.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.01930.pdf -------------------------------------------------------------------------------- /papers/1906.05392.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.05392.pdf -------------------------------------------------------------------------------- /papers/1906.05827.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.05827.pdf -------------------------------------------------------------------------------- /papers/1906.06247.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.06247.pdf -------------------------------------------------------------------------------- /papers/1906.06321.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.06321.pdf -------------------------------------------------------------------------------- /papers/1906.08034.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1906.08034.pdf -------------------------------------------------------------------------------- /papers/1911.00809.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/damaru2/ntk/fe13836a25f4ccd9dc8053e6f247f39628ba09b9/papers/1911.00809.pdf -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | This is a list of papers that use the Neural Tangent Kernel (NTK). In each category, papers are sorted chronologically. Some of these papers were presented in the NTK reading group during the summer of 2019 at the University of Oxford. 2 | 3 | We used [hypothes.is](https://web.hypothes.is/) to some extent, see [this](https://via.hypothes.is/https://arxiv.org/pdf/1806.07572.pdf) for instance. There are notes for a few of the papers, which you can find linked below the relevant papers. 4 | 5 | ## Schedule 6 | + 2/08/2019 [[notes](./notes/Neural_Tangent_kernels___Jacot_et_al.pdf)] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. 7 | + 9/08/2019 [[notes](./notes/du_et_al.pdf)] Gradient Descent Finds Global Minima of Deep Neural Networks. 8 | + 16/08/2019 Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks + insights from Gradient Descent Provably Optimizes Over-parameterized Neural Networks. 9 | + 23/08/2019 On Lazy Training in Differentiable Programming 10 | + 13/09/2019 Generalization bounds of stochastic gradient descent for wide and deep networks 11 | + 18/10/2019 [[notes](./notes/low_rank_jac_thm.pdf)] Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian 12 | 13 | # Neural tangent kernel 14 | 15 | [https://www.youtube.com/watch?v=NGon2JyjO6Y]: # 16 | + [Recent Developments in Over-parametrized Neural Networks, Part II](https://www.youtube.com/watch?v=NGon2JyjO6Y) 17 | + Interesting, nice overview of a few things, mostly related to optimization and the NTK. 18 | + YouTube, Simons Institute workshop. 19 | + Part I is interesting, but take into account that it is about other optimization topics for NNs, not about the NTK. 20 | 21 | ## Optimization 22 | 23 | ### Infinite limit 24 | 25 | [https://arxiv.org/pdf/1806.07572.pdf ]: # 26 | + [Neural Tangent Kernel: Convergence and Generalization in Neural Networks ](./papers/1806.07572.pdf) -- [link](https://arxiv.org/pdf/1806.07572.pdf) 27 | + [Notes](./notes/Neural_Tangent_kernels___Jacot_et_al.pdf) 28 | + 06/2018 29 | + Original NTK paper. 30 | + Exposes the idea of the NTK for the first time, although the proof that the kernel in the limit is deterministic is done by taking the number of neurons of each layer to infinity, layer by layer, sequentially. 31 | + It proves positive definiteness of the kernel for certain regimes, thus proving that you can optimize to reach a global minimum at a linear rate. 32 | 33 | [https://arxiv.org/pdf/1902.06720.pdf]: # 34 | + [Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent](./papers/1902.06720.pdf) -- [link](https://arxiv.org/pdf/1902.06720.pdf) 35 | + 02/2019 36 | + They apparently prove that a finite learning rate is enough for the model to follow NTK dynamics in the infinite-width limit. 37 | + Experiments 38 | 39 | 40 | [https://arxiv.org/pdf/1904.11955.pdf]: # 41 | + [On Exact Computation with an Infinitely Wide Neural Net](./papers/1904.11955.pdf) -- [link](https://arxiv.org/pdf/1904.11955.pdf) 42 | + 04/2019 43 | + Shows that NTKs work somewhat worse than NNs, but not as much worse as previous work suggested.
+ Claims to show a proof that sounds similar to those of Allen-Zhu, Du etc., but not sure what the difference is. 45 | 46 | 47 | ### Finite results 48 | 49 | [https://arxiv.org/abs/1810.02054]: # 50 | + [Gradient Descent Provably Optimizes Over-parameterized Neural Networks](./papers/1810.02054.pdf) -- [link](https://arxiv.org/abs/1810.02054) 51 | + 04/10/2018 52 | + A preliminary result of Gradient Descent Finds Global Minima of Deep Neural Networks (below), but only for two-layer neural networks. 53 | 54 | [https://arxiv.org/abs/1810.12065]: # 55 | + [On the Convergence Rate of Training Recurrent Neural Networks](./papers/1810.12065.pdf) -- [link](https://arxiv.org/abs/1810.12065) 56 | + 29/10/2018 57 | + See below 58 | 59 | [https://arxiv.org/pdf/1811.03962.pdf]: # 60 | + [A Convergence Theory for Deep Learning via Over-Parameterization](./papers/1811.03962.pdf) -- [link](https://arxiv.org/pdf/1811.03962.pdf) 61 | + 9/11/2018 62 | + Simplification of [On the Convergence Rate of Training Recurrent Neural Networks](./papers/1810.12065.pdf). 63 | + Convergence to global optima whp for GD and SGD. 64 | + Works for \ell_2, cross-entropy and other losses. 65 | + Works for fully connected networks, ResNets, ConvNets (and RNNs, in the paper above). 66 | 67 | 68 | [https://arxiv.org/pdf/1811.03804.pdf]: # 69 | + [Gradient Descent Finds Global Minima of Deep Neural Networks](./papers/1811.03804.pdf) -- [link](https://arxiv.org/pdf/1811.03804.pdf) 70 | + [Notes](./notes/du_et_al.pdf) 71 | + 9/11/2018 72 | + Du et al. 73 | + Convergence to global optima whp for GD for \ell_2. 74 | + Exponential width w.r.t. depth needed for fully connected networks. Polynomial for ResNets. 75 | 76 | [https://arxiv.org/pdf/1901.08572.pdf]: # 77 | + [Width Provably Matters in Optimization for Deep Linear Neural Networks](./papers/1901.08572.pdf) -- [link](https://arxiv.org/pdf/1901.08572.pdf) 78 | + 01/2019 79 | + Du et al. 80 | + Deep linear neural networks. 81 | + Convergence to global minima if low polynomial width is assumed. 82 | 83 | [https://arxiv.org/pdf/1811.08888.pdf]: # 84 | + [Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks](./papers/1811.08888.pdf) -- [link](https://arxiv.org/pdf/1811.08888.pdf) 85 | + 21/11/2018 86 | 87 | [https://arxiv.org/pdf/1812.10004.pdf]: # 88 | + [Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?](./papers/1812.10004.pdf) -- [link](https://arxiv.org/pdf/1812.10004.pdf) 89 | + 25/11/2018 90 | + Results for one-hidden-layer NNs, generalized linear models and low-rank matrix regression. 91 | 92 | [https://arxiv.org/abs/1905.13654.pdf]: # 93 | + [Training Dynamics of Deep Networks using Stochastic Gradient Descent via Neural Tangent Kernel](./papers/1905.13654.pdf) -- [link](https://arxiv.org/abs/1905.13654.pdf) 94 | + 06/2019 95 | + SGD analyzed from the point of view of stochastic differential equations. 96 | 97 | 98 | ### Lazy training 99 | 100 | [https://arxiv.org/pdf/1812.07956.pdf ]: # 101 | + [On Lazy Training in Differentiable Programming](./papers/1812.07956.pdf) -- [link](https://arxiv.org/pdf/1812.07956.pdf) 102 | + 12/2018 103 | + They show that the NTK regime can be controlled by rescaling the model, and show (experimentally) that neural nets in practice perform better than those in the lazy regime. 104 | + Also, this seems to be independent of width. So scaling the model is a much easier way to get to lazy training than the infinite width + infinitesimal learning rate route?
105 | 106 | [https://arxiv.org/pdf/1906.08034.pdf]: # 107 | + [Disentangling feature and lazy learning in deep neural networks: an empirical study](./papers/1906.08034.pdf) -- [link](https://arxiv.org/pdf/1906.08034.pdf) 108 | + 06/2019 109 | + Similar to the above (Chizat et al.), but more experimental. 110 | 111 | [https://arxiv.org/pdf/1906.05827.pdf]: # 112 | + [Kernel and deep regimes in overparametrized models](./papers/1906.05827.pdf) -- [link](https://arxiv.org/pdf/1906.05827.pdf) 113 | + 06/2019 114 | + Large initialization leads to the kernel/lazy regime. 115 | + Small initialization leads to the deep/active/adaptive regime, which can sometimes lead to better generalization. They claim this is the regime that allows one to "exploit the power of depth", and thus is key to understanding deep learning. 116 | + The systems they analyze in detail are rather simple (like matrix completion) or artificial (like a very ad-hoc type of neural network). 117 | 118 | ## Generalization 119 | 120 | [https://arxiv.org/pdf/1811.04918.pdf]: # 121 | + [Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers](./papers/1811.04918.pdf) -- [link](https://arxiv.org/pdf/1811.04918.pdf) 122 | + 11/2018 123 | + The theorems are not based on NTKs, but it has experiments showing that generalization for 3-layer NNs is better than for their corresponding NTK. 124 | 125 | [https://arxiv.org/pdf/1901.08584.pdf]: # 126 | + [Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks](./papers/1901.08584.pdf) -- [link](https://arxiv.org/pdf/1901.08584.pdf) 127 | + 01/2019 128 | + Arora et al. 129 | + "Our work is related to kernel methods, especially recent discoveries of the connection between deep 130 | learning and kernels (Jacot et al., 2018; Chizat & Bach, 2018b;...) Our analysis utilized several properties of a related kernel from the ReLU activation." 131 | 132 | [https://arxiv.org/pdf/1902.01384.pdf]: # 133 | + [Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks](./papers/1902.01384.pdf) -- [link](https://arxiv.org/pdf/1902.01384.pdf) 134 | + 02/2019 135 | + See below 136 | 137 | [https://arxiv.org/pdf/1905.13210.pdf]: # 138 | + [Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks](./papers/1905.13210.pdf) -- [link](https://arxiv.org/pdf/1905.13210.pdf) 139 | + 05/2019 140 | + Seems very similar to the one above. What are the differences? Just that this is SGD vs GD in the above paper? 141 | + Improves on the Arora 2019 paper showing generalization bounds for the NTK. 142 | + I’d be interested in understanding the connection of their bound to classical margin and PAC-Bayes bounds for kernel regression. 143 | + They don’t show any plots demonstrating how good their bounds are, which probably means they are vacuous though... 144 | 145 | 146 | [https://arxiv.org/pdf/1905.10337.pdf]: # 147 | + [What Can ResNet Learn Efficiently, Going Beyond Kernels?](./papers/1905.10337.pdf) -- [link](https://arxiv.org/pdf/1905.10337.pdf) 148 | + 05/2019 149 | + Shows in the PAC setting that there are ("simple") functions that ResNets learn efficiently such that any kernel gets much greater test error for the same sample complexity; in particular, this applies to NTKs too. 150 | 151 | [https://arxiv.org/pdf/1905.10843.pdf]: # 152 | + [Asymptotic learning curves of kernel methods: empirical data v.s.
Teacher-Student paradigm](./papers/1905.10843.pdf) -- [link](https://arxiv.org/pdf/1905.10843.pdf) 153 | + 05/2019 154 | + I think that getting learning curves for neural nets is a very interesting challenge. 155 | + Here they do it for kernels, but if the NN behaves like a kernel, it would be relevant. 156 | 157 | [https://arxiv.org/pdf/1906.05392.pdf]: # 158 | + [Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian](./papers/1906.05392.pdf) -- [link](https://arxiv.org/pdf/1906.05392.pdf) 159 | + 06/2019 160 | + [Notes](./notes/low_rank_jac_thm.pdf) 161 | + Mainly uses the NTK and splits the eigenspace into two (based on a cutoff value of the eigenvalues). The projection of the residuals onto the top eigenspace trains very fast, while the rest might not train at all and the loss could even increase. There is a trade-off based on the cutoff value. 162 | + Two layers. 163 | + \ell\_2 loss. 164 | 165 | ## Others 166 | 167 | [https://arxiv.org/pdf/1902.04760.pdf]: # 168 | + [Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation](./papers/1902.04760.pdf) -- [link](https://arxiv.org/pdf/1902.04760.pdf) 169 | + 02/2019 170 | + Although this paper is really cool in that it shows that most kinds of neural networks become GPs when infinitely wide, w.r.t. the NTK it just shows a proof where the layer widths can go to infinity at the same time, and generalizes it to more architectures, so it doesn’t necessarily feel like much new insight? 171 | 172 | [https://arxiv.org/pdf/1905.12173.pdf]: # 173 | + [On the Inductive Bias of Neural Tangent Kernels](./papers/1905.12173.pdf) -- [link](https://arxiv.org/pdf/1905.12173.pdf) 174 | + 05/2019 175 | + This is just about properties of the NTK (so not studying NNs directly). 176 | + They find that the NTK model has a different type of stability to deformations of the input than other NNGPs, and better approximation properties (whatever that means). 177 | 178 | [https://arxiv.org/pdf/1906.01930.pdf]: # 179 | + [Approximate Inference Turns Deep Networks into Gaussian Processes](./papers/1906.01930.pdf) -- [link](https://arxiv.org/pdf/1906.01930.pdf) 180 | + 06/2019 181 | + Shows Bayesian NNs (of any width) are equivalent to GPs, surprisingly with kernel given by the NTK. 182 | 183 | # ToClassify 184 | 185 | [https://arxiv.org/pdf/1905.05095.pdf]: # 186 | + [Spectral Analysis of Kernel and Neural Embeddings: Optimization and Generalization](./papers/1905.05095.pdf) -- [link](https://arxiv.org/pdf/1905.05095.pdf) 187 | + 05/2019 188 | + They just study what happens when you use a neural network or a kernel representation for data (fed as input to an NN, I guess). 189 | 190 | [https://arxiv.org/pdf/1808.09372.pdf]: # 191 | + [Mean Field Analysis of Neural Networks: A Central Limit Theorem](./papers/1808.09372.pdf) -- [link](https://arxiv.org/pdf/1808.09372.pdf) 192 | + 08/2018 193 | + They only look at one hidden layer and squared-error loss, so I’m not convinced of the novelty of the results? 194 | 195 | [https://arxiv.org/pdf/1906.06321.pdf]: # 196 | + [Provably Efficient $Q$-learning with Function Approximation via Distribution Shift Error Checking Oracle](./papers/1906.06321.pdf) -- [link](https://arxiv.org/pdf/1906.06321.pdf) 197 | + 06/2019 198 | + Not about the NTK, but the authors suggest it could be extended to use the NTK to analyze NN-based function approximation.
199 | 200 | [https://arxiv.org/pdf/1911.00809.pdf]: # 201 | 202 | + [Enhanced Convolutional Neural Tangent Kernels](./papers/1911.00809.pdf) -- [link](https://arxiv.org/pdf/1911.00809.pdf) 203 | + 11/2019 204 | + Enhances the NTK for convolutional networks of "On Exact Computation..." by adding some implicit data augmentation to the kernel that encodes some kind of local translation invariance and horizontal flipping. 205 | + They have experiments that show good empirical performance; in particular, they get 89% accuracy on CIFAR-10, matching AlexNet. This is the first time a kernel achieves such results. 206 | 207 | 208 | # Some notes 209 | 210 | + The NTK depends on the initialization. 211 | 212 | --------------------------------------------------------------------------------