├── DoubleDescentTutorial.ipynb ├── DoubleDescentTutorialPart2.ipynb ├── InterpolationWithNoise.ipynb ├── NTKAnalysis.ipynb ├── README.md ├── dataloader.py ├── dl_environment.yml ├── eigenpro.py ├── kernel.py ├── svd.py └── utils.py /DoubleDescentTutorialPart2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b234b088", 6 | "metadata": {}, 7 | "source": [ 8 | "# Double Descent with 1 Hidden Layer Neural Networks \n", 9 | "\n", 10 | "In this notebook, we demonstrate the double descent phenomenon when training 1 hidden layer neural networks. Again, given a dataset $\\{(x^{(i)}, y^{(i)})\\}_{i=1}^{n} \\subset \\mathbb{R}^{d} \\times \\mathbb{R}$, we wish to learn a map from $x^{(i)}$ to $y^{(i)}$. To learn such a map, we use the following 1 hidden layer nonlinear network: \n", 11 | "\\begin{align*}\n", 12 | " f(\\mathbf{W} ; x) = a \\frac{\\sqrt{c}}{\\sqrt{k}} \\phi(B x) ~~;\n", 13 | "\\end{align*}\n", 14 | "where $a \\in \\mathbb{R}^{1 \\times k}$, $B \\in \\mathbb{R}^{k \\times d}$, $x \\in \\mathbb{R}^{d}$, $c \\in \\mathbb{R}$ is a fixed constant, $\\phi$ is an elementwise nonlinearity, and $\\mathbf{W}$ is a vectorized version of all entries of $a, B$ (i.e. $\\mathbf{W} \\in \\mathbb{R}^{k + dk}$). We will also assume that $\\phi$ is a real-valued function (as is the case in many models in practice). \n", 15 | "\n", 16 | "\n", 17 | "We will assume that the parameters are initialized as $\\mathbf{W}_i \\overset{i.i.d.}{\\sim} \\mathcal{N}(0, 1)$. We then use gradient descent to minimize the following loss: \n", 18 | "\\begin{align}\n", 19 | " \\mathcal{L}(\\mathbf{W}) = \\sum_{i=1}^{n} ( y^{(i)} - f(\\mathbf{W} ; x^{(i)}))^2 ~~.\n", 20 | "\\end{align}\n", 21 | "\n", 22 | "We will now show that double descent occurs as the number of hidden units $k$ increases. \n", 23 | "\n", 24 | "**Note:** The following code will make use of the GPU (it can still run without the GPU, but will take a bit longer). 
" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "id": "78ca0cc1", 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "name": "stderr", 35 | "output_type": "stream", 36 | "text": [ 37 | "/home/aradha/anaconda3/envs/dl_tutorial/lib/python3.7/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/torch/csrc/utils/tensor_numpy.cpp:180.)\n", 38 | " return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)\n" 39 | ] 40 | }, 41 | { 42 | "name": "stdout", 43 | "output_type": "stream", 44 | "text": [ 45 | "Train Set: torch.Size([4000, 784])\n", 46 | "Train Labels: torch.Size([4000, 10])\n", 47 | "Test Set: torch.Size([10000, 784])\n", 48 | "Test Labels: torch.Size([10000, 10])\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "# We will use a subset of MNIST for demonstrating double descent \n", 54 | "import torch\n", 55 | "from torchvision import datasets, transforms\n", 56 | "\n", 57 | "\n", 58 | "train_set = datasets.MNIST('./data', train=True, download=True)\n", 59 | "test_set = datasets.MNIST('./data', train=False, download=True)\n", 60 | "\n", 61 | "# Loading/Normalizing training & test images\n", 62 | "train_imgs, train_labels = train_set.data / 256, train_set.targets\n", 63 | "test_imgs, test_labels = test_set.data / 256, test_set.targets\n", 64 | "\n", 65 | "classes = {}\n", 66 | "max_per_class = 400\n", 67 | "max_labels = 10\n", 68 | "\n", 69 | "for idx, label in enumerate(train_labels): \n", 70 | " label = label.data.numpy().item()\n", 71 | " if label in classes and len(classes[label]) 
< max_per_class: \n", 72 | " classes[label].append(train_imgs[idx])\n", 73 | " elif label not in classes: \n", 74 | " classes[label] = [train_imgs[idx]]\n", 75 | " \n", 76 | " if len(classes) >= max_labels:\n", 77 | " early_exit = True\n", 78 | " for label in classes: \n", 79 | " early_exit &= len(classes[label]) >= max_per_class\n", 80 | " if early_exit: \n", 81 | " break\n", 82 | "\n", 83 | "all_train_examples = []\n", 84 | "all_train_labels = []\n", 85 | "for label in classes:\n", 86 | " label_vec = torch.zeros(max_labels)\n", 87 | " label_vec[label] = 1.\n", 88 | " all_train_examples.extend(classes[label])\n", 89 | " all_train_labels.extend([label_vec]*len(classes[label]))\n", 90 | " \n", 91 | "all_test_labels = [] \n", 92 | "for label in test_labels: \n", 93 | " label = label.data.numpy().item()\n", 94 | " label_vec = torch.zeros(max_labels)\n", 95 | " label_vec[label] = 1.\n", 96 | " all_test_labels.append(label_vec)\n", 97 | " \n", 98 | " \n", 99 | "train_set = torch.stack(all_train_examples, dim=0).view(max_labels * max_per_class, -1)\n", 100 | "train_set = train_set / torch.norm(train_set, p=2, dim=1).view(-1, 1)\n", 101 | "train_labels = torch.stack(all_train_labels, dim=0)\n", 102 | "\n", 103 | "test_set = test_imgs.view(-1, 28*28)\n", 104 | "test_set = test_set / torch.norm(test_set, p=2, dim=1).view(-1, 1) \n", 105 | "test_labels = torch.stack(all_test_labels, dim=0)\n", 106 | "\n", 107 | "print(\"Train Set: \", train_set.shape)\n", 108 | "print(\"Train Labels: \", train_labels.shape)\n", 109 | "print(\"Test Set: \", test_set.shape)\n", 110 | "print(\"Test Labels: \", test_labels.shape)" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "63062efd", 116 | "metadata": {}, 117 | "source": [ 118 | "## Neural Network for MNIST Classification\n", 119 | "\n", 120 | "Below we provide code for constructing a 1 hidden layer network of width $k$ in PyTorch. 
We will consider networks with a bias term in the hidden layer, just as in the previous notebook. " 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 3, 126 | "id": "714d74e6", 127 | "metadata": { 128 | "collapsed": true 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "## We now need to define and train a neural network to map x^{(i)} to y^{(i)}\n", 133 | "import torch\n", 134 | "import torch.nn as nn\n", 135 | "import torch.nn.functional as F\n", 136 | "\n", 137 | "# Abstraction for nonlinearity \n", 138 | "class Nonlinearity(torch.nn.Module):\n", 139 | " \n", 140 | " def __init__(self):\n", 141 | " super(Nonlinearity, self).__init__()\n", 142 | "\n", 143 | " def forward(self, x):\n", 144 | " #return F.leaky_relu(x)\n", 145 | " return F.relu(x)\n", 146 | " \n", 147 | "class Net(nn.Module):\n", 148 | "\n", 149 | " def __init__(self, width):\n", 150 | " super(Net, self).__init__()\n", 151 | "\n", 152 | " self.k = width\n", 153 | " self.first = nn.Sequential(nn.Linear(784, self.k, bias=True), \n", 154 | " Nonlinearity())\n", 155 | " self.sec = nn.Linear(self.k, 10, bias=False)\n", 156 | "\n", 157 | " def forward(self, x):\n", 158 | " #C = np.sqrt(2/(.01**2 + 1)) * 1/np.sqrt(self.k)\n", 159 | " C = (2 / self.k) ** 0.5 # sqrt(c / k) with c = 2; no numpy import needed in this cell\n", 160 | " o = self.first(x) * C\n", 161 | " return self.sec(o)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "id": "09112886", 167 | "metadata": {}, 168 | "source": [ 169 | "### Training a neural network with gradient descent\n", 170 | "\n", 171 | "Below, we provide code to train neural networks of varying width to classify $4000$ MNIST digits using gradient descent. We chose to run gradient descent for $10^5$ epochs to drive the training loss as low, and the training accuracy as high, as possible. In practice, we would simply stop training early once the validation accuracy stops improving. The code below takes too much time to run in the tutorial, but you are encouraged to try it out offline. 
" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 40, 177 | "id": "d1135c7e", 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "name": "stdout", 182 | "output_type": "stream", 183 | "text": [ 184 | "Number of Parameters: 12720\n" 185 | ] 186 | }, 187 | { 188 | "data": { 189 | "application/vnd.jupyter.widget-view+json": { 190 | "model_id": "", 191 | "version_major": 2, 192 | "version_minor": 0 193 | }, 194 | "text/plain": [ 195 | " 0%| | 0/100000 [00:00" 662 | ] 663 | }, 664 | "execution_count": 46, 665 | "metadata": {}, 666 | "output_type": "execute_result" 667 | }, 668 | { 669 | "data": { 670 | "image/png": "\n", 671 | "text/plain": [ 672 | "
" 673 | ] 674 | }, 675 | "metadata": { 676 | "needs_background": "light" 677 | }, 678 | "output_type": "display_data" 679 | } 680 | ], 681 | "source": [ 682 | "import matplotlib.pyplot as plt\n", 683 | "%matplotlib inline\n", 684 | "\n", 685 | "plt.plot(num_params, [1 - acc for acc in test_accs], 'bo-')\n", 686 | "plt.plot(num_params, [1-inf_test_acc]*len(widths), 'k--', label='Infinite Width')\n", 687 | "plt.axvline(x=40000, color='r', linestyle='--')\n", 688 | "plt.xscale(\"log\")\n", 689 | "plt.xlabel(\"Number of Parameters (Log)\")\n", 690 | "plt.ylabel(\"Test Error (1 - Acc.)\")\n", 691 | "plt.title(\"Double Descent with ReLU Network on MNIST\")\n", 692 | "plt.legend()" 693 | ] 694 | } 695 | ], 696 | "metadata": { 697 | "kernelspec": { 698 | "display_name": "dl_tutorial", 699 | "language": "python", 700 | "name": "dl_tutorial" 701 | }, 702 | "language_info": { 703 | "codemirror_mode": { 704 | "name": "ipython", 705 | "version": 3 706 | }, 707 | "file_extension": ".py", 708 | "mimetype": "text/x-python", 709 | "name": "python", 710 | "nbconvert_exporter": "python", 711 | "pygments_lexer": "ipython3", 712 | "version": "3.7.10" 713 | } 714 | }, 715 | "nbformat": 4, 716 | "nbformat_minor": 5 717 | } 718 | -------------------------------------------------------------------------------- /InterpolationWithNoise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Interpolation in the Presence of Noisy Data\n", 8 | "\n", 9 | "In this notebook, we provide a simple example demonstrating the effectiveness of interpolating models (in this case models that achieve 100% training accuracy in classification) even in the presence of incorrectly labelled data. In particular, we use the Laplace kernel for image classification on MNIST. We load the relevant subset of MNIST data below. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 12, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Train Set: torch.Size([4000, 784])\n", 22 | "Train Labels: torch.Size([4000, 10])\n", 23 | "Test Set: torch.Size([10000, 784])\n", 24 | "Test Labels: torch.Size([10000, 10])\n" 25 | ] 26 | } 27 | ], 28 | "source": [ 29 | "# We will use a subset of MNIST for demonstrating double descent \n", 30 | "import torch\n", 31 | "from torchvision import datasets, transforms\n", 32 | "\n", 33 | "\n", 34 | "train_set = datasets.MNIST('./data', train=True, download=True)\n", 35 | "test_set = datasets.MNIST('./data', train=False, download=True)\n", 36 | "\n", 37 | "# Loading/Normalizing training & test images\n", 38 | "train_imgs, train_labels = train_set.data / 256, train_set.targets\n", 39 | "test_imgs, test_labels = test_set.data / 256, test_set.targets\n", 40 | "\n", 41 | "classes = {}\n", 42 | "max_per_class = 400\n", 43 | "max_labels = 10\n", 44 | "\n", 45 | "for idx, label in enumerate(train_labels): \n", 46 | " label = label.data.numpy().item()\n", 47 | " if label in classes and len(classes[label]) < max_per_class: \n", 48 | " classes[label].append(train_imgs[idx])\n", 49 | " elif label not in classes: \n", 50 | " classes[label] = [train_imgs[idx]]\n", 51 | " \n", 52 | " if len(classes) >= max_labels:\n", 53 | " early_exit = True\n", 54 | " for label in classes: \n", 55 | " early_exit &= len(classes[label]) >= max_per_class\n", 56 | " if early_exit: \n", 57 | " break\n", 58 | "\n", 59 | "all_train_examples = []\n", 60 | "all_train_labels = []\n", 61 | "for label in classes:\n", 62 | " label_vec = torch.zeros(max_labels)\n", 63 | " label_vec[label] = 1.\n", 64 | " all_train_examples.extend(classes[label])\n", 65 | " all_train_labels.extend([label_vec]*len(classes[label]))\n", 66 | " \n", 67 | "all_test_labels = [] \n", 68 | "for label in test_labels: \n", 69 | " label = 
label.data.numpy().item()\n", 70 | " label_vec = torch.zeros(max_labels)\n", 71 | " label_vec[label] = 1.\n", 72 | " all_test_labels.append(label_vec)\n", 73 | " \n", 74 | " \n", 75 | "train_set = torch.stack(all_train_examples, dim=0).view(max_labels * max_per_class, -1)\n", 76 | "train_set = train_set / torch.norm(train_set, p=2, dim=1).view(-1, 1)\n", 77 | "train_labels = torch.stack(all_train_labels, dim=0)\n", 78 | "\n", 79 | "test_set = test_imgs.view(-1, 28*28)\n", 80 | "test_set = test_set / torch.norm(test_set, p=2, dim=1).view(-1, 1) \n", 81 | "test_labels = torch.stack(all_test_labels, dim=0)\n", 82 | "\n", 83 | "print(\"Train Set: \", train_set.shape)\n", 84 | "print(\"Train Labels: \", train_labels.shape)\n", 85 | "print(\"Test Set: \", test_set.shape)\n", 86 | "print(\"Test Labels: \", test_labels.shape)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "### Interpolation with EigenPro\n", 94 | "\n", 95 | "Below we use the Laplace kernel to classify MNIST digits from pixels. We make use of the EigenPro library (https://github.com/EigenPro/EigenPro-pytorch) below for solving kernel regression. 
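
For intuition about what the solver computes, here is a minimal sketch of Laplace kernel interpolation solved directly with NumPy instead of EigenPro. It assumes the kernel has the form $\exp(-\|x - z\|_2 / \text{bandwidth})$, which is what we take `kernel.laplacian` to compute; the helper names `laplace_kernel`, `fit_interpolant` and `predict`, as well as the random toy data, are hypothetical and not part of this repository.

```python
import numpy as np

def laplace_kernel(X, Z, bandwidth=10.0):
    # Pairwise Euclidean distances between rows of X and rows of Z.
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    dist = np.sqrt(np.maximum(sq, 0.0))  # guard against tiny negative round-off
    return np.exp(-dist / bandwidth)

def fit_interpolant(X, Y, bandwidth=10.0):
    # Solve K alpha = Y with no regularization, so the model interpolates the labels.
    K = laplace_kernel(X, X, bandwidth)
    return np.linalg.solve(K, Y)

def predict(X_train, alpha, X_new, bandwidth=10.0):
    return laplace_kernel(X_new, X_train, bandwidth) @ alpha

# Toy stand-in for the MNIST subset: unit-norm inputs and random one-hot labels.
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((50, 20))
X_toy /= np.linalg.norm(X_toy, axis=1, keepdims=True)
Y_toy = np.eye(10)[rng.integers(0, 10, size=50)]

alpha = fit_interpolant(X_toy, Y_toy)
train_preds = predict(X_toy, alpha, X_toy)
train_acc = np.mean(train_preds.argmax(axis=1) == Y_toy.argmax(axis=1))
print("train accuracy:", train_acc)
```

Because the Laplace kernel matrix on distinct points is positive definite, the linear system has a unique solution and the fitted model reproduces every training label, including randomly assigned ones. A direct solve costs $O(n^3)$ time, which is why an iterative solver such as EigenPro is preferable at larger $n$.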
" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 15, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "probability: 0.00 & Number of labels corrupted: 0\n" 108 | ] 109 | }, 110 | { 111 | "name": "stderr", 112 | "output_type": "stream", 113 | "text": [ 114 | "/home/aradha/princeton_dl_tutorial/eigenpro.py:102: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", 115 | " requires_grad=False).to(self.device)\n" 116 | ] 117 | }, 118 | { 119 | "name": "stdout", 120 | "output_type": "stream", 121 | "text": [ 122 | "SVD time: 0.69, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 123 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 124 | "train error: 0.00%\tval error: 4.58% (150 epochs, 1.47 seconds)\ttrain l2: 1.57e-07\tval l2: 1.44e-02\n", 125 | "probability: 0.10 & Number of labels corrupted: 371\n", 126 | "SVD time: 0.69, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 127 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 128 | "train error: 0.00%\tval error: 4.89% (150 epochs, 1.44 seconds)\ttrain l2: 2.14e-06\tval l2: 1.77e-02\n", 129 | "probability: 0.20 & Number of labels corrupted: 805\n", 130 | "SVD time: 0.70, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 131 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 132 | "train error: 0.00%\tval error: 5.56% (150 epochs, 1.44 seconds)\ttrain l2: 5.00e-06\tval l2: 2.27e-02\n", 133 | "probability: 0.30 & Number of labels corrupted: 1204\n", 134 | "SVD time: 0.69, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 135 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 136 
| "train error: 0.00%\tval error: 7.54% (150 epochs, 1.44 seconds)\ttrain l2: 7.06e-06\tval l2: 2.89e-02\n", 137 | "probability: 0.40 & Number of labels corrupted: 1610\n", 138 | "SVD time: 0.69, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 139 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 140 | "train error: 0.00%\tval error: 10.51% (150 epochs, 1.44 seconds)\ttrain l2: 8.20e-06\tval l2: 3.57e-02\n", 141 | "probability: 0.50 & Number of labels corrupted: 2016\n", 142 | "SVD time: 0.69, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 143 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 144 | "train error: 0.00%\tval error: 15.34% (150 epochs, 1.44 seconds)\ttrain l2: 9.80e-06\tval l2: 4.36e-02\n", 145 | "probability: 0.60 & Number of labels corrupted: 2429\n", 146 | "SVD time: 0.70, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 147 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 148 | "train error: 0.00%\tval error: 23.33% (150 epochs, 1.45 seconds)\ttrain l2: 1.12e-05\tval l2: 5.31e-02\n", 149 | "probability: 0.70 & Number of labels corrupted: 2809\n", 150 | "SVD time: 0.70, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 151 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 152 | "train error: 0.00%\tval error: 35.97% (150 epochs, 1.44 seconds)\ttrain l2: 1.26e-05\tval l2: 6.29e-02\n", 153 | "probability: 0.80 & Number of labels corrupted: 3189\n", 154 | "SVD time: 0.69, top_q: 36, top_eigval: 0.90, new top_eigval: 5.20e-04\n", 155 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 156 | "train error: 0.00%\tval error: 53.70% (150 epochs, 1.44 seconds)\ttrain l2: 1.31e-05\tval l2: 7.45e-02\n", 157 | "probability: 0.90 & Number of labels corrupted: 3583\n", 158 | "SVD time: 0.70, top_q: 36, top_eigval: 0.90, 
new top_eigval: 5.20e-04\n", 159 | "n_subsamples=2000, bs_gpu=2000, eta=1923.82, bs=1905, top_eigval=8.97e-01, beta=0.99\n", 160 | "train error: 0.00%\tval error: 73.56% (150 epochs, 1.44 seconds)\ttrain l2: 1.33e-05\tval l2: 8.73e-02\n" 161 | ] 162 | }, 163 | { 164 | "data": { 165 | "image/png": "\n", 166 | "text/plain": [ 167 | "
" 168 | ] 169 | }, 170 | "metadata": { 171 | "needs_background": "light" 172 | }, 173 | "output_type": "display_data" 174 | } 175 | ], 176 | "source": [ 177 | "from numpy.linalg import pinv, solve\n", 178 | "import numpy as np\n", 179 | "import time \n", 180 | "import random\n", 181 | "import matplotlib.pyplot as plt\n", 182 | "%matplotlib inline\n", 183 | "\n", 184 | "import kernel\n", 185 | "import eigenpro\n", 186 | "import torch\n", 187 | "\n", 188 | "SEED = 2134\n", 189 | "np.random.seed(SEED)\n", 190 | "random.seed(SEED)\n", 191 | "\n", 192 | "def mse(preds, labels): \n", 193 | " return np.mean(np.abs(np.power(preds - labels, 2)))\n", 194 | "\n", 195 | "def numpy_acc(preds, labels):\n", 196 | " preds_max = np.argmax(preds, axis=0)\n", 197 | " labels_max = np.argmax(labels, axis=0)\n", 198 | " return np.mean(preds_max == labels_max)\n", 199 | "\n", 200 | "\n", 201 | "X = train_set.cpu().data.numpy().astype(\"float32\")\n", 202 | "y = train_labels.cpu().data.numpy().astype(\"float32\")\n", 203 | "X_test = test_set.cpu().data.numpy().astype(\"float32\")\n", 204 | "y_test = test_labels.cpu().data.numpy().astype(\"float32\")\n", 205 | "\n", 206 | "possible_labels = np.eye(10)\n", 207 | "random_idxs = np.random.randint(low=0, high=10, size=len(y))\n", 208 | "random_labels = possible_labels[random_idxs, :]\n", 209 | "\n", 210 | "random_test_idxs = np.random.randint(low=0, high=10, size=len(y_test))\n", 211 | "random_test_labels = possible_labels[random_test_idxs, :]\n", 212 | "\n", 213 | "noise_probs = np.linspace(0, .9, 10)\n", 214 | "train_errors = []\n", 215 | "test_errors = []\n", 216 | "for p in noise_probs:\n", 217 | " choice = np.random.uniform(size=y.shape[0])\n", 218 | " choice = np.where(choice < p, 1, 0)\n", 219 | " y[choice==1] = random_labels[choice==1]\n", 220 | "\n", 221 | " # Uncomment if you want to corrupt the labels for test data as well\n", 222 | " # choice_test = np.random.uniform(size=y_test.shape[0])\n", 223 | " # choice_test = 
np.where(choice_test < p, 1, 0)\n", 224 | " # y_test[choice_test==1] = random_test_labels[choice_test==1]\n", 225 | " \n", 226 | " print(\"probability: %.2f & Number of labels corrupted: %d\"%(p, np.sum(choice))) \n", 227 | "\n", 228 | " use_cuda = torch.cuda.is_available()\n", 229 | " device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n", 230 | " n_class = 10\n", 231 | " num_epochs=150\n", 232 | " kernel_fn = lambda x,y: kernel.laplacian(x, y, bandwidth=10)\n", 233 | " model = eigenpro.FKR_EigenPro(kernel_fn, X, n_class, device=device)\n", 234 | " res = model.fit(X, y, X_test, y_test, epochs=[num_epochs], mem_gb=12)\n", 235 | " train_errors.append(1 - res[num_epochs][0]['multiclass-acc'])\n", 236 | " test_errors.append(1 - res[num_epochs][1]['multiclass-acc'])\n", 237 | "\n", 238 | "plt.title(\"Interpolation in the Presence of Noisy Data\")\n", 239 | "plt.xlabel(\"Added Label Noise %\")\n", 240 | "plt.ylabel(\"Classification Error (1 - Acc.)\")\n", 241 | "plt.plot(noise_probs, test_errors, 'rx--', label='Test Error')\n", 242 | "plt.plot(noise_probs, train_errors, 'k--', label='Training Error')\n", 243 | "plt.legend()\n", 244 | "plt.show()" 245 | ] 246 | } 247 | ], 248 | "metadata": { 249 | "kernelspec": { 250 | "display_name": "dl_tutorial", 251 | "language": "python", 252 | "name": "dl_tutorial" 253 | }, 254 | "language_info": { 255 | "codemirror_mode": { 256 | "name": "ipython", 257 | "version": 3 258 | }, 259 | "file_extension": ".py", 260 | "mimetype": "text/x-python", 261 | "name": "python", 262 | "nbconvert_exporter": "python", 263 | "pygments_lexer": "ipython3", 264 | "version": "3.7.10" 265 | } 266 | }, 267 | "nbformat": 4, 268 | "nbformat_minor": 2 269 | } 270 | -------------------------------------------------------------------------------- /NTKAnalysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5992bf0f", 6 | "metadata": {}, 7 | 
"source": [ 8 | "# NTK Derivation and Analysis\n", 9 | "\n", 10 | "In this Notebook, we will derive a closed form for the NTK for 1 hidden layer ReLU networks. We will then present experiments to show that the NTK can be used to describe the behavior of large width neural networks. \n", 11 | "\n", 12 | "We begin with a derivation of the NTK below (this is basically the solution to Section 2 Problem 2 of the worksheet shared in the github). " 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "a8dbae44", 18 | "metadata": {}, 19 | "source": [ 20 | "## Derivation of the NTK \n", 21 | "\n", 22 | "Suppose we are given a dataset $\\{(x^{(i)}, y^{(i)}\\}_{i=1}^{n} \\subset \\mathbb{R}^{d} \\times \\mathbb{R}$ (also written as $X \\in \\mathbb{R}^{d \\times n}, y \\in \\mathbb{R}^{1 \\times n}$). Let $f$ denote a 1 hidden layer neural network with parameters $\\mathbf{W}$. To train the neural network $f$ to fit the data $(X, y)$, we typically use gradient descent to minimize the following loss: \n", 23 | "\n", 24 | "\\begin{align*}\n", 25 | "\\mathcal{L}(\\mathbf{W}) = \\sum_{i=1}^{n} (y^{(i)} - f(\\mathbf{W} ; x^{(i)}))^2 \n", 26 | "\\end{align*}\n", 27 | "\n", 28 | "**Important:** Note that the network $f$ is written as a function of parameters $\\mathbf{W}$ and data $x^{(i)}$, as opposed to just data. For the neural tangent kernel derivation, we consider the cross section of $f$ given by fixing the data component and writing the neural network as a function of parameters, i.e. consider $f_x(\\mathbf{W}): \\mathbb{R}^{dk + k} \\to \\mathbb{R}$. \n", 29 | "\n", 30 | "### Linearization around Initialization\n", 31 | "Before training the network as usual, let us consider the following alternative. 
Viewing the neural network as only a function of parameters, we train the linear approximation for $f_x(\\mathbf{W})$, which is given as follows: \n", 32 | "\n", 33 | "\\begin{align*}\n", 34 | "\\tilde{f_x}(\\mathbf{W}) = f_x(\\mathbf{W}^{(0)}) + \\nabla f_x(\\mathbf{W}^{(0)})^T (\\mathbf{W} - \\mathbf{W}^{(0)}) ~;\n", 35 | "\\end{align*}\n", 36 | "where $\\mathbf{W}^{(0)} \\in \\mathbb{R}^{dk + k}$ denotes the parameters at initialization and $\\nabla f_x(\\mathbf{W}^{(0)})^T \\in \\mathbb{R}^{1 \\times (dk + k)}$ denotes the (transposed) gradient of $f_x$ evaluated at $\\mathbf{W}^{(0)}$. Instead of minimizing the loss for the model $f(\\mathbf{W} ; x^{(i)})$ given above, we minimize the following loss: \n", 37 | "\n", 38 | "\\begin{align*}\n", 39 | " \\tilde{\\mathcal{L}}(\\mathbf{W}) = \\sum_{i=1}^{n} (y^{(i)} - \\tilde{f}_{x^{(i)}}(\\mathbf{W}))^2 = \\sum_{i=1}^{n} (y^{(i)} - f_{x^{(i)}}(\\mathbf{W}^{(0)}) - \\nabla f_{x^{(i)}}(\\mathbf{W}^{(0)})^T (\\mathbf{W} - \\mathbf{W}^{(0)}))^2\n", 40 | "\\end{align*}\n", 41 | "\n", 42 | "Minimizing this loss naively can be computationally expensive since the dimension of the vector $\\mathbf{W} \\in \\mathbb{R}^{kd + k}$ grows with $k$, which can be arbitrarily large. To remedy this, we let $\\mathbf{W} = \\mathbf{W}^{(0)} + \\sum_{i=1}^{n} \\nabla f_{x^{(i)}}(\\mathbf{W}^{(0)})\\alpha_i$. \n", 43 | "\n", 44 | "\n", 45 | "**Remark:** At this point, you should be asking why this is a reasonable step to take. The rationale is that this parameterization lets us find the minimum norm minimizer, which lies in the span of the gradients $\\nabla f_{x^{(i)}}(\\mathbf{W}^{(0)})$ evaluated at the training points. If you haven't seen this trick before, I encourage you to review the Representer theorem. 
\n", 46 | "\n", 47 | "Using the new form for $\\mathbf{W}$, we can simplify our loss $\\tilde{\\mathcal{L}}(\\mathbf{W})$ as follows: \n", 48 | "\\begin{align*}\n", 49 | "\\tilde{\\mathcal{L}}(\\mathbf{W}) = \\sum_{i=1}^{n} (y^{(i)} - f_{x^{(i)}}(\\mathbf{W}^{(0)}) - \\alpha k(x^{(i)}) )^2 ~;\n", 50 | "\\end{align*}\n", 51 | "where $\\alpha \\in \\mathbb{R}^{1 \\times n}$ and $$k(x) = \\begin{bmatrix} \\langle \\nabla f_{x}(\\mathbf{W}^{(0)}), \\nabla f_{x^{(1)}}(\\mathbf{W}^{(0)}) \\rangle \\\\ \\langle \\nabla f_{x}(\\mathbf{W}^{(0)}), \\nabla f_{x^{(2)}}(\\mathbf{W}^{(0)}) \\rangle \\\\ \\vdots \\\\ \\langle \\nabla f_{x}(\\mathbf{W}^{(0)}), \\nabla f_{x^{(n)}}(\\mathbf{W}^{(0)}) \\rangle \\end{bmatrix} \\in \\mathbb{R}^{n}$$\n", 52 | "\n", 53 | "We can now recognize minimizing the loss $\\tilde{\\mathcal{L}}(\\mathbf{W})$ as solving the following system of equations: \n", 54 | "\\begin{align*}\n", 55 | " \\alpha K = y - f_X(\\mathbf{W}^{(0)}) ~;\n", 56 | "\\end{align*}\n", 57 | "where $K \\in \\mathbb{R}^{n \\times n}$ with $K_{i,j} = \\langle \\nabla f_{x^{(i)}}(\\mathbf{W}^{(0)}), \\nabla f_{x^{(j)}}(\\mathbf{W}^{(0)}) \\rangle$ and $f_X(\\mathbf{W}^{(0)}) \\in \\mathbb{R}^{1 \\times n}$ with $f_X(\\mathbf{W}^{(0)})_i = f_{x^{(i)}}(\\mathbf{W}^{(0)})$. \n", 58 | "\n", 59 | "**Definition [NTK]:** The entries $K_{i,j}$ above are evaluations of the following function, called the Neural Tangent Kernel:\n", 60 | "$$ K(x, x') = \\langle \\nabla f_{x}(\\mathbf{W}^{(0)}), \\nabla f_{x'}(\\mathbf{W}^{(0)}) \\rangle ~. $$ \n", 61 | "\n", 62 | "**Remarks:** This kernel can of course be evaluated using any automatic differentiation software (e.g. PyTorch, TensorFlow, JAX, etc.). Doing so is generally expensive in both memory and runtime, since neural networks can have millions or billions of parameters. On the other hand, we can compute the kernel $K$ analytically when the width of the neural network approaches infinity. We do this below. 
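
To make the definition concrete, the following is a minimal sketch of evaluating the NTK at initialization with PyTorch automatic differentiation. It assumes the 1 hidden layer ReLU network without biases, $c = 2$, and i.i.d. standard normal initialization; the helper names `make_params`, `f`, `grad_vector` and `empirical_ntk`, and the toy inputs, are hypothetical and not part of this repository.

```python
import math
import torch

def make_params(d, k, seed=0):
    # a in R^{1 x k}, B in R^{k x d}, all entries i.i.d. N(0, 1).
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(1, k, generator=g, requires_grad=True)
    B = torch.randn(k, d, generator=g, requires_grad=True)
    return a, B

def f(params, x, c=2.0):
    # f(W; x) = a * sqrt(c / k) * relu(B x), a scalar for a single input x in R^d.
    a, B = params
    k = B.shape[0]
    return (a @ torch.relu(B @ x)).squeeze() * math.sqrt(c / k)

def grad_vector(params, x):
    # Gradient of f with respect to all parameters, flattened into one vector.
    grads = torch.autograd.grad(f(params, x), params)
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(params, X):
    # K[i, j] = <grad f_{x_i}(W0), grad f_{x_j}(W0)>.
    J = torch.stack([grad_vector(params, x) for x in X])  # shape (n, k + dk)
    return J @ J.T

torch.manual_seed(0)
X = torch.randn(5, 10)
X = X / X.norm(dim=1, keepdim=True)  # unit-norm inputs
params = make_params(d=10, k=2048)
K = empirical_ntk(params, X)
print(K.shape)
```

The matrix `J` stores one gradient of length $k + dk$ per example, which is exactly the memory cost that makes this approach expensive for large networks.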
\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "id": "3f964c28", 68 | "metadata": {}, 69 | "source": [ 70 | "### Analytical Evaluation of the NTK (1 Hidden Layer)\n", 71 | "Thus far, we have defined the NTK without explicitly computing it for a given architecture. We now write a closed form for the NTK given a specific architecture. In particular, let $f$ denote a 1 hidden layer network defined as follows: \n", 72 | "\\begin{align*}\n", 73 | " f(\\mathbf{W} ; x) = a \\frac{\\sqrt{c}}{\\sqrt{k}} \\phi(Bx) ~;\n", 74 | "\\end{align*}\n", 75 | "where $a \\in \\mathbb{R}^{1 \\times k}, B \\in \\mathbb{R}^{k \\times d}$ are the trainable parameters ($\\mathbf{W} = [a_1, a_2, \\ldots a_k, B_{1,1}, B_{1,2}, \\ldots B_{k, d}]^T \\in \\mathbb{R}^{k + dk}$ denotes the vector containing all trainable parameters), $c \\in \\mathbb{R}$ is an absolute constant, and $\\phi: \\mathbb{R} \\to \\mathbb{R}$ is an elementwise nonlinearity. \n", 76 | "\n", 77 | "Let us now compute the NTK $K(x, x') = \\langle \\nabla f_{x}(\\mathbf{W}^{(0)}), \\nabla f_{x'}(\\mathbf{W}^{(0)}) \\rangle$ as $k \\to \\infty$ assuming that $\\mathbf{W}_j^{(0)} \\overset{i.i.d.}{\\sim} \\mathcal{N}(0, 1)$. 
Letting $\\mathbf{W} = [a_1, a_2, \\ldots a_k, B_{1,1}, B_{1,2}, \\ldots B_{k, d}] $, we compute $\\nabla f_{x}(\\mathbf{W}^{(0)})$ as follows: \n", 78 | "\n", 79 | "\\begin{align*}\n", 80 | " \\nabla f_{x}(\\mathbf{W}) = \\begin{bmatrix}\\frac{\\partial f_{x}}{\\partial a_1} \\\\ \\frac{\\partial f_{x}}{\\partial a_2} \\\\ \\vdots \\\\ \\frac{\\partial f_{x}}{\\partial a_k} \\\\ \\frac{\\partial f_{x}}{\\partial B_{1,1}} \\\\ \\vdots \\\\ \\frac{\\partial f_{x}}{\\partial B_{k, d}}\n", 81 | " \\end{bmatrix}\n", 82 | "\\end{align*}\n", 83 | "\n", 84 | "We thus first calculate $\\frac{\\partial f_{x}}{\\partial a_j}$ and $\\frac{\\partial f_{x}}{\\partial B_{j, \\ell}}$: \n", 85 | "\\begin{align*}\n", 86 | " \\frac{\\partial f_{x}}{\\partial a_j} = \\frac{\\sqrt{c}}{\\sqrt{k}} \\phi(B_{j, :} x) \\\\\n", 87 | " \\frac{\\partial f_{x}}{\\partial B_{j, \\ell}} = a_j \\frac{\\sqrt{c}}{\\sqrt{k}} \\phi'(B_{j,:}x) x_{\\ell} \n", 88 | "\\end{align*}\n", 89 | "\n", 90 | "Now that we have all the relevant terms to compute $\\nabla f_x(\\mathbf{W}^{(0)})$, we can compute $K(x, x')$ as follows: \n", 91 | "\\begin{align*}\n", 92 | " K(x, x') &= \\langle \\nabla f_{x}(\\mathbf{W}^{(0)}), \\nabla f_{x'}(\\mathbf{W}^{(0)}) \\rangle \\\\\n", 93 | " &= \\sum_{j=1}^{k} \\frac{\\partial f_x(\\mathbf{W}^{(0)})}{\\partial a_j} \\frac{\\partial f_{x'}(\\mathbf{W}^{(0)})}{\\partial a_j} + \\sum_{j=1}^{k} \\sum_{\\ell = 1}^{d} \\frac{\\partial f_x(\\mathbf{W}^{(0)})}{\\partial B_{j, \\ell}} \\frac{\\partial f_{x'}(\\mathbf{W}^{(0)})}{\\partial B_{j, \\ell}} \\\\\n", 94 | " &= \\color{red}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} \\phi(B_{j, :} x) \\phi(B_{j, :} x')$}} ~ + ~ \\color{blue}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} \\sum_{\\ell=1}^{d} a_j^2 \\phi'(B_{j, :} x) \\phi'(B_{j, :} x') x_{\\ell} x'_{\\ell}$}} \\\\\n", 95 | " &= \\color{red}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} \\phi(B_{j, :} x) \\phi(B_{j, :} x')$}} ~ + ~ \\color{blue}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} a_j^2 \\phi'(B_{j, 
:} x) \\phi'(B_{j, :} x') \\sum_{\\ell=1}^{d} x_{\\ell} x'_{\\ell}$}} \\\\\n", 96 | " &= \\color{red}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} \\phi(B_{j, :} x) \\phi(B_{j, :} x')$}} ~ + ~ \\langle x, x' \\rangle \\color{blue}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} a_j^2 \\phi'(B_{j, :} x) \\phi'(B_{j, :} x') $}} \n", 97 | "\\end{align*}\n", 98 | "\n", 99 | "**Remark:** Do the red and blue terms look familiar? If you worked through the notebook *DoubleDescentTutorial*, they should. Indeed, as $k \\to \\infty$, the terms in the red and blue correspond to the NNGP kernel for a network with activation $\\phi$ and $\\phi'$ respectively. We know how to evaluate these using dual activations. Namely, we have: \n", 100 | "\\begin{align*}\n", 101 | " \\color{red}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} \\phi(B_{j, :} x) \\phi(B_{j, :} x')$}} &\\to c \\mathbb{E}_{(u, v) \\sim \\mathcal{N}(\\mathbf{0}, \\Lambda)} [\\phi(u) \\phi(v) ] \\\\\n", 102 | " \\color{blue}{\\text{$\\frac{c}{k} \\sum_{j=1}^{k} a_j^2 \\phi'(B_{j, :} x) \\phi'(B_{j, :} x') $}} &\\to c \\mathbb{E}_{(u, v) \\sim \\mathcal{N}(\\mathbf{0}, \\Lambda)} [\\phi'(u) \\phi'(v)] \\\\\n", 103 | " \\Lambda &= \\begin{bmatrix} \\|x\\|_2^2 & x^T x' \\\\ x^T x' & \\|x'\\|_2^2 \\end{bmatrix}\n", 104 | "\\end{align*}\n", 105 | "\n", 106 | "Let $\\xi = x^T x'$ and $\\check{\\phi}$ denote the dual of $\\phi$. Assuming $\\phi$ is homogeneous of degree 1 and that $\\|x\\|_2 = \\|x'\\|_2 = 1$ we conclude: \n", 107 | "\\begin{align*}\n", 108 | " K(x, x') = \\check{\\phi}(\\xi) + \\xi \\check{\\phi'}(\\xi)\n", 109 | "\\end{align*}\n", 110 | "\n", 111 | "Recalling that the dual activation is computed in closed form for a number of nonlinearities including ReLU, we now have a closed form for the NTK. Next, let's try training some simple neural networks to verify that the NTK does describe the training dynamics of large neural networks. 
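
As an illustration of the closed form, here is a minimal sketch for the ReLU case. It assumes $c = 2$ and unit-norm inputs, and it uses the standard Gaussian expectations for ReLU and its derivative (the arc-cosine kernel formulas) as the dual activations; the function names `relu_dual`, `relu_derivative_dual` and `relu_ntk` are hypothetical and are not taken from `kernel.py`.

```python
import numpy as np

def relu_dual(xi):
    # c * E[relu(u) relu(v)] with c = 2, unit variances and correlation xi.
    xi = np.clip(xi, -1.0, 1.0)
    return (xi * (np.pi - np.arccos(xi)) + np.sqrt(1.0 - xi ** 2)) / np.pi

def relu_derivative_dual(xi):
    # c * E[relu'(u) relu'(v)] with c = 2, unit variances and correlation xi.
    xi = np.clip(xi, -1.0, 1.0)
    return (np.pi - np.arccos(xi)) / np.pi

def relu_ntk(X, Z):
    # K(x, x') = dual(xi) + xi * derivative_dual(xi), where xi = <x, x'>.
    # The rows of X and Z are assumed to have unit Euclidean norm.
    xi = np.clip(X @ Z.T, -1.0, 1.0)
    return relu_dual(xi) + xi * relu_derivative_dual(xi)

# Standard basis vectors: unit norm and mutually orthogonal.
E = np.eye(3)
K_basis = relu_ntk(E, E)
print(K_basis)
```

Under these assumptions $K(x, x) = 2$ for any unit-norm $x$, and $K(x, x') = 1/\pi$ when $x$ and $x'$ are orthogonal, which gives two quick sanity checks on an implementation.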
" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "id": "30982ecb", 117 | "metadata": {}, 118 | "source": [ 119 | "## Training Neural Nets vs. Using the NTK" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 170, 125 | "id": "fb4a3f93", 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stdout", 130 | "output_type": "stream", 131 | "text": [ 132 | "(32, 100) (32, 1) (100, 100) (100, 1)\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "# Loading high dimensional linear data\n", 138 | "import dataloader as dl\n", 139 | "import numpy as np\n", 140 | "from numpy.linalg import norm\n", 141 | "import matplotlib.pyplot as plt\n", 142 | "%matplotlib inline\n", 143 | "\n", 144 | "SEED = 2134\n", 145 | "\n", 146 | "np.random.seed(SEED)\n", 147 | "d = 100\n", 148 | "n = 32\n", 149 | "n_test = 100\n", 150 | "\n", 151 | "X = np.random.randn(n, d)\n", 152 | "X = X / norm(X, axis=-1).reshape(-1, 1)\n", 153 | "X_test = np.random.randn(n_test, d)\n", 154 | "X_test = X_test / norm(X_test, axis=-1).reshape(-1, 1)\n", 155 | "w = np.random.randn(1, d)\n", 156 | "y = (w @ X.T).T\n", 157 | "y_test = (w @ X_test.T).T\n", 158 | "print(X.shape, y.shape, X_test.shape, y_test.shape)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 171, 164 | "id": "c1ae559d", 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "## We now need to define and train a neural network to map x^{(i)} to y^{(i)}\n", 169 | "import torch\n", 170 | "import torch.nn as nn\n", 171 | "import torch.nn.functional as F\n", 172 | "\n", 173 | "# Abstraction for nonlinearity \n", 174 | "class Nonlinearity(torch.nn.Module):\n", 175 | " \n", 176 | " def __init__(self):\n", 177 | " super(Nonlinearity, self).__init__()\n", 178 | "\n", 179 | " def forward(self, x):\n", 180 | " # return F.leaky_relu(x)\n", 181 | " return F.relu(x)\n", 182 | " \n", 183 | "class Net(nn.Module):\n", 184 | "\n", 185 | " def __init__(self, width, f_in):\n", 
186 | " super(Net, self).__init__()\n", 187 | "\n", 188 | " self.k = width\n", 189 | " self.first = nn.Sequential(nn.Linear(f_in, self.k, bias=True), \n", 190 | " Nonlinearity())\n", 191 | " self.sec = nn.Linear(self.k, 1, bias=False)\n", 192 | "\n", 193 | " def forward(self, x):\n", 194 | " #C = np.sqrt(2/(.01**2 + 1)) * 1/np.sqrt(self.k)\n", 195 | " C = np.sqrt(2/self.k)\n", 196 | " o = self.first(x) * C\n", 197 | " return self.sec(o)" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 185, 203 | "id": "3e5cc8cf", 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "torch.Size([16000, 100])\n", 211 | "torch.Size([16000])\n", 212 | "torch.Size([1, 16000])\n" 213 | ] 214 | }, 215 | { 216 | "data": { 217 | "application/vnd.jupyter.widget-view+json": { 218 | "model_id": "cc04b360b5454c97ab466ddb2d35d864", 219 | "version_major": 2, 220 | "version_minor": 0 221 | }, 222 | "text/plain": [ 223 | " 0%| | 0/100000 [00:00 0 47 | kernel_mat = euclidean_distances(samples, centers) 48 | kernel_mat.clamp_(min=0) 49 | gamma = 1. / (2 * bandwidth ** 2) 50 | kernel_mat.mul_(-gamma) 51 | kernel_mat.exp_() 52 | return kernel_mat 53 | 54 | 55 | def laplacian(samples, centers, bandwidth): 56 | '''Laplacian kernel. 57 | 58 | Args: 59 | samples: of shape (n_sample, n_feature). 60 | centers: of shape (n_center, n_feature). 61 | bandwidth: kernel bandwidth. 62 | 63 | Returns: 64 | kernel matrix of shape (n_sample, n_center). 65 | ''' 66 | assert bandwidth > 0 67 | kernel_mat = euclidean_distances(samples, centers, squared=False) 68 | kernel_mat.clamp_(min=0) 69 | gamma = 1. / bandwidth 70 | kernel_mat.mul_(-gamma) 71 | kernel_mat.exp_() 72 | return kernel_mat 73 | 74 | 75 | def dispersal(samples, centers, bandwidth, gamma): 76 | '''Dispersal kernel. 77 | 78 | Args: 79 | samples: of shape (n_sample, n_feature). 80 | centers: of shape (n_center, n_feature). 81 | bandwidth: kernel bandwidth. 
82 | gamma: dispersal factor. 83 | 84 | Returns: 85 | kernel matrix of shape (n_sample, n_center). 86 | ''' 87 | assert bandwidth > 0 88 | kernel_mat = euclidean_distances(samples, centers) 89 | kernel_mat.pow_(gamma / 2.) 90 | kernel_mat.mul_(-1. / bandwidth) 91 | kernel_mat.exp_() 92 | return kernel_mat 93 | -------------------------------------------------------------------------------- /svd.py: -------------------------------------------------------------------------------- 1 | '''Utility functions for performing fast SVD.''' 2 | import scipy.linalg as linalg 3 | import numpy as np 4 | 5 | import utils 6 | 7 | 8 | def nystrom_kernel_svd(samples, kernel_fn, top_q): 9 | """Compute top eigensystem of kernel matrix using Nystrom method. 10 | 11 | Arguments: 12 | samples: data matrix of shape (n_sample, n_feature). 13 | kernel_fn: tensor function k(X, Y) that returns kernel matrix. 14 | top_q: top-q eigensystem. 15 | 16 | Returns: 17 | eigvals: top eigenvalues of shape (top_q). 18 | eigvecs: (rescaled) top eigenvectors of shape (n_sample, top_q). 19 | """ 20 | 21 | n_sample, _ = samples.shape 22 | kmat = kernel_fn(samples, samples).cpu().data.numpy() 23 | scaled_kmat = kmat / n_sample 24 | vals, vecs = linalg.eigh(scaled_kmat, 25 | subset_by_index=[n_sample - top_q, n_sample - 1]) 26 | eigvals = vals[::-1][:top_q] 27 | eigvecs = vecs[:, ::-1][:, :top_q] / np.sqrt(n_sample) 28 | 29 | return utils.float_x(eigvals), utils.float_x(eigvecs) 30 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | '''Helper functions.''' 2 | import numpy as np 3 | 4 | 5 | def float_x(data): 6 | '''Set data array precision.''' 7 | return np.float32(data) 8 | --------------------------------------------------------------------------------
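The NTK derivation in NTKAnalysis.ipynb can be checked numerically. The following is a minimal sketch, not code from the repository: it assumes $\phi$ is ReLU, $c = 2$, unit-norm inputs, no bias term, and $a_j, B_{j,\ell} \sim \mathcal{N}(0, 1)$, exactly as in the markdown derivation (the notebook's `Net` class uses a bias and PyTorch's default initialization, so it is not reproduced here). The function names `relu_ntk_closed_form` and `relu_ntk_empirical` are hypothetical. The closed form uses the standard ReLU dual activations (the arc-cosine kernels), and the empirical version evaluates the red and blue sums at finite width $k$.

```python
import numpy as np


def relu_ntk_closed_form(xi):
    """Infinite-width NTK for ReLU with c = 2 and unit-norm inputs.

    xi is the inner product <x, x'>. Assumes the setup of the derivation
    (no bias, standard normal parameters).
    """
    xi = np.clip(xi, -1.0, 1.0)  # guard arccos against round-off
    theta = np.arccos(xi)
    # Red term limit: c * E[relu(u) relu(v)]
    dual_relu = (xi * (np.pi - theta) + np.sqrt(1.0 - xi ** 2)) / np.pi
    # Blue term limit: c * E[relu'(u) relu'(v)]
    dual_step = (np.pi - theta) / np.pi
    return dual_relu + xi * dual_step


def relu_ntk_empirical(x, x_prime, k, c=2.0, rng=None):
    """Finite-width NTK at initialization, using the red and blue sums."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal(k)                # second-layer weights a_j
    B = rng.standard_normal((k, x.shape[0]))  # first-layer weights B_{j, l}
    u, v = B @ x, B @ x_prime                 # pre-activations B_{j,:} x and B_{j,:} x'
    red = (c / k) * np.sum(np.maximum(u, 0.0) * np.maximum(v, 0.0))
    blue = (c / k) * np.sum(a ** 2 * (u > 0) * (v > 0))
    return red + np.dot(x, x_prime) * blue


rng = np.random.default_rng(0)
x = rng.standard_normal(10)
x /= np.linalg.norm(x)
x_prime = rng.standard_normal(10)
x_prime /= np.linalg.norm(x_prime)

exact = relu_ntk_closed_form(float(x @ x_prime))
approx = relu_ntk_empirical(x, x_prime, k=200_000, rng=rng)
print("closed form:", exact, "finite width:", approx)
```

Under these assumptions the two values should agree up to Monte Carlo error of order $1/\sqrt{k}$, so increasing `k` tightens the match.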