├── README.md ├── lectures ├── lecture01.pdf ├── lecture02.pdf ├── lecture03.pdf ├── lecture04.pdf ├── lecture05.pdf ├── lecture06.pdf ├── lecture07.pdf ├── lecture08.pdf ├── lecture09.pdf └── lecture10.pdf └── practicals ├── README.md ├── compressed_sensing.ipynb ├── implicit_regularization.ipynb ├── limitations_of_gradient_based_learning.ipynb ├── model_selection_aggregation.ipynb ├── offset_rademacher_complexity.ipynb ├── optimization.ipynb ├── restricted_eigenvalue.ipynb ├── robust_mean_estimation.ipynb └── solved_practicals ├── README.md ├── compressed_sensing.ipynb ├── implicit_regularization.ipynb ├── limitations_of_gradient_based_learning.ipynb ├── model_selection_aggregation.ipynb ├── offset_rademacher_complexity.ipynb ├── optimization.ipynb └── robust_mean_estimation.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Mathematics of Machine Learning - Summer School 2 | 3 | This repository contains the practical session notebooks for the 4 | [Mathematics of Machine Learning summer school](https://www.turingevents.co.uk/turingevents/frontend/reg/thome.csp?pageID=19480&eventID=60). 5 | 6 | ## Schedule 7 | 8 | **DAY 1** 9 | | Activity | Topic | 10 | | ---- | ---- | 11 | | Lecture 1 | Introduction | 12 | | Practical 1| Robust One-Dimensional Mean Estimation | 13 | | Lecture 2 | Concentration Inequalities. Bounds in Probability | 14 | | Practical 2 | Model Selection Aggregation (Exercises 1-8) | 15 | 16 | **DAY 2** 17 | | Activity | Topic | 18 | | ---- | ---- | 19 | | Lecture 3 | Bernstein’s Concentration Inequalities. Fast Rates | 20 | | Practical 3 | Model Selection Aggregation (Exercises 9-12) | 21 | | Lecture 4 | Maximal Inequalities and Rademacher Complexity | 22 | | Practical 4 | Offset Rademacher Complexity | 23 | 24 | **DAY 3** 25 | | Activity | Topic | 26 | | ---- | ---- | 27 | | Lecture 5 | Convex Loss Surrogates. Gradient Descent | 28 | | Practical 5 | Optimization (Exercises 1-4) | 29 | | Lecture 6 | Mirror Descent | 30 | | Practical 6 | Optimization (Exercises 5-6) | 31 | 32 | **DAY 4** 33 | | Activity | Topic | 34 | | ---- | ---- | 35 | | Lecture 7 | Stochastic Methods. Algorithmic Stability | 36 | | Practical 7 | Limitations of Gradient-Based Learning | 37 | | Lecture 8 | Least Squares. Implicit Bias and Regularization | 38 | | Practical 8 | Implicit Regularization | 39 | 40 | **DAY 5** 41 | | Activity | Topic | 42 | | ---- | ---- | 43 | | Lecture 9 | High-Dimensional Statistics. Gaussian Complexity | 44 | | Practical 9 | Compressed Sensing | 45 | | Lecture 10 | The Lasso Estimator. 
Proximal Gradient Methods | 46 | | Practical 10 | Restricted Eigenvalue Condition | 47 | 48 | -------------------------------------------------------------------------------- /lectures/lecture01.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture01.pdf -------------------------------------------------------------------------------- /lectures/lecture02.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture02.pdf -------------------------------------------------------------------------------- /lectures/lecture03.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture03.pdf -------------------------------------------------------------------------------- /lectures/lecture04.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture04.pdf -------------------------------------------------------------------------------- /lectures/lecture05.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture05.pdf -------------------------------------------------------------------------------- /lectures/lecture06.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture06.pdf -------------------------------------------------------------------------------- /lectures/lecture07.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture07.pdf -------------------------------------------------------------------------------- /lectures/lecture08.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture08.pdf -------------------------------------------------------------------------------- /lectures/lecture09.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture09.pdf -------------------------------------------------------------------------------- /lectures/lecture10.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alan-turing-institute/mathematics-of-ml-course/7e0aa0918ba2fec644160689c50767ce04439291/lectures/lecture10.pdf -------------------------------------------------------------------------------- /practicals/README.md: 
-------------------------------------------------------------------------------- 1 | # Instructions 2 | 3 | ## Option 1 4 | 5 | To start the practical sessions using Google Colab: 6 | - Open https://colab.research.google.com/; 7 | - Go to `File->Open Notebook`; 8 | - Select the `GitHub` tab in the pop-up window; 9 | - Enter `alan-turing-institute/mathematics-of-ml-course` in the search bar; 10 | - Select the practical session corresponding to the schedule outlined at the root of this repository. 11 | 12 | ![Google Colab](https://user-images.githubusercontent.com/8312273/123556234-52d3c100-d78a-11eb-9bc0-4306418c94a9.png) 13 | 14 | ## Option 2 15 | 16 | Clone this repository and run the practical sessions locally. 17 | 18 | 19 | -------------------------------------------------------------------------------- /practicals/compressed_sensing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "compressed_sensing.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "pEmm_fdhRP74" 24 | }, 25 | "source": [ 26 | "# Compressed Sensing\n", 27 | "\n", 28 | " This practical session serves as a gentle introduction to compressed sensing - a signal processing/statistical technique for recovering a sparse signal from an underdetermined system of linear equations. Our main objectives are the following:\n", 29 | "\n", 30 | "- introducing the basis pursuit linear program - a convex relaxation to the combinatorial optimization problem that we want to solve;\n", 31 | "- understanding the interplay between the nullspace of the design matrix and the geometry of the $\\ell_{1}$ norm that allows for sparse signal recovery via the basis pursuit program;\n", 32 | "- introducing the restricted isometry property - a sufficient condition that ensures the success of the basis pursuit program;\n", 33 | "- demonstrating (one of many possible) practical applications of the presented theory — showing how to utilize compressed sensing ideas to design single-pixel cameras.\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "id": "IkeaKF3oNY4f" 40 | }, 41 | "source": [ 42 | "The setup of this practical session can be described as follows. We want to recover some signal vector $w^{\\star} \\in \\mathbb{R}^{d}$ given access to $n \\ll d$ linear measurements $y_{i} = \\langle x_{i}, w^{\\star} \\rangle$. In matrix-vector notation, the observations $y_{i}$ follow the model\n", 43 | "$$\n", 44 | " y = X w^{\\star}, \\tag{1}\n", 45 | "$$\n", 46 | "where $X \\in \\mathbb{R}^{n \\times d}$.\n", 47 | "The key difficulty in recovering $w^{\\star}$ given $(x_{i}, y_{i})_{i=1}^{n}$ stems from the fact that the above linear system is *underdetermined*, and so there exist infinitely many $w$ such that $Xw = y$. Indeed, since $d > n$, the nullspace of $X$ has dimension at least $d - n \\geq 1$, and hence it contains an infinite number of vectors.
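As a toy illustration of this point, take $n = 1$, $d = 2$, $X = (1 \;\; 1)$ and $y = 1$: every $w = (t, 1 - t)$ with $t \in \mathbb{R}$ satisfies $Xw = y$, the nullspace of $X$ is the line spanned by $(1, -1)$, and only two of the infinitely many solutions, namely $(1, 0)$ and $(0, 1)$, are $1$-sparse.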
A sum of any such vector with the target vector $w^{\\star}$ yields a candidate solution to the above linear system.\n", 48 | "\n", 49 | "However, if the signal vector $w^{\\star}$ is known to be $k$-sparse\n", 50 | "(with $k \\ll n$), then the above linear system can be solved *efficiently* for certain design matrices $X$. Understanding sufficient conditions for the solvability of underdetermined linear systems as well as some real-world implications are the primary subjects of this practical session.\n", 51 | "\n", 52 | "We remark that in this practical session, we work under idealized conditions.\n", 53 | "In particular, we assume that the linear measurements $y_{i}$ contain no noise and that the target signal vector $w^{\\star}$ is exactly sparse. Various extensions can be obtained for noisy observations as well as approximately sparse target vectors; the interested reader will find pointers to the existing literature at the end of this notebook." 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": { 59 | "id": "7zdVAd1ABTYJ" 60 | }, 61 | "source": [ 62 | "## Sparse Recovery via Linear Programming" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": { 68 | "id": "6OZjotWRNXBh" 69 | }, 70 | "source": [ 71 | "Let $\\|w\\|_{0}$ denote the \"$\\ell_{0}$ norm\" of $w$, equal to the number of non-zero coordinates of $w$. Given the knowledge that the true signal vector $w^{\\star}$ is sparse, arguably the most natural approach to recovering $w^{\\star}$ from the underdetermined linear system $(1)$ is to look for a vector $w$ with fewest non-zero entries that is consistent with the observations, that is, a solution to the following program:\n", 72 | "$$\n", 73 | " \\min \\|w\\|_{0} \\quad\\text{subject to}\\quad Xw = y\n", 74 | "$$\n", 75 | "The combinatorial nature of the above problem, however, presents computational challenges: the naive approach of enumerating all possible subsets of coordinates and trying to solve the above linear system using the selected variables would require exponential running time in\n", 76 | "the sparsity level $k$ and thus, is infeasible in practice.\n", 77 | "\n", 78 | "To circumvent computational issues, we will instead consider\n", 79 | "replacing the \"$\\ell_{0}$ norm\" with an $\\ell_{q}$ norm for the smallest $q > 0$ that yields a convex program. The smallest such $q$ is given by the choice $q=1$ and thus we will aim to solve the following program, called *basis pursuit*:\n", 80 | "$$\n", 81 | " \\min \\|w\\|_{1}\\quad\\text{subject to}\\quad Xw = y.\n", 82 | " \\tag{2}\n", 83 | "$$\n", 84 | "\n", 85 | "**While the above optimization problem is readily seen to be convex, it can be\n", 86 | "rephrased as a [linear program](https://en.wikipedia.org/wiki/Linear_programming)**.\n", 87 | "Such problems, written in a standardized form, can be expressed as\n", 88 | "\\begin{align*}\n", 89 | " &\\min_{x \\in \\mathbb{R}^{d}} \\langle c, x\\rangle \\\\\n", 90 | " &\\text{subject to}\\quad\n", 91 | " Ax = b\\quad\\text{and}\\quad\n", 92 | " Gx \\preccurlyeq h,\n", 93 | "\\end{align*}\n", 94 | "where the vectors $c,b,h$ and the matrices $A, G$ are arbitrary problem parameters, and the notation $\\preccurlyeq$ denotes a componentwise inequality. \n", 95 | "Linear programs can be solved in [weakly polynomial](https://en.wikipedia.org/wiki/Linear_programming#Open_problems_and_recent_work) time by general-purpose solvers." 
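Before turning to the exercise, here is a short numerical aside (a minimal sketch using only numpy; the variable names and problem sizes below are illustrative and not reused elsewhere in the notebook). It shows why a generic solution of the underdetermined system — here the minimum $\ell_{2}$-norm solution returned by the pseudo-inverse — fails to recover a sparse target, which is what motivates solving the $\ell_{1}$ program $(2)$ instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 200, 5
w_star = np.zeros(d)
w_star[:k] = 1.0                               # a k-sparse target vector
X = rng.standard_normal((n, d)) / np.sqrt(n)   # a random measurement matrix
y = X @ w_star

# Minimum l2-norm solution of the underdetermined system Xw = y.
w_l2 = np.linalg.pinv(X) @ y
print("non-zero entries of w*:        ", np.count_nonzero(w_star))
print("entries of w_l2 above 1e-6:    ", np.sum(np.abs(w_l2) > 1e-6))
print("||w_l2 - w*||_2^2:             ", np.sum((w_l2 - w_star) ** 2))
```

The minimum $\ell_{2}$-norm solution spreads its mass over essentially all $d$ coordinates, whereas the basis pursuit solver built in Exercise 1 below typically recovers $w^{\star}$ exactly in this regime.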
96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "id": "DctUonBGSAhg" 102 | }, 103 | "source": [ 104 | "### Exercise 1" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": { 110 | "id": "VKyBz0OcSCQ0" 111 | }, 112 | "source": [ 113 | "Show that the convex program $(2)$ can be expressed as a linear program.\n", 114 | "Use the [cvxopt](https://cvxopt.org/) package (imported in the below cell) to implement a solver for $(2)$ by completing the missing code two cells below. For the cvxopt package documentation concerning linear programs see http://cvxopt.org/userguide/coneprog.html#linear-programming." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "metadata": { 120 | "id": "FBoiBuX2UuXg" 121 | }, 122 | "source": [ 123 | "import numpy as np\n", 124 | "import cvxopt\n", 125 | "from matplotlib import pyplot as plt" 126 | ], 127 | "execution_count": null, 128 | "outputs": [] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "metadata": { 133 | "id": "THt6rBpuURyr" 134 | }, 135 | "source": [ 136 | "def compute_minimum_l1_norm_solution(X, y):\n", 137 | " \"\"\" :X: An n \\times d matrix.\n", 138 | " :y: An n dimensional array such that y = Xw* for some k-sparse vector w*.\n", 139 | " :returns: A vector w that solves the l1 minimization program (2).\n", 140 | " \"\"\"\n", 141 | " ##############################################################################\n", 142 | " # Exercise 1. Fill in the implementation of this function by using\n", 143 | " # cvxopt.solvers.lp function.\n", 144 | " \n", 145 | " ##############################################################################\n", 146 | "\n", 147 | "# The below code is designed to test your implementation of the above function.\n", 148 | "n = 100\n", 149 | "d = 1000\n", 150 | "k = 10\n", 151 | "w_star = np.zeros((d,1))\n", 152 | "w_star[:k,0] = 1\n", 153 | "# We will see why the below choice of the measurements matrix X works in\n", 154 | "# Exercise 2.\n", 155 | "X = n**(-1/2) * np.random.binomial(n=1, p=0.5, size=(n,d))*2 - 1.0\n", 156 | "y = X @ w_star\n", 157 | "w = compute_minimum_l1_norm_solution(X, y)\n", 158 | "# The below should print approximately 0.\n", 159 | "print(\"||w - w*||_{2}^{2} =\", np.sum((w.reshape(d, 1) - w_star)**2))" 160 | ], 161 | "execution_count": null, 162 | "outputs": [] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": { 167 | "id": "o4k1Q1AdSYAA" 168 | }, 169 | "source": [ 170 | "#### Solution" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": { 176 | "id": "SCwwXmLaSZ5m" 177 | }, 178 | "source": [ 179 | "We can reformulate $(2)$ via the following equivalent linear program, by intruducing an additional variable $t \\in \\mathbb{R}^{d}$:\n", 180 | "\\begin{align*}\n", 181 | " &\\min_{w,t \\in \\mathbb{R}^{d}} \\sum_{i=1}^{d} t_{i} \\\\\n", 182 | " &\\text{subject to}\\quad\n", 183 | " Xw = y\\quad\\text{and}\\quad\n", 184 | " w_{i} \\leq t_{i},\\, w_{i} \\geq -t_{i}\\text{ for }i = 1,\\dots d.\n", 185 | "\\end{align*}\n", 186 | "The above formulation can be passed to the cvxopt package as follows:\n", 187 | "```\n", 188 | " n = X.shape[0]\n", 189 | " d = X.shape[1]\n", 190 | " # Set up the linear programming variables. 
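  # The LP decision variable is the stacked vector [w; t] in R^{2d}:
  #   c = [0_d; 1_d], so that the objective <c, [w; t]> equals sum_i t_i;
  #   A = [X, 0] and b = y encode the equality constraint Xw = y;
  #   G and h encode the componentwise constraints w_i - t_i <= 0 and -w_i - t_i <= 0.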
\n", 191 | " c = np.concatenate((np.zeros(d), np.ones(d))).reshape(-1, 1)\n", 192 | " A = np.concatenate((X, np.zeros((n, d))), axis=1)\n", 193 | " h = np.zeros(2 * d).reshape(-1, 1)\n", 194 | " # We will use sparse matrix representation of G.\n", 195 | " # Note that cvxopt takes lists columns as arguments rather than lists of rows\n", 196 | " # as done by numpy.\n", 197 | " Id = cvxopt.spmatrix(1.0, range(d), range(d))\n", 198 | " G = cvxopt.sparse([[Id, -Id], [-Id, -Id]])\n", 199 | " # Convert c,A,b and h to cvxopt matrices.\n", 200 | " c = cvxopt.matrix(c)\n", 201 | " A = cvxopt.matrix(A)\n", 202 | " b = cvxopt.matrix(y.reshape((n,1)))\n", 203 | " h = cvxopt.matrix(h)\n", 204 | " # Solve the linear program.\n", 205 | " solution = cvxopt.solvers.lp(c, G, h, A, b)\n", 206 | " w = np.array(solution['x'][:d]).reshape(-1,1)\n", 207 | " return w\n", 208 | "```" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": { 214 | "id": "XtRDrJHOCZhS" 215 | }, 216 | "source": [ 217 | "## Restricted Nullspace and Restricted Isometry Properties" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": { 223 | "id": "g2T8PKxuZlxd" 224 | }, 225 | "source": [ 226 | "We now investigate the necessary and sufficient conditions under which the basis pursuit program $(2)$ succeeds to recover any $k$-sparse target vector exactly. **Ultimately, the success of the basis pursuit linear program will depend on the interplay between the nullspace of $X$ and the $\\ell_{1}$ geometry, in the precise sense explained below.** Before proceeding, we introduce some additional notation. Let $S \\subseteq \\{1, \\dots, d\\}$ denote some index set and let $S^{c} = \\{1, \\dots, d\\} \\backslash S$. Given a vector $w \\in \\mathbb{R}^{d}$ we write $w_{S} \\in \\mathbb{R}^{d}$ to denote a restriction of $w$ to the support set $S$ by setting the other coordinates to $0$. Hence, we have $w = w_{S} + w_{S^{c}}$ for any vector $w$ and any support set $S$.\n", 227 | "\n", 228 | "\n", 229 | "Let $\\hat{w}$ denote the output of $(2)$ and let $\\Delta = w^{\\star} - \\hat{w}$.\n", 230 | "Then, since $\\|\\hat{w}\\|_{1} \\leq \\|w^{\\star}\\|_{1}$ we can deduce that\n", 231 | "the $\\ell_{1}$ mass of $\\Delta_{S^{c}}$ is at most equal to the $\\ell_{1}$ mass of $\\Delta_{S}$:\n", 232 | "\\begin{align*}\n", 233 | " \\|\\Delta_{S^{c}}\\|_{1} \n", 234 | " &= \\|\\hat{w}_{S^{c}}\\|_{1} \\\\\n", 235 | " &\\leq \\|\\hat{w}_{S^{c}}\\|_{1} + \\underbrace{(\\|w^{\\star}\\|_{1} - \\|\\hat{w}\\|_{1})}_{\\geq 0 \\text{ by definition of } (2)} \\\\\n", 236 | " &= \\|w^{\\star}\\|_{1} - \\|\\hat{w}_{S}\\| \\\\\n", 237 | " &= \\|w^{\\star}_{S}\\|_{1} - \\|\\hat{w}_{S}\\| & \\text{since }w^{\\star}\\text{ is supported on }S \\\\\n", 238 | " &\\leq \\|w^{\\star}_{S} - \\hat{w}_{S}\\|_{1} \\\\\n", 239 | " &= \\|\\Delta_{S}\\|_{1}.\n", 240 | "\\end{align*}\n", 241 | "In particular, the residual vector $\\Delta$ belongs to the cone $\\mathcal{C}(S)$ defined as\n", 242 | "$$\n", 243 | " \\Delta \\in \\mathcal{C}(S) = \\{\\Delta \\in \\mathbb{R}^{d} : \\|\\Delta_{S^{c}}\\|_{1} \\leq \\|\\Delta_{S}\\|_{1}\\}.\n", 244 | "$$\n", 245 | "Notice that since $X\\hat{w} = y = Xw^{\\star}$, we also have $X \\Delta = 0$, \n", 246 | "that is, $\\Delta \\in \\mathrm{ker} X = \\{w \\in \\mathbb{R}^{d} : Xw = 0\\}$. 
It \n", 247 | "follows that $\\mathrm{ker} X \\cap \\mathcal{C}(S) = \\{0\\}$ is a *sufficient* \n", 248 | "condition to ensure that the basis pursuit program $(2)$ succeeds to output a vector $\\hat{w} = w^{\\star}$.\n", 249 | "\n", 250 | "In fact, the above condition is also *necessary*. To see that, suppose for a contradiction that\n", 251 | "$(2)$ succeeds to recover the correct solution for any underlying vector\n", 252 | "$w^{\\star}$ supported on $S$ despite the existence of some non-zero $y \\in \\mathrm{ker} K \\cap \\mathcal{C}(S)$.\n", 253 | "Set $w^{\\star} = y_{S}$ and note that $Xw^{\\star} = X(-y_{S^{c}})$. At the same time, we have $\\|y_{S^{c}}\\|_{1} \\leq \\|w^{\\star}\\|_{1}$, and thus the basis pursuit program $(2)$ can output $y_{S^{c}} \\neq w^{\\star}$, which contradicts the assumption that $(2)$ always succeeds to output the correct vector $w^{\\star}$.\n", 254 | "\n", 255 | "\n", 256 | "We can thus formulate a necessary and sufficient condition for the success of basis pursuit program for any target vector $w^{\\star}$ supported on $S$:" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": { 262 | "id": "SPDuPaz3CMhL" 263 | }, 264 | "source": [ 265 | "---\n", 266 | "\n", 267 | "**Restricted Nullspace Property (RNP)**\n", 268 | "\n", 269 | "A matrix $X \\in \\mathbb{R}^{n \\times d}$ satisfies the restricted nullspace property with respect to an index set $S \\subseteq \\{1, \\dots, d\\}$ if\n", 270 | "$\\mathrm{ker} X \\cap \\mathcal{C}(S) = \\{0\\}$, where\n", 271 | "$\n", 272 | " \\mathcal{C}(S) = \\{ \\Delta \\in \\mathbb{R}^{d} : \\|\\Delta_{S^{c}}\\|_{1} \\leq \\|\\Delta_{S}\\|_{1}\\}.\n", 273 | "$\n", 274 | "\n", 275 | "---" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": { 281 | "id": "qIIXZR1vCitc" 282 | }, 283 | "source": [ 284 | "Since we want the basis pursuit problem to succeed for any $k$-sparse vector $w^{\\star}$, we want the measurement matrix $X$ to satisfy the RNP uniformly over all support sets $S$ of size $k$. There is, unfortunately, no easy way to verify wheter a given matrix $X$ satisfies uniform RNP; **however, there exist sufficient conditions that imply the uniform RNP. On such condition, called *the restricted isometry property*, is stated below.**\n", 285 | "\n", 286 | "---\n", 287 | "\n", 288 | "**Restricted Isometry Property (RIP)**\n", 289 | "\n", 290 | "A matrix $X \\in \\mathbb{R}^{n \\times d}$ satisfies a $(k, \\delta)$-restricted isometry property if the following deterministic inequality holds for any k-sparse vector $w \\in \\mathbb{R}^{d}$:\n", 291 | "$$\n", 292 | " (1-\\delta)\\|w\\|_{2}^{2}\n", 293 | " \\leq \\|Xw\\|_{2}^{2}\n", 294 | " \\leq\n", 295 | " (1 + \\delta)\\|w\\|_{2}^{2}.\n", 296 | "$$\n", 297 | "\n", 298 | "---\n", 299 | "\n", 300 | "Let $X_{S} \\in \\mathbb{R}^{n \\times |S|}$ denote a subset of $X$ formed by only including the columns indexed by some support set $S \\subseteq \\{1, \\dots, n\\}$.\n", 301 | "Suppose that $w$ is a k-sparse vector with support $S$, that is, $w = w_{S}$.\n", 302 | "Let $\\tilde{w}_{S} \\in \\mathbb{R}^{|S|}$ denote the projection of $w_{S}$ onto the coordinates indexed by $S$ (dropping the other coordinates). 
Then, we have\n", 303 | "$$\n", 304 | " \\|Xw_{S}\\|_{2}^{2} = \\|X_{S}\\tilde{w}_{S}\\|_{2}^{2} = \\tilde{w}_{S}^{\\mathsf{T}}X_{S}^{\\mathsf{T}}X_{S}\\tilde{w}_{S}.\n", 305 | "$$\n", 306 | "It follows that the $(k, \\delta)$-RIP can be restated as:\n", 307 | "$$\n", 308 | " \\left\\|\n", 309 | " X_{S}^{\\mathsf{T}}X_{S} - I\n", 310 | " \\right\\|_{\\mathrm{op}} \\leq \\delta\n", 311 | " \\text{ for any }S \\subseteq\\{1, \\dots, d\\}\\text{ such that }|S| \\leq k,\n", 312 | "$$\n", 313 | "where $\\|A\\|_{\\mathrm{op}} = \\sup_{\\|u\\|_{2} \\leq 1} \\|Au\\|_{2}$ denotes the\n", 314 | "$\\ell_{2} \\to \\ell_{2}$ operator norm.\n", 315 | "**In words, the RIP property states that for any $S \\subseteq \\{1, \\dots, d\\}$ of size $k$, the matrix $X_{S}^{\\mathsf{T}}X_{S}$ is approximately equal to the identity matrix.** We will now prove that the RIP indeed implies the RNP.\n", 316 | "\n", 317 | "---\n", 318 | "\n", 319 | "**Theorem**\n", 320 | "\n", 321 | "Suppose that a matrix $X \\in \\mathbb{R}^{n \\times d}$ satisfies $(2k, \\delta)$-RIP with $\\delta \\in (0, \\frac{1}{3})$.\n", 322 | "Then, $X$ satisfies the RNP for any index set $S$ of size at most $k$.\n", 323 | "\n", 324 | "\n", 325 | "> **Proof**\n", 326 | ">\n", 327 | "> - We will write $\\delta$ for the $(2k,\\delta)$-RIP constant of $X$ and deduce in the end that $\\delta = 1/3$ suffices.\n", 328 | "> - Fix any $\\Delta \\in \\mathrm{ker} X$ such that $\\Delta \\neq 0$. Let $S_{0}$ denote a subset of $\\{1, \\dots, d\\}$ of size $k$ where $\\Delta$ has largest coordinates in absolute value. To prove the above theorem it suffices to show that $\\|\\Delta_{S_{0}^{c}}\\|_{1} > \\|\\Delta_{S_{0}}\\|_{1}$.\n", 329 | "> - Decompose $\\Delta_{S_{0}^{c}}$ into $\\sum_{i \\geq 1} \\Delta_{S_{i}}$, where $S_{1}$ indicates a subset of $S_{0}^{c}$ with largest absolute values of $\\Delta_{S^{c}}$, $S_{2}$ indicates a subset of $S_{0}^{c} \\backslash S_{1}$ with largest absolute values of $\\Delta$, etc. All subsets $S_{i}$, are of size $|S| = k$, possibly except for the last subset which may contain fewer indices.\n", 330 | "> - By definition of the subsets $S_{0}, S_{1}, S_{2}, \\dots$, it follows that \n", 331 | "for any $i \\geq 1$ we have $\\|\\Delta_{S_{i}}\\|_{2} \\leq \\sqrt{s}\\|\\Delta_{S_{i}}\\|_{\\infty} \\leq s^{-1/2}\\|\\Delta_{S_{i-1}}\\|_{1}$. It follows that \n", 332 | "$$\n", 333 | " \\sum_{i \\geq 1} \\|\\Delta_{S_{j}}\\|_{2}\n", 334 | " \\leq s^{-1/2}(\\|\\Delta_{S_{0}}\\|_{1} + \\|\\Delta_{S_{0}^{c}}\\|_{1}).\n", 335 | "$$\n", 336 | "> - Since $\\Delta \\in \\mathrm{ker} X$, we have\n", 337 | "$X\\Delta_{S_{0}} = -X\\Delta_{S_{0}^{c}} = -X\\sum_{i \\geq 1}\\Delta_{S_{i}}$.\n", 338 | "Applying the RIP assumption, it follows that\n", 339 | "$$\n", 340 | " (1-\\delta)\\|\\Delta_{S_{0}}\\|_{2}^{2} \\leq \\|X\\Delta_{S_{0}}\\|_{2}^{2}\n", 341 | " = \\langle X \\Delta_{S_{0}}, -X\\sum_{i \\geq 1}\\Delta_{S_{i}}\\rangle\n", 342 | " \\leq \\left|\n", 343 | " \\sum_{i \\geq 1}\\langle X \\Delta_{S_{0}}, X \\Delta_{S_{i}}\\rangle\n", 344 | " \\right|\n", 345 | "$$\n", 346 | "Now, noting that for any $i \\geq 1$, $S_{0}$ and $S_{i}$ are disjoint. 
Hence, $\\langle \\Delta_{S_{0}}, \\Delta_{S_{i}} \\rangle = 0$ and we can write \n", 347 | " $$\n", 348 | " | \\langle X \\Delta_{S_{0}}, X \\Delta_{S_{i}} \\rangle |\n", 349 | " =\n", 350 | " | \\langle X \\Delta_{S_{0}}, X \\Delta_{S_{i}} \\rangle \n", 351 | " - \\langle \\Delta_{S_{0}}, \\Delta_{S_{i}} \\rangle |\n", 352 | " =\n", 353 | " | \\langle \\Delta_{S_{0}},\n", 354 | " (X^{\\mathsf{T}}_{S_{0} \\cup S_{i}}X_{S_{0} \\cup S_{i}} - I) \\Delta_{S_{i}} \\rangle \n", 355 | " \\leq \\delta \\|\\Delta_{S_{0}}\\|_{2}\\|\\Delta_{S_{i}}\\|_{2}.\n", 356 | " $$\n", 357 | " Combining the above two inequalities, it follows that\n", 358 | " $$\n", 359 | " \\|\\Delta_{S_{0}}\\|_{2} \\leq\n", 360 | " \\frac{\\delta}{1-\\delta}\\sum_{i\\geq 1}\\|\\Delta_{S_{i}}\\|_{2}.\n", 361 | " $$\n", 362 | "> - Putting everything together yields\n", 363 | "$$\n", 364 | " \\|\\Delta_{S_{0}}\\|_{1}\n", 365 | " \\leq \\sqrt{s}\\|\\Delta_{S_{0}}\\|_{2}\n", 366 | " \\leq \n", 367 | " \\sqrt{s}\\frac{\\delta}{1-\\delta}\\sum_{i\\geq 1}\\|\\Delta_{S_{i}}\\|_{2}\n", 368 | " \\leq \\frac{\\delta}{1- \\delta}(\\|\\Delta_{S_{0}}\\|_{1} + \\|\\Delta_{S_{0}^{c}}\\|_{1}).\n", 369 | "$$\n", 370 | "Rearranging, we obtain\n", 371 | "$$\n", 372 | " \\|\\Delta_{S_{0}}\\|_{1} \\leq \\frac{\\delta}{1 - 2\\delta}\\|\\Delta_{S_{0}^{c}}\\|_{1}.\n", 373 | "$$\n", 374 | "Since for any $\\delta \\in (0,1/3)$ we have $0 < \\frac{\\delta}{1-2\\delta} < 1$,\n", 375 | "our proof is complete. \n", 376 | "\n", 377 | "---\n", 378 | "\n", 379 | "In the next exercise, we show that random matrices with i.i.d. sub-Gaussian entries satisfy RIP with high probability." 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": { 385 | "id": "D0qf8vbsu0Ip" 386 | }, 387 | "source": [ 388 | "### Exercise 2" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "id": "btm-BLoMqbwN" 395 | }, 396 | "source": [ 397 | "Let $X$ denote an $n \\times m$ matrix such that $n \\geq m$ and the $i,j$-th entry $X_{i,j}$ is sampled i.i.d. from a zero-mean $1$-subGaussian distribution (recall that $Y \\sim P$ is $1$-subGaussian if $\\mathbf{E}[e^{\\lambda Y}] \\leq e^{\\lambda^{2}/2}$). Using results from random matrix theory, it can be shown that\n", 398 | "for any $\\varepsilon \\in (0,1)$ we have\n", 399 | "$$\n", 400 | " \\mathbb{P}\\left( \\left\\|\\frac{1}{n}X^{\\mathsf{T}}X - \\mathbf{E}\\left[\\frac{1}{n}X^{\\mathsf{T}}X\\right]\\right\\|_{\\mathrm{op}} \\geq c_{1}\\sqrt{\\frac{m}{n}} + \\varepsilon \\right) \\leq \\exp(-c_{3} n \\varepsilon^{2}),\n", 401 | "$$\n", 402 | "where $c_{1}, c_{2}$ and $c_{3}$ are absolute constants.\n", 403 | "For example, see Theorem 6.5 in the textbook by *Wainwright [2019]* for a more general statement and the proof of the above claim.\n", 404 | "\n", 405 | "Suggest a way to obtain an $n \\times d$ matrix (with $n \\ll d$) that satisfies $(2k, 1/3)$-RIP with high probability. Prove that it suffices to take $n = c'k\\log(d/k)$ for some absolute constant $c'$. **In particular, n needs to grow only linearly with respect to the sparsity parameter $k$ and logarithmically with the dimension $d$.**" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": { 411 | "id": "PSqSQBgk_NFG" 412 | }, 413 | "source": [ 414 | "#### Solution" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": { 420 | "id": "F-zMRTGt_Ohj" 421 | }, 422 | "source": [ 423 | "We sample i.i.d. 
entries $X_{i,j}$ from some zero-mean $1$-sub-Gaussian distribution with variance equal to $1$ (among other examples, such conditions are satisfied by i.i.d. standard normal or Rademacher random variables). We will apply the given concentration result for all sub-matrices $X_{S}$ with $S \\subseteq \\{1, \\dots, d\\}$ and $|S| \\leq 2k$ and conclude via the union bound that $\\frac{1}{\\sqrt{n}}X$ satisfies $(2k, \\delta)-RIP$ with high probability.\n", 424 | "\n", 425 | "Note that for any $S$ we have $\\mathbf{E}[X_{S}^{\\mathsf{T}}X_{S}/n] = I$. Thus, for any fixed $S$ we have\n", 426 | "$$\n", 427 | " \\mathbb{P}\\left( \\left\\|\\frac{1}{n}X_{S}^{\\mathsf{T}}X_{S} - I\\right\\|_{\\mathrm{op}} \\geq c_{1}\\sqrt{\\frac{2k}{n}} + \\varepsilon \\right) \\leq \\exp(-c_{3} n \\varepsilon^{2}),\n", 428 | "$$\n", 429 | "For $n \\geq 72c_{1}^{2}k$ we have $c_{1}\\sqrt{\\frac{2k}{n}} \\leq \\frac{1}{6}$.\n", 430 | "Setting $\\varepsilon = \\frac{1}{6}$, $c_{4} = 72c_{1}^{2}$ and $c_{5} = c_{3}/36$ we have\n", 431 | "$$\n", 432 | " \\mathbb{P}\\left( \\left\\|\\frac{1}{n}X_{S}^{\\mathsf{T}}X_{S} - I\\right\\|_{\\mathrm{op}} \\geq \\frac{1}{3} \\right) \\leq \\exp(-c_{5} n),\n", 433 | "$$\n", 434 | "Note that $(2k, \\delta)$-RIP fails to hold for $\\frac{1}{\\sqrt{n}}X$ if and only if there exists some subset $S$ with $|S| \\leq 2k$ such that the above event of probability $\\exp(-c_{5}n)$ happens. Since the number of possible subsets $S$ can be upper bounded as $\\sum_{i=1}^{2k} \\binom{d}{i} \\leq (\\frac{ed}{2k})^{2k}$, by the union bound we have\n", 435 | "$$\n", 436 | " \\mathbb{P}\\left(\\frac{1}{\\sqrt{n}}X \\text{ fails to satisfy $(2k, 1/3)$-RIP}\\right)\n", 437 | " \\leq \\left(\\frac{ed}{2k}\\right)^{2k}\\exp(-c_{5}n).\n", 438 | "$$\n", 439 | "Thus, setting $n = \\frac{1}{c_5}\\log\\left(\\frac{ed}{2k}\\right)^{2k} + \\frac{1}{c5}\\log\\left(\\frac{1}{\\delta}\\right) \\sim k \\log(d/k) + \\log(1/\\delta)$ suffices to ensure that $\\frac{1}{\\sqrt{n}}X$ satisfies $(2k,\\delta)$-RIP with probability at least $1-\\delta$." 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": { 445 | "id": "z3rw-UcNBTiU" 446 | }, 447 | "source": [ 448 | "## Single-Pixel Camera" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": { 454 | "id": "n73R7BLbzl4e" 455 | }, 456 | "source": [ 457 | "Many real-world signals are structured and highly compressible, meaning they have approximately sparse representations in some appropriately chosen basis.\n", 458 | "As a typical example, suppose that our signal $w^{\\star} \\in \\mathbb{R}^{d}$ represents an image, where different values of $w^{\\star}_{i}$ represent the colour intensities of the $i$-th pixel. While real-world images represented by pixel intensities are not sparse in the standard basis, they contain redundant patterns of information that can be efficiently compressed via an appropriate change of basis $\\alpha^{\\star} = \\Phi w^{\\star}$, where $\\Phi \\in \\mathbb{R}^{d \\times d}$ is an orthonormal change-of-basis matrix, and $\\alpha^{\\star}$ is an approximately sparse vector.\n", 459 | "\n", 460 | "Suppose that we take linear measurements $y_{i} = \\langle x_{i}, w^{\\star}\\rangle$ of the signal $w^{\\star}$. Since $w^{\\star}$ is not sparse in the standard basis, solving\n", 461 | "$$\n", 462 | " \\min \\|w\\|_{1} \\text{ subject to } Xw = y\n", 463 | "$$\n", 464 | "will not yield a desirable solution. 
Instead, note that we can reformulate the above in the transformed coordinate system $\\alpha = \\Phi w$ as\n", 465 | "$$\n", 466 | " \\min \\|\\alpha\\|_{1}\n", 467 | " \\text{ subject to } X\\Phi\\alpha = y = X\\Phi\\alpha^{\\star}.\n", 468 | "$$\n", 469 | "If $\\alpha^{\\star}$ is indeed sparse and if $X\\Phi$ satisfies the restricted nullspace property, then we are guaranteed to recover the correct solution $\\alpha^{\\star}$ via the basis pursuit linear program. The image can then be transformed to the standard basis by taking $w^{\\star} = \\Phi^{\\mathsf{T}}\\alpha^{\\star}$. Although we have not proved this in exercise 2, given an orthonormal matrix $\\Phi$ and a random ensemble $X$ consisting of i.i.d. zero-mean $1$-sub-Gaussian random variables (e.g., Rademacher random variables), the matrix $\\frac{1}{\\sqrt{n}}X\\Phi$ indeed satisfies RIP with high probability. See the discussion section in [a paper by Baraniuk, Davenport, DeVore and Wakin](https://users.math.msu.edu/users/iwenmark/Teaching/MTH995/Papers/JL_RIP_Proof.pdf) for further details." 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": { 475 | "id": "4bDJLMNcA5QN" 476 | }, 477 | "source": [ 478 | "What we have described above forms the conceptual basis for an emerging technology of single-pixel cameras, [introduced by a team of researchers from Rice University](https://scholarship.rice.edu/bitstream/handle/1911/21682/csCamera-SPMag-web.pdf;jsessionid=AA8AD0D7D0FB3FB37624B270D3495D8B?sequence=1), which we are going to walk through in the remainder of this section. **The idea of single-pixel cameras is to combine sampling and compression into a single step by directly trying to sense the sparsified signal $\\alpha^{\\star}$, without first trying to recover the signal $w^{\\star}$ in the standard basis.** This should be contrasted with standard camera architectures, where the signal is first acquired in the standard basis, and only later it is compressed for storage or transmission purposes.\n", 479 | "\n", 480 | "Informally, the hardware implementation of a single-pixel camera consists of two lenses, an array of $d$ micro-mirrors, a single photon detector and an analog-to-digital signal converter. The incoming light is first oriented via the first lens onto an array of micro-mirrors. Each mirror can reflect light in one of two possible directions, depending on the orientation of the mirror, which changes between different measurements (thus implementing the Rademacher ensemble for the measurements matrix $X$). The reflected light from the mirrors is then focused by the second lens onto a photon detector that computes the measurement $y_{i} = \\langle w^{\\star}, x_{i}\\rangle$. The measurements are then passed to a digital computer via an analog-to-digital converter component of the camera. Among the benefits offered by single-pixel cameras are reduced costs due to the single photon detector design (especially concerning applications going beyond the scope of consumer photography)\n", 481 | "or reduced sampling time, in comparison to classical multiplexed architectures that try to acquire the signal in the standard basis and hence need to take $d \\gg c k \\log(d/k)$ measurements." 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": { 487 | "id": "uSMpiJ9MKP9g" 488 | }, 489 | "source": [ 490 | "We now turn to illustrating the above-outlined ideas via simulations. First, we will load an image that we will use in our simulations." 
491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "metadata": { 496 | "id": "C_h9zAMqz0nm" 497 | }, 498 | "source": [ 499 | "from skimage import data\n", 500 | "\n", 501 | "cameraman = np.array(data.camera(), dtype=np.float32)\n", 502 | "cameraman -= 128 # Normalize the pixel values.\n", 503 | "plt.figure(figsize=(6, 6))\n", 504 | "plt.imshow(cameraman, cmap='gray', interpolation='nearest')\n", 505 | "plt.axis('off')" 506 | ], 507 | "execution_count": null, 508 | "outputs": [] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": { 513 | "id": "bs-rWCInKsed" 514 | }, 515 | "source": [ 516 | "To perform the change of basis $\\alpha = \\Phi w$, we will use a [two-dimensional discrete cosine transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform#M-D_DCT-II) on non-overlapping $8 \\times 8$ blocks of the target image. An implementation of a one-dimensional discrete cosine transform is provided by [scipy.fft](https://docs.scipy.org/doc/scipy/reference/tutorial/fft.html) package. In the below code, we implement two-dimensional discrete cosine transforms and display the $64$ basis vectors as $8\\times 8$ images." 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "metadata": { 522 | "id": "OO8BLApowDYs" 523 | }, 524 | "source": [ 525 | "from scipy.fft import dct, idct\n", 526 | "\n", 527 | "def dct2(block):\n", 528 | " return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')\n", 529 | "\n", 530 | "def idct2(block):\n", 531 | " return idct(idct(block, axis=1, norm='ortho'), axis=0, norm='ortho')" 532 | ], 533 | "execution_count": null, 534 | "outputs": [] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "metadata": { 539 | "id": "Ijh5e8qc3Xow" 540 | }, 541 | "source": [ 542 | "I = np.identity(64)\n", 543 | "fig, ax = plt.subplots(8, 8)\n", 544 | "fig.set_size_inches(8,8)\n", 545 | "Phi = np.zeros((64, 64)) # A linear map implemented by dct2.\n", 546 | "for i in range(8):\n", 547 | " for j in range(8):\n", 548 | " block = I[:,8*i+j].reshape(8, 8)\n", 549 | " Phi[:,8*i+j] = idct2(block).reshape(-1)\n", 550 | " ax[i,j].imshow(Phi[:, 8*i+j].reshape(8,8), cmap='gray')\n", 551 | " ax[i,j].axis('off')\n", 552 | "\n", 553 | "print(\"Is Phi orthonormal:\", np.allclose(Phi.T @ Phi, np.identity(64)) \\\n", 554 | " and np.allclose(Phi @ Phi.T, np.identity(64)))" 555 | ], 556 | "execution_count": null, 557 | "outputs": [] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": { 562 | "id": "YXrZ_gq1M-YU" 563 | }, 564 | "source": [ 565 | "Notice that the displayed basis vectors increase horizontal frequencies as we move to the right in the horizontal direction. Likewise, it increases the vertical frequencies as we move in the vertical direction downwards. Thus, the upper-left basis vectors represent low-frequency parts of the image, while the lower-right basis vectors would be used to represent high-frequency components of the image. **As the real-world images are structured, we expect that most parts of the image can be expressed by primarily relying on the low-frequency basis vectors, thus resulting in sparse representations.** In contrast, note that we would not expect an unstructured image comprised of completely random pixels to admit a sparse representation in the above basis." 
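As a quick numerical check of this intuition (an illustrative aside; the particular block location is an arbitrary choice), the snippet below compares how strongly the DCT concentrates the energy of an $8 \times 8$ block taken from the cameraman image with that of an $8 \times 8$ block of uniform noise spanning a comparable range of values.

```python
rng = np.random.default_rng(0)
natural_block = cameraman[200:208, 200:208]   # an arbitrary 8x8 block of the image
noise_block = rng.uniform(-128, 128, size=(8, 8)).astype(np.float32)

for name, block in [("natural block", natural_block), ("noise block", noise_block)]:
    coefficients = np.sort(np.abs(dct2(block)).flatten())[::-1]
    top8_energy = np.sum(coefficients[:8] ** 2) / np.sum(coefficients ** 2)
    print(f"{name}: fraction of energy in the 8 largest DCT coefficients = {top8_energy:.3f}")
```

The natural block typically places almost all of its energy in a handful of low-frequency coefficients, whereas the noise block spreads it nearly uniformly over all $64$ coefficients.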
566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": { 571 | "id": "vHb672QXOxyK" 572 | }, 573 | "source": [ 574 | "We now implement code that slides along non-overlapping $8 \\times 8$ blocks of a target image and applies a discrete cosine transform to each of the blocks. We observe that the transformed cameraman image admits an approximately sparse representation in the transformed coordinate system." 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "metadata": { 580 | "id": "caHTCzJlOI-q" 581 | }, 582 | "source": [ 583 | "def transform_image(image, transformation):\n", 584 | " # For simplicity assume that the image dimensions are divisible by 8.\n", 585 | " assert image.shape[0] % 8 == 0\n", 586 | " assert image.shape[1] % 8 == 0\n", 587 | " assert len(image.shape) == 2\n", 588 | "\n", 589 | " transformed_image = np.zeros_like(image, dtype=np.float32)\n", 590 | " for i in range(image.shape[0]//8):\n", 591 | " for j in range(image.shape[1]//8):\n", 592 | " block = image[8*i:8*(i+1), 8*j:8*(j+1)] \n", 593 | " transformed_block = transformation(block)\n", 594 | " transformed_image[8*i:8*(i+1), 8*j:8*(j+1)] = transformed_block\n", 595 | " return transformed_image\n" 596 | ], 597 | "execution_count": null, 598 | "outputs": [] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "metadata": { 603 | "id": "Ovm6cevJhKFl" 604 | }, 605 | "source": [ 606 | "print(\"Original image:\")\n", 607 | "plt.figure(figsize=(6,6))\n", 608 | "plt.imshow(cameraman, cmap='gray')\n", 609 | "plt.show()\n", 610 | "\n", 611 | "print(\"Absolue values of coefficients of transformed image:\")\n", 612 | "plt.figure(figsize=(6,6))\n", 613 | "encoded_cameraman = transform_image(cameraman, dct2)\n", 614 | "sorted_coefficients = np.sort(np.abs(encoded_cameraman.flatten()))\n", 615 | "# Set maximum size to display as white color.\n", 616 | "vmax = sorted_coefficients[-int(len(sorted_coefficients)*0.075)]\n", 617 | "plt.imshow(np.abs(encoded_cameraman), cmap='gray', vmax = vmax, vmin = 0)\n", 618 | "plt.show()\n", 619 | "\n", 620 | "print(\"Reconstructed image:\")\n", 621 | "plt.figure(figsize=(6,6))\n", 622 | "plt.imshow(transform_image(encoded_cameraman, idct2), cmap='gray')\n", 623 | "plt.show()" 624 | ], 625 | "execution_count": null, 626 | "outputs": [] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": { 631 | "id": "ohKu5M8eTYhu" 632 | }, 633 | "source": [ 634 | "A typical compression scheme would *quantize* the `encoded_cameraman` variable by setting small coefficients to $0$ (there are smarter ways to do it by also taking into account whether the given coefficient represents a high-frequency or low-frequency basis vector; we may want to be more inclined to quantize high-frequency components. See the wikipedia page on [JPEG compression](https://en.wikipedia.org/wiki/JPEG#JPEG_compression)). Let us now try a simple quantizing scheme that simply keeps a given fraction `p` of coefficients largest in absolute value." 
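To put the potential saving in perspective: the cameraman image has $512 \times 512 = 262{,}144$ pixels, so keeping a fraction $p = 0.025$ of the transform coefficients (the value used a couple of cells below) means storing roughly $0.025 \times 262{,}144 \approx 6{,}554$ coefficients together with their locations, rather than all $262{,}144$ pixel values.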
635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "metadata": { 640 | "id": "erSMRGxFUEhl" 641 | }, 642 | "source": [ 643 | "def quantize_coefficients(transformed_image, p):\n", 644 | " \"\"\" :transformed_image: An image represented in the transformed basis.\n", 645 | " :p: A fraction of largest coefficients to keep.\n", 646 | " :returns: A quantized image, retaining p-fraction of the largest\n", 647 | " coefficients in absolute value.\n", 648 | " \"\"\"\n", 649 | " sorted = np.sort(np.abs(transformed_image).flatten())\n", 650 | " threshold = sorted[-int(len(sorted)*p)]\n", 651 | " quantized_image = np.copy(transformed_image)\n", 652 | " quantized_image[np.abs(quantized_image) < threshold] = 0\n", 653 | " return quantized_image" 654 | ], 655 | "execution_count": null, 656 | "outputs": [] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "metadata": { 661 | "id": "DYGli5aSVCHB" 662 | }, 663 | "source": [ 664 | "print(\"Original image:\")\n", 665 | "plt.figure(figsize=(6,6))\n", 666 | "plt.imshow(cameraman, cmap='gray')\n", 667 | "plt.show()\n", 668 | "\n", 669 | "print(\"Absolue values of coefficients of transformed image:\")\n", 670 | "plt.figure(figsize=(6,6))\n", 671 | "encoded_cameraman = transform_image(cameraman, dct2)\n", 672 | "sorted_coefficients = np.sort(np.abs(encoded_cameraman.flatten()))\n", 673 | "# Set maximum size to display as white color.\n", 674 | "vmax = sorted_coefficients[-int(len(sorted_coefficients)*0.075)]\n", 675 | "plt.imshow(np.abs(encoded_cameraman), cmap='gray', vmax = vmax, vmin = 0)\n", 676 | "plt.show()\n", 677 | "\n", 678 | "p = 0.025 # The fraction of coefficients to keep.\n", 679 | "print(\"Quantized image at level p=\",p)\n", 680 | "quantized_image = quantize_coefficients(encoded_cameraman, p)\n", 681 | "plt.figure(figsize=(6,6))\n", 682 | "plt.imshow(np.abs(quantized_image), cmap='gray', vmax=vmax, vmin=0)\n", 683 | "plt.show()\n", 684 | "\n", 685 | "print(\"Reconstruction from the quantized image:\")\n", 686 | "plt.figure(figsize=(6,6))\n", 687 | "reconstructed_image = transform_image(quantized_image, idct2)\n", 688 | "plt.imshow(reconstructed_image, cmap='gray')\n", 689 | "plt.show()\n", 690 | "\n", 691 | "print(\"Average l_2 squared reconstruction error:\",\n", 692 | " np.average((cameraman.flatten() - reconstructed_image.flatten())**2))" 693 | ], 694 | "execution_count": null, 695 | "outputs": [] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": { 700 | "id": "Py8glXkUW8sD" 701 | }, 702 | "source": [ 703 | "In the next exercise, you are asked to implement a single-pixel camera." 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": { 709 | "id": "AcPxogAdXDLJ" 710 | }, 711 | "source": [ 712 | "### Exercise 3" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": { 718 | "id": "vzz7dCVBXE6r" 719 | }, 720 | "source": [ 721 | "- Complete the single-pixel camera implementation in the below cell.\n", 722 | "- How many measurements need to be taken to recover a \"visually acceptable\" `cameraman` image? How does the number of measurements compare with the number of pixels in the image?" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "metadata": { 728 | "id": "Cxkkkr-fXJKj" 729 | }, 730 | "source": [ 731 | "class SinglePixelCamera(object):\n", 732 | " \"\"\" An implementation of a single pixed camera. \"\"\"\n", 733 | "\n", 734 | "\n", 735 | " def __init__(self, d1, d2):\n", 736 | " \"\"\" :(d1, d2): Shape of the image. 
\"\"\"\n", 737 | " self.d1 = d1\n", 738 | " self.d2 = d2\n", 739 | "\n", 740 | "\n", 741 | " def get_mirror_positions(self, n_measurements):\n", 742 | " \"\"\" Returns a numpy array of shape (n_measuremes, self.d1 * self.d2), whose\n", 743 | " i-th row stores {-1, +1}-valued mirror positions to be used for the\n", 744 | " i-th linear measurement.\n", 745 | " \"\"\"\n", 746 | " ############################################################################\n", 747 | " # Complete the below implementation.\n", 748 | "\n", 749 | " ############################################################################\n", 750 | "\n", 751 | "\n", 752 | " def take_measurements(self, X, w_star):\n", 753 | " \"\"\" :X: A measurement matrix of shape (n, d1*d2).\n", 754 | " :w_star: The signal vector (represented as a 2d-array) that we are\n", 755 | " trying to measure.\n", 756 | " :returns: A vector of linear measurements of (an appropriately\n", 757 | " flattened) image w_star.\n", 758 | " \"\"\"\n", 759 | " ############################################################################\n", 760 | " # Complete the below implementation.\n", 761 | "\n", 762 | " ############################################################################\n", 763 | "\n", 764 | "\n", 765 | " def reconstruct_signal(self, X, y):\n", 766 | " \"\"\" :X: An n \\times d measurements matrix.\n", 767 | " :y: An n \\times 1 vector of observations.\n", 768 | " :returns: A vector w that should approximately recover the true signal\n", 769 | " w_star.\n", 770 | " \"\"\"\n", 771 | " ############################################################################\n", 772 | " # Complete the below implementation.\n", 773 | "\n", 774 | " ############################################################################\n", 775 | "\n", 776 | "\n", 777 | " def take_picture(self, n_measurements, w_star):\n", 778 | " \"\"\" :n_measurements: The number of measurements y_i = to take.\n", 779 | " :w_star: The signal vector passed as a two-dimensional image.\n", 780 | " :returns: A picture represented in the standard basis, returned as a\n", 781 | " two-dimensional array.\n", 782 | " \"\"\"\n", 783 | " X = self.get_mirror_positions(n_measurements)\n", 784 | " y = self.take_measurements(X, w_star)\n", 785 | " return self.reconstruct_signal(X, y)\n" 786 | ], 787 | "execution_count": null, 788 | "outputs": [] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "metadata": { 793 | "id": "KWKbiDh-N46V" 794 | }, 795 | "source": [ 796 | "%%time\n", 797 | "\n", 798 | "# Let us test our implementation.\n", 799 | "# The cameraman image that we have used below is too big for the purpose of \n", 800 | "# testing out the implementation of single-pixel camera.\n", 801 | "# Instead, we will switch to a data.microaneurysms() image (which is approximately\n", 802 | "# 25 times smaller in size).\n", 803 | "#\n", 804 | "# WARNING: this can take few couple minutes to execute even for this relatively\n", 805 | "# small problem. The reason for slow reconstruction is because we are using a\n", 806 | "# general-purpose solver for our l1 minimization problem. 
Specialized solvers\n", 807 | "# can achieve much faster performance.\n", 808 | "\n", 809 | "test_image = np.array(data.microaneurysms(), dtype=np.float32)\n", 810 | "test_image = test_image[:96, :96] # Make dimensions divisible by 8.\n", 811 | "plt.imshow(test_image, cmap='gray')\n", 812 | "\n", 813 | "height = test_image.shape[0]\n", 814 | "width = test_image.shape[1]\n", 815 | "camera = SinglePixelCamera(height, width)\n", 816 | "print(\"Original image:\")\n", 817 | "plt.imshow(test_image, cmap='gray')\n", 818 | "plt.show()\n", 819 | "\n", 820 | "print(\"Number of pixels:\", height*width)\n", 821 | "n_measurements = (height*width)//4\n", 822 | "print(\"Number of measurements:\", n_measurements)\n", 823 | "print(\"Reconstructed image:\")\n", 824 | "picture = camera.take_picture(n_measurements, test_image)\n", 825 | "plt.imshow(picture, cmap='gray')\n", 826 | "print(\"Average l2 reconstruction error:\",\n", 827 | " np.average((test_image - picture)**2))" 828 | ], 829 | "execution_count": null, 830 | "outputs": [] 831 | }, 832 | { 833 | "cell_type": "markdown", 834 | "metadata": { 835 | "id": "CrGNybLwQ6om" 836 | }, 837 | "source": [ 838 | "#### Solution" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": { 844 | "id": "Opa5gjzmQ8nb" 845 | }, 846 | "source": [ 847 | "A possible implementation is provided below.\n", 848 | "\n", 849 | "```\n", 850 | "class SinglePixelCamera(object):\n", 851 | " \"\"\" An implementation of a single pixed camera. \"\"\"\n", 852 | "\n", 853 | "\n", 854 | " def __init__(self, d1, d2):\n", 855 | " \"\"\" :(d1, d2): Shape of the image. \"\"\"\n", 856 | " self.d1 = d1\n", 857 | " self.d2 = d2\n", 858 | "\n", 859 | "\n", 860 | " def get_mirror_positions(self, n_measurements):\n", 861 | " \"\"\" Returns a numpy array of shape (n_measuremes, self.d1 * self.d2), whose\n", 862 | " i-th row stores {-1, +1}-valued mirror positions to be used for the\n", 863 | " i-th linear measurement.\n", 864 | " \"\"\"\n", 865 | " ############################################################################\n", 866 | " # Complete the below implementation.\n", 867 | " # Sample mirror positions as i.i.d. 
Rademacher random variables.\n", 868 | " return np.random.binomial(n=1, p=0.5,\n", 869 | " size=(n_measurements, self.d1*self.d2))*2.0 - 1.0\n", 870 | " ############################################################################\n", 871 | "\n", 872 | "\n", 873 | " def take_measurements(self, X, w_star):\n", 874 | " \"\"\" :X: A measurement matrix of shape (n, d1*d2).\n", 875 | " :w_star: The signal vector (represented as a 2d-array) that we are\n", 876 | " trying to measure.\n", 877 | " :returns: A vector of linear measurements of (an appropriately\n", 878 | " flattened) image w_star.\n", 879 | " \"\"\"\n", 880 | " ############################################################################\n", 881 | " # Complete the below implementation.\n", 882 | "\n", 883 | " # First, we flatten the image array w_star by flattening each 8 x 8 block.\n", 884 | " print(\"Started taking measurements\")\n", 885 | " flattened_image = np.zeros(self.d1 * self.d2)\n", 886 | " offset = 0\n", 887 | " for i in range(self.d1//8):\n", 888 | " for j in range(self.d2//8):\n", 889 | " block = w_star[8*i:8*(i+1), 8*j:8*(j+1)] \n", 890 | " flattened_image[offset:offset+64] = block.flatten()\n", 891 | " offset += 64\n", 892 | "\n", 893 | " # We can now compute the linear measurements.\n", 894 | " flattened_image.reshape(-1,1) \n", 895 | " y = X @ flattened_image\n", 896 | " return y\n", 897 | " ############################################################################\n", 898 | "\n", 899 | "\n", 900 | " def reconstruct_signal(self, X, y):\n", 901 | " \"\"\" :X: An n \\times d measurements matrix.\n", 902 | " :y: An n \\times 1 vector of observations.\n", 903 | " :returns: A vector w that should approximately recover the true signal\n", 904 | " w_star.\n", 905 | " \"\"\"\n", 906 | " ############################################################################\n", 907 | " # Complete the below implementation.\n", 908 | "\n", 909 | " # Transform X to X @ Phi (note that Phi=block-diagonal[Phi, Phi, ..., Phi]).\n", 910 | " blocks = []\n", 911 | " for block_id in range(X.shape[1]//64):\n", 912 | " blocks.append(X[:,block_id*64:(block_id+1)*64] @ Phi)\n", 913 | " transformed_X = np.concatenate(blocks, axis=1)\n", 914 | " transformed_X /= np.sqrt(transformed_X.shape[0]) # For numeric stability.\n", 915 | " y /= np.sqrt(transformed_X.shape[0]) # Now we also need to rescale y.\n", 916 | "\n", 917 | " # Apply the basis pursuit solver in the transformed coordinate system.\n", 918 | " print(\"Starting signal reconstruction.\")\n", 919 | " alpha = compute_minimum_l1_norm_solution(transformed_X, y)\n", 920 | "\n", 921 | " # Now we need to transform alpha into the original coordinate system\n", 922 | " # and reshape the image into an array of shape (self.d1, self.d2).\n", 923 | " reconstructed_image = np.zeros((self.d1, self.d2))\n", 924 | " offset = 0\n", 925 | " for i in range(self.d1//8):\n", 926 | " for j in range(self.d2//8):\n", 927 | " alpha_block = alpha[offset:offset+64]\n", 928 | " transformed_block = (Phi @ alpha_block).reshape(8, 8)\n", 929 | " reconstructed_image[8*i:8*(i+1), 8*j:8*(j+1)] = transformed_block\n", 930 | " offset += 64\n", 931 | "\n", 932 | " return reconstructed_image\n", 933 | " ############################################################################\n", 934 | "\n", 935 | "\n", 936 | " def take_picture(self, n_measurements, w_star):\n", 937 | " \"\"\" :n_measurements: The number of measurements y_i = to take.\n", 938 | " :w_star: The signal vector passed as a two-dimensional image.\n", 939 | " :returns: 
A picture represented in the standard basis, returned as a\n", 940 | " two-dimensional array.\n", 941 | " \"\"\"\n", 942 | " X = self.get_mirror_positions(n_measurements)\n", 943 | " y = self.take_measurements(X, w_star)\n", 944 | " return self.reconstruct_signal(X, y)\n", 945 | "\n", 946 | "```" 947 | ] 948 | }, 949 | { 950 | "cell_type": "markdown", 951 | "metadata": { 952 | "id": "YxofckZlBPHN" 953 | }, 954 | "source": [ 955 | "## Bibliographic Remarks" 956 | ] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "metadata": { 961 | "id": "M9jrFdYKPtF7" 962 | }, 963 | "source": [ 964 | "The basis pursuit linear program is attributed to the seminal work of *Chen, Donoho, and Saunders [1998]*, with the closely related lasso program introduced in the statistics literature by *Tibshirani [1996]*. The theoretical foundations of compressed sensing, provably demonstrating that sparse signals can be recovered from underdetermined linear systems, were laid down in the pioneering works of *Candes and Tao [2005]*, *Donoho [2006]* and\n", 965 | "*Candès, Romberg, and Tao [2006]*. The proof that RIP implies RNP presented in this practical session was taken from the paper by *Candes [2008]* and Proposition 7.11 in the textbook by *Wainwright [2019]*. The idea of single-pixel cameras was introduced by *Duarte, Davenport, Takhar, Laska, Sun, Kelly, and Baraniuk [2008]*; for the current state of single-\n", 966 | "pixel imaging, see the recent review paper by *Gibson, Johnson, and Padgett [2020]*. One (perhaps surprising) extension of the results presented in this practical session is that access to linear measurements is not necessary to recover the underlying sparse signal. Indeed, linear programming can be used to recover sparse signals from underdetermined systems of signs of linear measurements, as shown in the work of *Plan and Vershynin [2013]*. In this\n", 967 | "practical session, we have only glimpsed into the rich theory of compressed sensing and $\\ell_{1}$ regularization. The interested reader will find more advanced topics, further discussions and extensive bibliographic details in the textbooks by *Bühlmann and Van De Geer [2011]*,\n", 968 | "*Foucart and Rauhut [2013]*, *Hastie, Tibshirani, and Wainwright [2019]* and *Wainwright [2019]*." 969 | ] 970 | }, 971 | { 972 | "cell_type": "markdown", 973 | "metadata": { 974 | "id": "fJ26voWWPtXN" 975 | }, 976 | "source": [ 977 | "**References**\n", 978 | "\n", 979 | "P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.\n", 980 | "\n", 981 | "E. J. Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589–592, 2008.\n", 982 | "\n", 983 | "E. J. Candes and T. Tao. Decoding by linear programming. IEEE transactions on\n", 984 | "information theory, 51(12):4203–4215, 2005.\n", 985 | "\n", 986 | "E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 52(2):489–509, 2006.\n", 987 | "\n", 988 | "S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.\n", 989 | "\n", 990 | "D. L. Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4): 1289–1306, 2006.\n", 991 | "\n", 992 | "M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. 
Kelly, and\n", 993 | "R. G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE signal processing magazine, 25(2):83–91, 2008.\n", 994 | "\n", 995 | "S. Foucart and H. Rauhut. An invitation to compressive sensing. In A mathematical introduction to compressive sensing, pages 1–39. Springer, 2013.\n", 996 | "\n", 997 | "G. M. Gibson, S. D. Johnson, and M. J. Padgett. Single-pixel imaging 12 years on: a review. Optics Express, 28(19):28190–28208, 2020.\n", 998 | "\n", 999 | "T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2019.\n", 1000 | "\n", 1001 | "Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013.\n", 1002 | "\n", 1003 | "R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.\n", 1004 | "\n", 1005 | "M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019." 1006 | ] 1007 | } 1008 | ] 1009 | } -------------------------------------------------------------------------------- /practicals/implicit_regularization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "implicit_regularization.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "display_name": "Python 3", 13 | "name": "python3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "97Oj7xECbGAR" 24 | }, 25 | "source": [ 26 | "# Implicit/Iterative/Early-Stopping Regularization\n", 27 | "\n", 28 | " In this practical session, we introduce the concept of implicit (also called iterative or early-stopping) regularization. Our main objectives are the following:\n", 29 | "\n", 30 | "- understanding connections between ridge regression regularization and iterative regularization via gradient descent;\n", 31 | "- introducing an algorithm with sparsity-inducing implicit regularization effect;\n", 32 | "- understanding how the choice of an optimization algorithm can impact the statistical properties of the traced optimization path." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "metadata": { 38 | "id": "Vx-bim_Ela8o" 39 | }, 40 | "source": [ 41 | "import numpy as np\n", 42 | "from matplotlib import pyplot as plt\n", 43 | "import tensorflow as tf\n", 44 | "import tensorflow.experimental.numpy as tnp \n", 45 | "tnp.experimental_enable_numpy_behavior()" 46 | ], 47 | "execution_count": null, 48 | "outputs": [] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": { 53 | "id": "aSiohvOI3-l0" 54 | }, 55 | "source": [ 56 | "## Reusing Code From the \"Optimization\" Practical Session" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": { 62 | "id": "s54JvdCI9Dz_" 63 | }, 64 | "source": [ 65 | "We begin this session by importing some of the code already used in the \"optimization\" practical session. First, we will reuse our code for running gradient descent simulations by importing the `Optimizer` and `GradientDescent` classes." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "metadata": { 71 | "id": "bF-r-qpt9DQr" 72 | }, 73 | "source": [ 74 | "class Optimizer(object):\n", 75 | " \"\"\" A base class for optimizers. \"\"\"\n", 76 | "\n", 77 | " def __init__(self, eta):\n", 78 | " \"\"\" :eta_t: A function taking as argument the current iteration t and\n", 79 | " returning the step size eta_t to be used in the current iteration. \"\"\"\n", 80 | " super().__init__()\n", 81 | " self.eta = eta\n", 82 | " self.t = 0 # Set iterations counter.\n", 83 | "\n", 84 | " def apply_gradient(self, x_t, g_t):\n", 85 | " \"\"\" Given the current iterate x_t and gradient g_t, updates the value\n", 86 | " of x_t to x_(t+1) by performing one iterative update.\n", 87 | " :x_t: A tf.Variable which value is to be updated.\n", 88 | " :g_t: The gradient value, to be used for performing the update.\n", 89 | " \"\"\"\n", 90 | " raise NotImplementedError(\"To be implemented by subclasses.\")\n", 91 | "\n", 92 | " def step(self, f, x_t):\n", 93 | " \"\"\" Updates the variable x_t by performing one optimization iteration.\n", 94 | " :f: A function which is being minimized.\n", 95 | " :x_t: A tf.Variable with respect to which the function is being\n", 96 | " minimized and which value is to be updated\n", 97 | ".\n", 98 | " \"\"\"\n", 99 | " with tf.GradientTape() as tape:\n", 100 | " fx = f(x_t)\n", 101 | " g_t = tape.gradient(fx, x_t)\n", 102 | " self.apply_gradient(x_t, g_t)\n", 103 | " # Update the iterations counter.\n", 104 | " self.t += 1\n", 105 | "\n", 106 | " def optimize(self, f, x_t, n_iterations):\n", 107 | " \"\"\" Applies the function step n_iterations number of times, starting from\n", 108 | " the iterate x_t. Note: the number of iterations member self.t is not\n", 109 | " restarted to 0, which may affects the computed step sizes. \n", 110 | " :f: Function to optimize.\n", 111 | " :x_t: Current iterate x_t.\n", 112 | " :returns: A list of length n_iterations+1, containing the iterates\n", 113 | " [x_t, x_{t+1}, ..., x_{t+n_iterations}].\n", 114 | " \"\"\"\n", 115 | " x = tf.Variable(x_t)\n", 116 | " iterates = []\n", 117 | " iterates.append(x.numpy().reshape(-1,1))\n", 118 | " for _ in range(n_iterations):\n", 119 | " self.step(f, x)\n", 120 | " iterates.append(x.numpy().reshape(-1,1))\n", 121 | " return iterates" 122 | ], 123 | "execution_count": null, 124 | "outputs": [] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "metadata": { 129 | "id": "ea7pfJk6AiHh" 130 | }, 131 | "source": [ 132 | "class GradientDescent(Optimizer):\n", 133 | " \"\"\" An implementation of gradient descent uptades. \"\"\"\n", 134 | "\n", 135 | " def apply_gradient(self, x_t, g_t):\n", 136 | " eta_t = self.eta(self.t)\n", 137 | " x_t.assign_add(-eta_t * g_t)" 138 | ], 139 | "execution_count": null, 140 | "outputs": [] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": { 145 | "id": "WqQAy4V28pIF" 146 | }, 147 | "source": [ 148 | "It will be helpful to visualize the solution paths generated by different procedures. We will reuse the class `Convergence2DPlotting` from the \"optimization\" practical session notebook." 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "metadata": { 154 | "id": "v2jdnHe98FwZ" 155 | }, 156 | "source": [ 157 | "class Convergence2DPlotting(object):\n", 158 | " \"\"\" Plotting utils for visualizing optimization paths on 2D functions. 
\"\"\"\n", 159 | "\n", 160 | " def __init__(self):\n", 161 | " self.fig, self.ax = plt.subplots()\n", 162 | " self.fig.set_size_inches(8.0, 8.0)\n", 163 | " self.ax.set_aspect('equal')\n", 164 | "\n", 165 | " def plot_iterates(self, iterates, color='C0'):\n", 166 | " iterates = np.array(iterates).squeeze()\n", 167 | " x, y = iterates[:,0], iterates[:, 1]\n", 168 | " self.ax.scatter(x,y,s=0)\n", 169 | " for i in range(len(x)-1):\n", 170 | " self.ax.annotate('', xy=(x[i+1], y[i+1]), xytext=(x[i], y[i]),\n", 171 | " arrowprops={'arrowstyle': '->',\n", 172 | " 'color': color, 'lw': 2})\n", 173 | "\n", 174 | " def plot_contours(self, f):\n", 175 | " x_min, x_max = self.ax.get_xlim()\n", 176 | " y_min, y_max = self.ax.get_ylim()\n", 177 | "\n", 178 | " # Generate the contours of f on the above computed range.\n", 179 | " n_points = 50\n", 180 | " x = np.linspace(start=x_min, stop=x_max, num=n_points)\n", 181 | " y = np.linspace(start=y_min, stop=y_max, num=n_points)\n", 182 | " x, y = np.meshgrid(x, y)\n", 183 | " z = np.zeros_like(x)\n", 184 | " for x_idx in range(n_points):\n", 185 | " for y_idx in range(n_points):\n", 186 | " input = np.array([x[x_idx, y_idx], y[x_idx, y_idx]]).reshape(2,1)\n", 187 | " z[x_idx, y_idx] = f(input)\n", 188 | " self.ax.contour(x,y,z, colors='k')" 189 | ], 190 | "execution_count": null, 191 | "outputs": [] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": { 196 | "id": "EazmQxm3AsOt" 197 | }, 198 | "source": [ 199 | "## Ridge Regression vs Unregularized Gradient Descent" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": { 205 | "id": "lMpfKWJ4e98I" 206 | }, 207 | "source": [ 208 | "In this section, we attempt to understand the similarities between the ridge regression *regularization paths* and gradient descent *optimization paths* obtained by applying gradient descent updates to the *unregularized* empirical risk. For simplicity, we will consider a simple data-generating mechanism given by\n", 209 | "\\begin{align*}\n", 210 | " X &\\sim N(0, I_{d}), \\\\\n", 211 | " Y \\mid X = x &\\sim N(\\langle w^{\\star}, x\\rangle, \\sigma^{2}),\n", 212 | "\\end{align*}\n", 213 | "where $w^{\\star} \\in \\mathbb{R}^{d}$ denotes some ground-truth parameter.\n", 214 | "We implement code for sampling data from such distributions in the following cell.\n" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "metadata": { 220 | "id": "H4UK3bVgAwOu" 221 | }, 222 | "source": [ 223 | "class GaussianData(object):\n", 224 | " \"\"\" A class for sampling Gaussian data with i.i.d. N(0, I_d) covariates. \"\"\"\n", 225 | " \n", 226 | " def __init__(self, n, d, w_star, noise_std):\n", 227 | " \"\"\" :n: Number of data points.\n", 228 | " :d: Dimension of the covariates.\n", 229 | " :w_star: A d-dimensional vector used to generate noisy observations\n", 230 | " y_i = + N(0, noise_std**2).\n", 231 | " :noise_std: Standard deviation of the zero-mean Gaussian label noise.\n", 232 | " \"\"\"\n", 233 | " self.n = n\n", 234 | " self.d = d\n", 235 | " self.w_star = w_star.reshape(self.d, 1)\n", 236 | " self.noise_std = noise_std\n", 237 | " self.resample_data(seed=0)\n", 238 | "\n", 239 | " def resample_data(self, seed):\n", 240 | " \"\"\" Resamples dataset X, y using the given seed and stores it as class\n", 241 | " members. 
\"\"\"\n", 242 | " np.random.seed(seed)\n", 243 | " self.X = np.random.normal(loc=0.0, scale=1.0, size=(self.n, self.d))\n", 244 | " self.xi = np.random.normal(loc=0.0, scale=1.0, size=(self.n, 1))\n", 245 | " self.y = self.X @ self.w_star + self.xi\n", 246 | "\n", 247 | " def compute_empirical_risk(self, w):\n", 248 | " \"\"\" For a d-dimensional vector w, outputs R(w) = 1/n ||Xw - y||_{2}^{2}. \"\"\"\n", 249 | " return tnp.average((self.X @ w - self.y)**2)\n", 250 | "\n", 251 | " def compute_population_risk(self, w):\n", 252 | " \"\"\" For a d-dimensional vector w, outputs\n", 253 | " r(w) = ||w-w*||_{2}^{2} + noise_std**2. \"\"\"\n", 254 | " return tnp.sum((w - self.w_star)**2) + self.noise_std**2" 255 | ], 256 | "execution_count": null, 257 | "outputs": [] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": { 262 | "id": "turDM8y7iTnd" 263 | }, 264 | "source": [ 265 | "For the next exercise, fix the following setup. You are encouraged, however, to also experiment with other parameter choices." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "metadata": { 271 | "id": "Atz28J5_ifeK" 272 | }, 273 | "source": [ 274 | "isotropic_gaussian_2d_data = GaussianData(\n", 275 | " n=15,\n", 276 | " d=2, # d=2 for visualizing the paths.\n", 277 | " w_star=np.array([1.0, 0.0]), # Make w* \"sparse\".\n", 278 | " noise_std=1.0)" 279 | ], 280 | "execution_count": null, 281 | "outputs": [] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": { 286 | "id": "ZwlzOJoQJ0aa" 287 | }, 288 | "source": [ 289 | "### Exercise 1" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": { 295 | "id": "MGC67jmnJ378" 296 | }, 297 | "source": [ 298 | "This exercise introduces some connections between ridge regression and gradient descent applied to unregularized empirical risk. In your simulations, use the `isotropic_gaussian_2d_data` object, which contains two-dimensional Gaussian dataset $X, y$, where $X \\in \\mathbb{R}^{n \\times 2}$ and $y \\in \\mathbb{R}^{n}$.\n", 299 | "\n", 300 | "- For a grid of $m$ regularization parameters $\\lambda_{1}, \\dots, \\lambda_{m}$ of your choice, compute ridge regression estimates $w_{\\lambda}^{\\text{ridge}} = \\mathrm{argmin}_{w \\in \\mathbb{R}^{d}} \\frac{1}{n}\\|Xw - y\\|_{2}^{2} + \\frac{\\lambda}{n}\\|w\\|_{2}^{2}$.\n", 301 | "\n", 302 | "- Let $w_{0}^{\\text{gd}} = 0$. For some numer of iterations $T$ of your choice, compute iterates of gradient descent $w^{\\text{gd}}_{t+1} = w^{\\text{gd}}_{t} - \\eta \\nabla R(w_{t}^{\\mathrm{gd}})$, where $R(w) = \\frac{1}{n}\\|Xw - y\\|_{2}^{2}$ is the empirical risk and $\\eta > 0$ is some constant step size (the exact value does not matter, as long as it is small enough).\n", 303 | "\n", 304 | "- Plot the population risks attained by ridge regression *regularization path* $(w^{\\text{ridge}}_{\\lambda_{i}})_{i=1}^{m}$ and gradient descent *optimization path* $(w^{\\text{gd}}_{t})_{t=0}^{T}$. Comment on your findings.\n", 305 | "\n", 306 | "- Using the class `Convergence2DPlotting` visualize the computed ridge regression regularization path and gradient descent optimization path.\n", 307 | "\n", 308 | "- Comment on different computational considerations concerning the computation of regularization and optimization paths." 
309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": { 314 | "id": "J-b5TLSLaYJT" 315 | }, 316 | "source": [ 317 | "#### Solution" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": { 323 | "id": "jZ8bDL2Nj8pV" 324 | }, 325 | "source": [ 326 | "For a given $\\lambda > 0$, ridge regression estimate $w_{\\lambda}$ can be computed by $(X^{\\mathsf{T}}X + \\lambda I)^{-1} X^{\\mathsf{T}}y$. We implement computation of ridge regression regularization paths below." 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "metadata": { 332 | "id": "wvcMDjyjlE5j" 333 | }, 334 | "source": [ 335 | "def compute_ridge_regression_regularization_path(data, lambdas):\n", 336 | " \"\"\" :data: An object of type GaussianData.\n", 337 | " :lambdas: A sorted list of regularization parameters lambda.\n", 338 | " :returns: A list of fitted parameters w_{\\lambda}, one for\n", 339 | " each provided lambda.\n", 340 | " \"\"\"\n", 341 | " # Compute and store the values of X^{t}X and X^{t}y and Id.\n", 342 | " XtX = np.transpose(data.X) @ data.X\n", 343 | " Xty = np.transpose(data.X) @ data.y\n", 344 | " Id = np.identity(data.d)\n", 345 | " \n", 346 | " regularization_path = []\n", 347 | " for l in lambdas:\n", 348 | " w_lambda = np.linalg.inv(XtX + l*Id) @ Xty\n", 349 | " regularization_path.append(w_lambda)\n", 350 | " return regularization_path" 351 | ], 352 | "execution_count": null, 353 | "outputs": [] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "metadata": { 358 | "id": "b9oI5_8Xmb5l" 359 | }, 360 | "source": [ 361 | "# Create a grid of regularization parameters equally spaced on a log scale.\n", 362 | "lambdas = np.exp(np.linspace(start=np.log(10**(-3)), stop=np.log(1e3),\n", 363 | " num=2000))\n", 364 | "ridge_regularization_path = compute_ridge_regression_regularization_path(\n", 365 | " isotropic_gaussian_2d_data, lambdas)" 366 | ], 367 | "execution_count": null, 368 | "outputs": [] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": { 373 | "id": "E5GblpXPqGZX" 374 | }, 375 | "source": [ 376 | "Computation of gradient descent optimization path can be readily obtained via `GradientDescent` optimizer class." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "metadata": { 382 | "id": "mkdbH0KssVMu" 383 | }, 384 | "source": [ 385 | "T = 2000\n", 386 | "eta = lambda t: 0.001\n", 387 | "w_0 = np.zeros(isotropic_gaussian_2d_data.d).reshape(-1, 1)\n", 388 | "gradient_descent = GradientDescent(eta)\n", 389 | "gd_optimization_path = gradient_descent.optimize(\n", 390 | " isotropic_gaussian_2d_data.compute_empirical_risk, w_0, T)" 391 | ], 392 | "execution_count": null, 393 | "outputs": [] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": { 398 | "id": "T1wIc93XxlN6" 399 | }, 400 | "source": [ 401 | "We can now compare the population (or out-of-sample) risks attained by ridge regression and gradient descent." 
402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "metadata": { 407 | "id": "MClxCf52xqM_" 408 | }, 409 | "source": [ 410 | "ridge_risks = [isotropic_gaussian_2d_data.compute_population_risk(w) \\\n", 411 | " for w in ridge_regularization_path]\n", 412 | "gd_risks = [isotropic_gaussian_2d_data.compute_population_risk(w) \\\n", 413 | " for w in gd_optimization_path]\n", 414 | "plt.plot(ridge_risks)\n", 415 | "plt.plot(gd_risks)\n", 416 | "plt.xlabel('Parameter index')\n", 417 | "plt.ylabel('Population risk')\n", 418 | "plt.legend(['ridge', 'gd'])" 419 | ], 420 | "execution_count": null, 421 | "outputs": [] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": { 426 | "id": "BAhf5ws66t6i" 427 | }, 428 | "source": [ 429 | "We observe that the minimum population risk achieved by both algorithms is approximately the same. Notice that large values of $\lambda$ for ridge regression correspond to small values of $t$ for gradient descent (i.e., using a lot of regularization). On the other hand, small values of $\lambda$ correspond to large values of $t$ (using little regularization).\n", 430 | "\n", 431 | "Let us now plot the obtained regularization and optimization paths against the contours of the population risk." 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "metadata": { 437 | "id": "xW5BJfEf72Nt" 438 | }, 439 | "source": [ 440 | "path_plots = Convergence2DPlotting()\n", 441 | "path_plots.plot_iterates(ridge_regularization_path, color='C0') \n", 442 | "path_plots.plot_iterates(gd_optimization_path, color='C1')\n", 443 | "path_plots.ax.set_ylim(-0.2, 0.45)\n", 444 | "path_plots.ax.set_xlim(-0.1, 1.1)\n", 445 | "path_plots.plot_contours(isotropic_gaussian_2d_data.compute_population_risk)\n", 446 | "path_plots.ax.scatter(isotropic_gaussian_2d_data.w_star[0],\n", 447 | " isotropic_gaussian_2d_data.w_star[1],\n", 448 | " marker='x',\n", 449 | " color='red')" 450 | ], 451 | "execution_count": null, 452 | "outputs": [] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": { 457 | "id": "-lA_Dpfc8kbr" 458 | }, 459 | "source": [ 464 | "**We observe that the gradient descent optimization path nearly matches that of the ridge regression regularization path.** We will make this observation more precise in the next exercise.\n", 465 | "\n", 466 | "Finally, regarding the computational considerations, notice that **we generated new points along the ridge regularization path by solving a new optimization problem for each value of $\lambda$.** While there are better ways to compute regularization paths (e.g., based on *warm restarts*), computational analysis and implementation might become more tricky in such cases. 
On the other hand, **obtaining new iterates along the gradient optimization path only cost one gradient descent update**, resulting in an arguably more straightforward and more efficient procedure." 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": { 472 | "id": "zZ8J8tafazgI" 473 | }, 474 | "source": [ 475 | "### Exercise 2" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": { 481 | "id": "X0daTkG4be_E" 482 | }, 483 | "source": [ 484 | "In exercse 1, we have observed that the gradient descent optimization path nearly matches the ridge regression regularization path. In this exercise we investigate why this happens. Recall that ridge regression estimates $w_{\\lambda}$ can be computed via the following expression:\n", 485 | "$$\n", 486 | " w_{\\lambda} = T^{\\text{ridge}}_{\\lambda}(X^{\\mathsf{T}}X)X^{\\mathsf{T}}y ,\\text{ where }\n", 487 | " T^{\\text{ridge}}_{\\lambda}(X^{\\mathsf{T}}X) = (X^{\\mathsf{T}}X + \\lambda I_{d})^{-1}.\n", 488 | "$$\n", 489 | "The mapping $T_{\\lambda}^{\\text{ridge}}$ can be seen as an operator acting on the eigenvalues of the sample covariance matrix $X^{\\mathsf{T}}X$. In particular, $T_{\\lambda}^{\\text{ridge}}$ aims to invert the eigenvalues of $X^{\\mathsf{T}}X$, with the inverse getting closer to the true inverse as $\\lambda \\to 0$.\n", 490 | "\n", 491 | "- Find an expression for some operator $T^{\\text{gd}}_{\\eta, t}$ acting on the eigenvalues of $X^{\\mathsf{T}}X$, such that the $t$-th iterate of gradient descent, obtained with a constant step-size $\\eta$ and $w_0 = 0$ satisfies\n", 492 | "$$\n", 493 | " w_{t} = T^{\\text{gd}}_{\\eta, t}(X^{\\mathsf{T}}X)X^{\\mathsf{T}}y.\n", 494 | "$$\n", 495 | "- Using the above-computed expression, suggest (informally) some mapping $f$ between $\\lambda$ and $(\\eta, t)$, such that $w^{\\text{ridge}}_{\\lambda} \\approx w^{\\text{gd}}_{t}$.\n", 496 | "- Using the above-suggested mapping $f$ which approximately corresponds to $\\lambda = f(\\eta, t)$, plot gradient descent and ridge regression population risks computed in Exercise 1. For the x-axis, use the number of gradient descent iterations. The ridge regression lambdas should be renormalized using the mapping $f$. **You should find a mapping $f$ such that the two curves approximately overlap.**" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": { 502 | "id": "7TAQAKalD-rB" 503 | }, 504 | "source": [ 505 | "#### Solution" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": { 511 | "id": "3qlSlTrXEArR" 512 | }, 513 | "source": [ 514 | "For gradient descent, it can be shown by induction that\n", 515 | "$$\n", 516 | " w_{t}^{\\text{gd}} = \\underbrace{\\frac{2\\eta}{n}\\left(I + \\left(I - \\frac{2\\eta}{n}X^{\\mathsf{T}}X\\right) + \\dots + \\left(I - \\frac{2\\eta}{n}X^{\\mathsf{T}}X\\right)^{t-1}\\right)}_{= T_{\\eta,t}^{\\text{gd}}(X^{\\mathsf{T}}X) }X^{\\mathsf{T}}y.\n", 517 | "$$\n", 518 | "Suppose that the eigenvalues of $X^{\\mathsf{T}}X$ are given by $\\mu_{1}, \\dots, \\mu_{d}$. 
Then, the operator $T_{\\eta,t}^{\\text{gd}}$ acts on the $i$-th eigenvalue $\\mu_{i}$ via the formula:\n", 519 | "$$\n", 520 | " \\mu_{i} \\mapsto \\frac{2\\eta}{n}\\sum_{s=0}^{t-1}\\left(1 - \\frac{2\\eta}{n}\\mu_{i}\\right)^{s}\n", 521 | " =\n", 522 | " \\frac{1 - \\left(1 - \\frac{2\\eta}{n}\\mu_{i}\\right)^{t}}{\\mu_{i}}.\n", 523 | "$$\n", 524 | "**In essence, the gradient descent iterates are inverting the eigenvalues of the sample covariance matrix using a slightly different formula than ridge regression.**\n", 525 | "Recall that the corresponding ridge operator $T_{\\lambda}^{\\text{ridge}}$ acts on the $i$-th eigenvalue $\\mu_{i}$ via the mapping\n", 526 | "$$\n", 527 | " \\mu_{i} \\mapsto \\frac{1}{\\mu_{i} + \\lambda}.\n", 528 | "$$\n", 529 | "Equating the right-hand sides of the two expressions above, we for given values of $\\lambda$ and $\\eta$, we want to find $t$ such that\n", 530 | "$$\n", 531 | " \\frac{1 - \\left(1 - \\frac{2\\eta}{n}\\mu_{i}\\right)^{t}}{\\mu_{i}}\n", 532 | " \\approx \\frac{1}{\\mu_{i} + \\lambda}.\n", 533 | "$$\n", 534 | "In the isotropic setting considered in our simulations, we have $\\mu_{1} \\approx \\mu_{2} \\approx n$. Solving the above equation for $t$ yields the expression\n", 535 | "$$\n", 536 | " t \\approx \\frac{\\log\\frac{\\lambda}{n + \\lambda}}{\\log(1-2\\eta)}.\n", 537 | "$$\n", 538 | "We apply the above formula in the below code cell. Observe that the generated risk curves using this mapping are aligned as required." 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "metadata": { 544 | "id": "Zd4CsclHEzTd" 545 | }, 546 | "source": [ 547 | "# Note that small lambda corresponds to large T (i.e., fitting to convergence).\n", 548 | "eta = gradient_descent.eta(0)\n", 549 | "X = isotropic_gaussian_2d_data.X\n", 550 | "n = X.shape[0]\n", 551 | "remapped_lambdas = np.log(lambdas/(n +lambdas)) / np.log(1.0 - 2.0*eta)\n", 552 | "# We want to keep the largest remmaped_lambdas element up to T, so we will\n", 553 | "# remove some part of this above lambdas.\n", 554 | "truncate_idx = np.argmin(remapped_lambdas > T)\n", 555 | "plt.plot(remapped_lambdas[truncate_idx:], ridge_risks[truncate_idx:])\n", 556 | "plt.plot(np.arange(T+1), gd_risks)\n", 557 | "plt.xlabel('t')\n", 558 | "plt.ylabel('Population risk')\n", 559 | "plt.legend(['ridge (lambda mapped to t)', 'gd'])" 560 | ], 561 | "execution_count": null, 562 | "outputs": [] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": { 567 | "id": "9uTiHI9Hatyd" 568 | }, 569 | "source": [ 570 | "## Sparsity-Inducing Implicit Regularization" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": { 576 | "id": "7xSXkRR5CVqG" 577 | }, 578 | "source": [ 579 | "In the previous section we have seen how gradient descent applied to the unregularized empirical risk generate iterates that approximately correspond to the ridge regression regularization path. Ridge regularization is, of course, just one of the many ways to induce a regularizing effect. 
For instance, for learning problems with some underlying sparsity structure, lasso regularization might be preferred, giving the regularization path $(w_{\\lambda}^{\\text{lasso}})_{\\lambda \\geq 0}$ whose iterates are defined by\n", 580 | "$$\n", 581 | " w_{\\lambda} \\in \\mathrm{argmin}_{w \\in \\mathbb{R}^{d}} \\frac{1}{n} \\|Xw - y\\|_{2}^{2} + \\frac{\\lambda}{n}\\|w\\|_{1}.\n", 582 | "$$\n", 583 | "Indeed, the toy setup explored in our previous simulations exhibits some underlying sparsity structure, as the parameter that generates the data $w^{\\star} = (1, 0)^{\\mathsf{T}}$ can be considered as a sparse vector.\n", 584 | " We can indeed show that the lasso regularization path contains marginally better solutions for our two-dimensional toy problem than the ridge regularization path (for larger and more sparse problems lasso penalties would significantly outperform ridge penalties)." 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "metadata": { 590 | "id": "XV98P7mlGIwO" 591 | }, 592 | "source": [ 593 | "from sklearn import linear_model\n", 594 | "\n", 595 | "def compute_lasso_regression_regularization_path(data, lambdas):\n", 596 | " regularization_path = []\n", 597 | " for l in lambdas:\n", 598 | " lasso = linear_model.Lasso(alpha=l/(2.0*n), fit_intercept=False)\n", 599 | " lasso.fit(data.X, data.y)\n", 600 | " w_lambda = lasso.coef_.reshape(-1, 1)\n", 601 | " regularization_path.append(w_lambda)\n", 602 | " return regularization_path" 603 | ], 604 | "execution_count": null, 605 | "outputs": [] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "metadata": { 610 | "id": "8GLHDPD7HRAO" 611 | }, 612 | "source": [ 613 | "lasso_regularization_path = compute_lasso_regression_regularization_path(\n", 614 | " isotropic_gaussian_2d_data, lambdas)\n", 615 | "lasso_risks = [isotropic_gaussian_2d_data.compute_population_risk(w) \\\n", 616 | " for w in lasso_regularization_path]\n", 617 | "plt.plot(ridge_risks, color='C0')\n", 618 | "plt.plot(lasso_risks, color='C2')\n", 619 | "plt.xlabel(r'$\\lambda$')\n", 620 | "plt.ylabel('Population risk')\n", 621 | "plt.legend(['ridge', 'lasso'])\n", 622 | "plt.ylim(0.9,1.4)\n", 623 | "plt.show()\n", 624 | "path_plots = Convergence2DPlotting()\n", 625 | "path_plots.plot_iterates(ridge_regularization_path, color='C0') \n", 626 | "path_plots.plot_iterates(lasso_regularization_path, color='C2')\n", 627 | "path_plots.ax.set_ylim(-0.2, 0.45)\n", 628 | "path_plots.ax.set_xlim(-0.1, 1.1)\n", 629 | "path_plots.plot_contours(isotropic_gaussian_2d_data.compute_population_risk)\n", 630 | "path_plots.ax.scatter(isotropic_gaussian_2d_data.w_star[0],\n", 631 | " isotropic_gaussian_2d_data.w_star[1],\n", 632 | " marker='x',\n", 633 | " color='red')" 634 | ], 635 | "execution_count": null, 636 | "outputs": [] 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": { 641 | "id": "2MvKDVXoJmM3" 642 | }, 643 | "source": [ 644 | "A natural question occurs: **how can one introduce a non-Euclidean regularization effect implicitly?** In the following exercise, we present one such way, which induces a sparsity-promoting regularization effect by running gradient descent on a reparametrized coordinate system." 
645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": { 650 | "id": "8UXZSb6MBVVs" 651 | }, 652 | "source": [ 653 | "### Exercise 3" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": { 659 | "id": "nAZ0FtKcMQ1x" 660 | }, 661 | "source": [ 662 | "For a vector $u \\in \\mathbb{R}^{d}$ let $u^{2}$ denote a component-wise square operation. Consider the reparametrization $w = u^{2}$ (viable for any $w \\in \\mathbb{R}^{d}$ with non-negative components (to treat the general case, we could instead consider $w = u^{2} - v^{2}$; for simplicity, we will consider the reparametrization $w = u^{2}$ only).\n", 663 | "\n", 664 | "In the new coordinate system $w = u^{2}$, the empirical risk function is defined as\n", 665 | "$$\n", 666 | " \\widetilde{R}(u) = \\frac{1}{n} \\|Xu^{2} - y\\|_{2}^{2}.\n", 667 | "$$\n", 668 | "Consider setting $u_{0} = (0.01, 0.01)^{\\mathsf{T}}$ and running gradient descent on $\\widetilde{R}$ for the problem instance stored in the python variable `isotropic_gaussian_2d_data`:\n", 669 | "- plot the population risks traced by the gradient descent iterates $(u_{t})_{t=0}^{T}$;\n", 670 | "- plot the two-dimensional optimization path;\n", 671 | "- compare the simulation output with the ridge and lasso regularization paths." 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": { 677 | "id": "BCiMEauWBYET" 678 | }, 679 | "source": [ 680 | "#### Solution" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": { 686 | "id": "6vTM7ynALLjY" 687 | }, 688 | "source": [ 689 | "The reparametrized gradient descent can be implemented in just a few lines by redefining the function we want to optimize in the new coordinate system." 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "metadata": { 695 | "id": "AdTztsYkc9BM" 696 | }, 697 | "source": [ 698 | "class ReparametrizedGradientDescent(GradientDescent):\n", 699 | "\n", 700 | " def optimize(self, f, w_0, n_iterations):\n", 701 | " \"\"\" The initialization point w_0 needs to have stricly positive coordinates.\n", 702 | " This function optimizes f using reparametrization w = u^2, running gradient\n", 703 | " descent updates on the parameter u. \"\"\"\n", 704 | " def g(u):\n", 705 | " return f(u**2)\n", 706 | " u_0 = tnp.sqrt(w_0)\n", 707 | " u_iterates = super().optimize(g, tnp.sqrt(w_0), n_iterations)\n", 708 | " w_iterates = [u**2 for u in u_iterates]\n", 709 | " return w_iterates" 710 | ], 711 | "execution_count": null, 712 | "outputs": [] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": { 717 | "id": "3CaKSrENLdPF" 718 | }, 719 | "source": [ 720 | "We can now execute the algorithm." 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "metadata": { 726 | "id": "6CenTGjXUAJG" 727 | }, 728 | "source": [ 729 | "w_0 = np.array([1e-4, 1e-4]).reshape(-1, 1)\n", 730 | "reparametrized_gd = ReparametrizedGradientDescent(eta = lambda t : 0.005)\n", 731 | "reparametrized_iterates = reparametrized_gd.optimize(\n", 732 | " isotropic_gaussian_2d_data.compute_empirical_risk, w_0, 2000)\n", 733 | "reparametrized_gd_risks = [isotropic_gaussian_2d_data.compute_population_risk(w) \\\n", 734 | " for w in reparametrized_iterates]" 735 | ], 736 | "execution_count": null, 737 | "outputs": [] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": { 742 | "id": "IY0T8VCyLf0m" 743 | }, 744 | "source": [ 745 | "Finally, we can visually observe that the reparametrized gradient descent is indeed biased towards first exploring sparse iterates." 
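For intuition, the following short calculation (an addition, not part of the original solution) explains where this bias comes from. Since $\widetilde{R}(u) = R(u^{2})$, the chain rule gives $\partial_{u_{i}}\widetilde{R}(u) = 2u_{i}\,[\nabla R(u^{2})]_{i}$, so a gradient descent step on $u$ reads
$$
 u_{t+1, i} = u_{t,i} - \eta\, \partial_{u_i}\widetilde{R}(u_{t}) = u_{t,i}\left(1 - 2\eta\,[\nabla R(w_{t})]_{i}\right), \qquad w_{t} = u_{t}^{2},
$$
and hence, in the original coordinates,
$$
 w_{t+1, i} = w_{t,i}\left(1 - 2\eta\,[\nabla R(w_{t})]_{i}\right)^{2}.
$$
The updates on $w$ are therefore multiplicative rather than additive: coordinates initialized close to zero (here $10^{-4}$) barely move unless their empirical-risk gradient remains large over many iterations, while already-large coordinates are updated quickly. This is what biases the early iterates towards sparse vectors.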
746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "metadata": { 751 | "id": "Fjt_sAkvUVOe" 752 | }, 753 | "source": [ 754 | "plt.plot(ridge_risks, color='C0')\n", 755 | "plt.plot(gd_risks, color='C1')\n", 756 | "plt.plot(lasso_risks, color='C2')\n", 757 | "plt.plot(reparametrized_gd_risks, color='C3')\n", 758 | "plt.xlabel(r'$\\lambda$ or $t$')\n", 759 | "plt.ylabel('Population risk')\n", 760 | "plt.legend(['ridge', 'gd', 'lasso', 'reparametrized gd'])\n", 761 | "plt.ylim(1.0,1.3)\n", 762 | "plt.show()\n", 763 | "\n", 764 | "path_plots = Convergence2DPlotting()\n", 765 | "path_plots.plot_iterates(ridge_regularization_path, color='C0') \n", 766 | "path_plots.plot_iterates(gd_optimization_path, color='C1')\n", 767 | "path_plots.plot_iterates(lasso_regularization_path, color='C2')\n", 768 | "path_plots.plot_iterates(reparametrized_iterates, color='C3')\n", 769 | "path_plots.ax.set_ylim(-0.2, 0.45)\n", 770 | "path_plots.ax.set_xlim(-0.1, 1.1)\n", 771 | "path_plots.plot_contours(isotropic_gaussian_2d_data.compute_population_risk)\n", 772 | "path_plots.ax.scatter(isotropic_gaussian_2d_data.w_star[0],\n", 773 | " isotropic_gaussian_2d_data.w_star[1],\n", 774 | " marker='x',\n", 775 | " color='red')" 776 | ], 777 | "execution_count": null, 778 | "outputs": [] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": { 783 | "id": "myOMA-TuERMX" 784 | }, 785 | "source": [ 786 | "## Bibliographic Remarks" 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": { 792 | "id": "VwYJl7FkYzo7" 793 | }, 794 | "source": [ 795 | "The use of early-stopping as a regularization mechanism dates back at least to Landweber iteration [*Landweber, 1951*] studied in the context of ill-posed inverse problems; for this point of view, see the textbook by *Engl, Hanke, and Neubauer [1996]*. As early as in the 90s, early stopping has also been one of the standard ways to regularize the training procedures used to fit neural network parameters [*Prechelt, 1998*]. In the statistics literature, the first\n", 796 | "paper to connect early stopping to the notion of minimax optimality is due to *Bühlmann and Yu [2003]*. Implicit regularization effects in the Euclidean setting of gradient descent updates are by now well-understood; see, for example, the paper by *Yao, Rosasco, and Caponnetto [2007]*, *Raskutti, Wainwright, and Yu [2014]*, *Wei, Yang, and Wainwright [2019]*.\n", 797 | "Much less is known about non-Euclidean setups. The quadratic reparametrization trick was noticed in [*Gunasekar, Woodworth, Bhojanapalli, Neyshabur, and Srebro, 2017*] in the context of matrix factorization. Linear regression setting with quadratic reparametrization was studied by *Vaškevičius, Kanade, and Rebeschini [2019]*, where minimax optimal bounds were obtained for early-stopped iterates under the signal-sparsity and the restricted isometry assumption on the design matrix. Later, it was shown in [*Vaškevičius, Kanade, and Rebeschini, 2020*] how to obtain estimation error upper bounds for early-stopped iterates of mirror descent algorithms via offset Rademacher complexities. For other works concerning early-stopping in non-Euclidean setups, see, for example, [*Osher, Ruan, Xiong, Yao, and Yin, 2016, Molinari,\n", 798 | "Massias, Rosasco, and Villa, 2021, Wu and Rebeschini, 2021*]." 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": { 804 | "id": "_2yYioTPZt52" 805 | }, 806 | "source": [ 807 | "**References**\n", 808 | "\n", 809 | "P. Bühlmann and B. Yu. 
Boosting with the l2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.\n", 810 | "\n", 811 | "H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.\n", 812 | "\n", 813 | "S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.\n", 814 | "\n", 815 | "L. Landweber. An iteration formula for fredholm integral equations of the first kind. American journal of mathematics, 73(3):615–624, 1951.\n", 816 | "\n", 817 | "C. Molinari, M. Massias, L. Rosasco, and S. Villa. Iterative regularization for convex regularizers. In International Conference on Artificial Intelligence and Statistics, pages 1684–1692. PMLR, 2021.\n", 818 | "\n", 819 | "S. Osher, F. Ruan, J. Xiong, Y. Yao, and W. Yin. Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 41(2):436–469, 2016.\n", 820 | "\n", 821 | "L. Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 1998.\n", 822 | "\n", 823 | "G. Raskutti, M. J. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.\n", 824 | "\n", 825 | "T. Vaškevičius, V. Kanade, and P. Rebeschini. The statistical complexity of early-stopped mirror descent. arXiv preprint arXiv:2002.00189, 2020.\n", 826 | "\n", 827 | "T. Vaškevičius, V. Kanade, and P. Rebeschini. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems, pages 2968–2979, 2019.\n", 828 | "\n", 829 | "Y. Wei, F. Yang, and M. J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. IEEE Transactions on Information Theory, 65(10):6685–6703, 2019.\n", 830 | "\n", 831 | "F. Wu and P. Rebeschini. Nearly minimax-optimal rates for noisy sparse phase retrieval via early-stopped mirror descent. arXiv preprint arXiv:2105.03678, 2021.\n", 832 | "\n", 833 | "Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007." 
834 | ] 835 | } 836 | ] 837 | } -------------------------------------------------------------------------------- /practicals/limitations_of_gradient_based_learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "limitations_of_gradient_based_learning.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "display_name": "Python 3", 13 | "language": "python", 14 | "name": "python3" 15 | }, 16 | "language_info": { 17 | "codemirror_mode": { 18 | "name": "ipython", 19 | "version": 3 20 | }, 21 | "file_extension": ".py", 22 | "mimetype": "text/x-python", 23 | "name": "python", 24 | "nbconvert_exporter": "python", 25 | "pygments_lexer": "ipython3", 26 | "version": "3.8.3" 27 | }, 28 | "latex_envs": { 29 | "LaTeX_envs_menu_present": true, 30 | "autoclose": false, 31 | "autocomplete": true, 32 | "bibliofile": "biblio.bib", 33 | "cite_by": "apalike", 34 | "current_citInitial": 1, 35 | "eqLabelWithNumbers": true, 36 | "eqNumInitial": 1, 37 | "hotkeys": { 38 | "equation": "Ctrl-E", 39 | "itemize": "Ctrl-I" 40 | }, 41 | "labels_anchors": false, 42 | "latex_user_defs": false, 43 | "report_style_numbering": false, 44 | "user_envs_cfg": false 45 | } 46 | }, 47 | "cells": [ 48 | { 49 | "cell_type": "markdown", 50 | "metadata": { 51 | "id": "_X052qCZh00o" 52 | }, 53 | "source": [ 54 | "# Limitations of Gradient-Based Learning\n", 55 | "\n", 56 | " In this practical session we explore some limitations of learning algorithms based on gradient updates. Our main objectives are the following:\n", 57 | "\n", 58 | "- introducing the problem of learning parity functions under a uniform distribution of covariates on the $d$-dimensional boolean hypercube;\n", 59 | "- understanding if the above problem is easy from the statistical perspective;\n", 60 | "- understanding if the above problem is easy from the computational perspective;\n", 61 | "- understanding intuitively why the above problem is not solvable in polynomial time for a large class of algorithms.\n" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "metadata": { 67 | "id": "WRPHmKQMiOwN" 68 | }, 69 | "source": [ 70 | "import numpy as np\n", 71 | "from matplotlib import pyplot as plt\n", 72 | "import tensorflow as tf\n", 73 | "import tensorflow.experimental.numpy as tnp\n", 74 | "tnp.experimental_enable_numpy_behavior()\n", 75 | "plt.rcParams['figure.figsize'] = [9, 6]" 76 | ], 77 | "execution_count": null, 78 | "outputs": [] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "id": "ECKHl-WPsTlR" 84 | }, 85 | "source": [ 86 | "## The Problem of Learning Parity Functions" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": { 92 | "id": "l34DTIMAlg5r" 93 | }, 94 | "source": [ 95 | "Learning parity functions is a classical example of a problem that is easy to state, yet hard to solve for a large class of machine learning algorithms (specifically, for *statistical query* algorithms; see the bibliographic remarks section). In this practical session, we explore why learning parity functions is difficult for parametric models trained by gradient descent.\n", 96 | "\n", 97 | "To introduce the problem, let $d$ denote the input dimension and let the input space be $\\mathcal{X} = \\{-1, +1\\}^{d}$ — the $d$-dimensional boolean hypercube. 
Given any subset $S \\subseteq \\{0, 1, \\dots, d - 1\\}$, the *parity function* $f_{S} : \\mathcal{X} \\to \\{-1, +1\\}$ is defined by\n", 98 | "$$\n", 99 | " f_{S}(x) = \\prod_{i \\in S} x_{i}, \\quad \\text{where } x = (x_{0}\\, x_{1}\\, \\dots\\, x_{d-1})^{\\mathsf{T}} \\in \\mathcal{X}.\n", 100 | "$$\n", 101 | "Thus, $f_{S}(x)$ evaluates to $+1$ if the number of negative coordinates of $x$ with indexes in the support set $S$ is even; otherwise $f_{S}(x)$ evaluates to $-1$.\n", 102 | "\n", 103 | "In the following exercise, we are asked to implement functions for generating synthetic data for our experiments. Given natural numbers $n$, $d$ and a support set $S \\subset \\{0, \\dots, d-1\\}$, a sample dataset contains a matrix $X \\in \\{-1, +1\\}^{n \\times d}$ and a vector $y \\in \\{-1, +1\\}^{n}$. Letting $x_{i}$ denote the $i$-th row of the matrix $X$, the $i$-th data point is $(x_{i}, y_{i}) = (x_{i}, f_{S}(x_{i}))$. We will only consider covariate vectors $x_{i}$ whose elements are sampled independently and uniformly from the set $\\{-1, +1\\}$. To summarize, the data generating mechanism considered in this practical session can be described as follows:\n", 104 | "\\begin{align*}\n", 105 | " X &\\sim \\text{Uniform}(\\mathcal{X}), \\\\\n", 106 | " Y \\mid X = x &\\sim \\delta_{f_{S}(x)}\\text{ for some fixed parity function }f_{S}.\n", 107 | "\\end{align*}\n", 108 | "The notation $\\delta_{x}$ denotes a probability distribution that assigns all mass to the point $x$. " 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": { 114 | "id": "BfsLwa5glaaE" 115 | }, 116 | "source": [ 117 | "### Exercise 1" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "id": "XOZ12WYViFEQ" 124 | }, 125 | "source": [ 126 | "Fill in the missing implementation details in the code cell below." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "metadata": { 132 | "id": "7JOOdRAbmAfI" 133 | }, 134 | "source": [ 135 | "class ParityFunction(object):\n", 136 | "\n", 137 | " def __init__(self, S):\n", 138 | " \"\"\" :S: A subset of {0,1,...,d-1} denoting the support of the parity\n", 139 | " function represented by this object. \"\"\"\n", 140 | " self.S = np.array(S)\n", 141 | "\n", 142 | " def __call__(self, X):\n", 143 | " \"\"\" :X: Either a d-dimensional vector or a matrix in {-1, +1}^{n \\times d},\n", 144 | " whose each row contains an input x_{i} \\in {-1, +1}^{d}.\n", 145 | " :returns: A vector in {-1, +1}^{n} whose i-th entry contains the output\n", 146 | " of the parity function with support set self.S evaluated at x_{i}.\n", 147 | " \"\"\"\n", 148 | " ############################################################################\n", 149 | " # Exercise 1.1. Fill in the missing code below.\n", 150 | "\n", 151 | " ############################################################################\n", 152 | "\n", 153 | "def generate_uniform_covariates(n, d):\n", 154 | " \"\"\" :n: The number of covariate vectors.\n", 155 | " :d: The dimension of \n", 156 | " :returns: A matrix in {-1, +1}^{n \\times d} with i.i.d. entries sampled\n", 157 | " uniformly from {-1, +1}.\n", 158 | " \"\"\"\n", 159 | " ############################################################################\n", 160 | " # Exercise 1.2. 
Fill in the missing code below.\n", 161 | " \n", 162 | " ############################################################################\n", 163 | "\n", 164 | "\n", 165 | "# We will now generate a small dataset with S={0,1}, d=4 and n=5.\n", 166 | "np.random.seed(0)\n", 167 | "X = generate_uniform_covariates(n=5, d=4)\n", 168 | "y = ParityFunction(S=[0,1])(X)\n", 169 | "print(\"X = \\n\", X)\n", 170 | "print(\"y = \\n\", y)" 171 | ], 172 | "execution_count": null, 173 | "outputs": [] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": { 178 | "id": "H4WVLOboo-Kr" 179 | }, 180 | "source": [ 181 | "#### Solution" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": { 187 | "id": "OX7q94iRpA0O" 188 | }, 189 | "source": [ 190 | "Exercise 1.1.\n", 191 | "```\n", 192 | " def __call__(self, X):\n", 193 | " if len(X.shape) == 1:\n", 194 | " # In case a vector was passed, convert it to a matrix.\n", 195 | " X.reshape(1, -1)\n", 196 | " y = np.prod(X[:, self.S], axis=1)\n", 197 | " return y\n", 198 | "```\n", 199 | "\n", 200 | "Exercise 1.2.\n", 201 | "```\n", 202 | "return np.random.binomial(n=1, p=0.5, size=(n, d)) * 2.0 - 1.0\n", 203 | "````" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "id": "92ESYZOst_ci" 210 | }, 211 | "source": [ 212 | "## Training a Neural Network" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": { 218 | "id": "GZymgQXcqgCn" 219 | }, 220 | "source": [ 221 | "We will now attempt to learn a parity function by training a ReLU neural network with one hidden layer. First, let us generate training and validation data." 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "metadata": { 227 | "id": "qwulokCorC-n" 228 | }, 229 | "source": [ 230 | "d = 50 # Dimension of the input space.\n", 231 | "n_train = d*10 # The number of data points for training.\n", 232 | "n_valid = 1000 # The number of data points for validating the learned function.\n", 233 | "\n", 234 | "# Let us start with a parity function supported on one variable only.\n", 235 | "# In Exercise 2 you will be asked to experiment with different supports.\n", 236 | "f_S = ParityFunction(S = [0])\n", 237 | "\n", 238 | "# Generate the covariates for training and validation.\n", 239 | "X_train = generate_uniform_covariates(n=n_train, d=d)\n", 240 | "X_valid = generate_uniform_covariates(n=n_valid, d=d)\n", 241 | "\n", 242 | "# Now generate the labels for training and validation.\n", 243 | "y_train = f_S(X_train)\n", 244 | "y_valid = f_S(X_valid)" 245 | ], 246 | "execution_count": null, 247 | "outputs": [] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": { 252 | "id": "RnTD38FIvoO8" 253 | }, 254 | "source": [ 255 | "In the following cell, we create a fully connected ReLU neural network with one hidden layer of hidden dimension ```d' = hidden_d```. Such a network can be expressed as a mapping\n", 256 | "$$\n", 257 | " x \\mapsto W_2\\operatorname{ReLU}(W_1x + b_1) + b_2,\n", 258 | "$$\n", 259 | "where the first layer weights are $W_1 \\in \\mathbb{R}^{d' \\times d}, b_1 \\in \\mathbb{R}^{d'}$,\n", 260 | "and the second layer weights are $W_2 \\in \\mathbb{R}^{1 \\times d'}$, $b_{2} \\in \\mathbb{R}$. The function $\\operatorname{ReLU}$ applies a component-wise operation $x \\mapsto \\max(0, x)$." 
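To make the displayed mapping concrete, here is a minimal NumPy sketch of the same forward pass; the weights below are random placeholders rather than trained parameters, and the function name `relu_network_forward` is ours.

```python
import numpy as np

def relu_network_forward(x, W1, b1, W2, b2):
    """ Evaluates x -> W2 ReLU(W1 x + b1) + b2 for a single input x in R^d,
    with W1 of shape (d', d), b1 of shape (d',), W2 of shape (1, d') and b2 a scalar. """
    hidden = np.maximum(0.0, W1 @ x + b1)  # Component-wise ReLU.
    return W2 @ hidden + b2

# Illustrative shapes only; the notebook itself uses d and hidden_d instead.
d, hidden_d = 4, 8
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(hidden_d, d)), rng.normal(size=hidden_d)
W2, b2 = rng.normal(size=(1, hidden_d)), rng.normal()
x = rng.choice([-1.0, 1.0], size=d)
print(relu_network_forward(x, W1, b1, W2, b2))  # A single real-valued output.
```

The `tf.keras.Sequential` model constructed in the next code cell represents exactly this class of functions, with the weights learned from data rather than fixed at random.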
261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "metadata": { 266 | "id": "4FfdWlgEuBYY" 267 | }, 268 | "source": [ 269 | "hidden_d = 10*d # The hidden dimension denoted d' in the text cell above.\n", 270 | "model = tf.keras.Sequential([\n", 271 | " tf.keras.layers.InputLayer(input_shape=(X_train.shape[1],)),\n", 272 | " tf.keras.layers.Dense(hidden_d, activation='relu'),\n", 273 | " tf.keras.layers.Dense(1, activation=None),\n", 274 | "])" 275 | ], 276 | "execution_count": null, 277 | "outputs": [] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": { 282 | "id": "3RYPA2O9xFuL" 283 | }, 284 | "source": [ 285 | "Notice that a neural network output is real-valued, while the parity function outputs take values in $\\{-1, +1\\}$. We can thus measure the accuracy of a trained ReLU network by computing the fraction of data points on which the sign of the ReLU network output agrees with the output of the true parity function that generated the data.\n", 286 | "\n", 287 | "At the same time, we need to specify a loss function and an optimization procedure for training the weights of our ReLU network. This is done in the following code cell." 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "metadata": { 293 | "id": "1unVUUMyxF73" 294 | }, 295 | "source": [ 296 | "# Implement a function for tracking accuracy of the learned network.\n", 297 | "def accuracy(y_true, y_pred):\n", 298 | " \"\"\" Given a real-valued prediction vector y_pred, calculate the proportion of\n", 299 | " labels in y_pred whose signs agree with {-1, +1} valued y_true labels vector.\n", 300 | " \"\"\"\n", 301 | " y_pred_sign = tnp.sign(y_pred)\n", 302 | " return tnp.average(y_true == y_pred_sign)\n", 303 | " \n", 304 | " #y_pred_sgn = tf.math.sign(y_pred)\n", 305 | " #return tf.math.equal(y_true, y_pred_sgn)\n", 306 | "\n", 307 | "# Set the quadratic loss function and set the optimizer to gradient descent\n", 308 | "# with step size given by the learning_rate parameter.\n", 309 | "model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),\n", 310 | " loss=tf.keras.losses.MeanSquaredError(),\n", 311 | " metrics=[accuracy])" 312 | ], 313 | "execution_count": null, 314 | "outputs": [] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": { 319 | "id": "Vf4umbMOnYuC" 320 | }, 321 | "source": [ 322 | "The below cell can be executed to print some summary statistics." 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "metadata": { 328 | "id": "kF_U6Hnumz2l" 329 | }, 330 | "source": [ 331 | "# The parameters of the first layer are (W1, b1), hence the number of parameters\n", 332 | "# should be (d * d') + d'.\n", 333 | "# The parameters of the second layer are (W2, b2), hence the number of\n", 334 | "# parameters should be d' + 1.\n", 335 | "model.summary()" 336 | ], 337 | "execution_count": null, 338 | "outputs": [] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": { 343 | "id": "D8pksiFWzwey" 344 | }, 345 | "source": [ 346 | "We are now ready to train our ReLU network." 
347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "metadata": { 352 | "id": "legbxGCQvQ11" 353 | }, 354 | "source": [ 355 | "history = model.fit(\n", 356 | " X_train, y_train, batch_size=n_train, \n", 357 | " validation_data=(X_valid, y_valid), validation_batch_size=n_valid,\n", 358 | " epochs=200,\n", 359 | " verbose=1)" 360 | ], 361 | "execution_count": null, 362 | "outputs": [] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "metadata": { 367 | "id": "iZ0IfSGi00S5" 368 | }, 369 | "source": [ 370 | "# The below code visualizes the evolution of accuracy and loss curves throughout\n", 371 | "# the training process.\n", 372 | "plt.plot(history.history['accuracy'])\n", 373 | "plt.plot(history.history['val_accuracy'])\n", 374 | "plt.legend(['Training Accuracy', 'Validation Accuracy'])\n", 375 | "plt.xlabel('Epochs')\n", 376 | "plt.ylabel('Accuracy')\n", 377 | "plt.title('Accuracy vs Training Time')\n", 378 | "plt.show()\n", 379 | "plt.plot(history.history['loss'])\n", 380 | "plt.plot(history.history['val_loss'])\n", 381 | "plt.legend(['Training Loss', 'Validation Loss'])\n", 382 | "plt.xlabel('Epochs')\n", 383 | "plt.ylabel('Accuracy')\n", 384 | "plt.title('Loss vs Training Time')" 385 | ], 386 | "execution_count": null, 387 | "outputs": [] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": { 392 | "id": "21HEHf2xxyhF" 393 | }, 394 | "source": [ 395 | "### Exercise 2" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": { 401 | "id": "hO-huEGt1g3j" 402 | }, 403 | "source": [ 404 | "In the above simulation we have observed that a two-layer ReLU network trained by gradient descent learns the correct parity function $f_{S}$ when the support set is $S=\\{0\\}$. Repeat the above experiment with parity functions supported on larger sets (e.g., $S=\\{0,1\\}; S=\\{0,1,\\dots,d-1\\}$; $S$ sampled uniformly at random). For what sizes of $S$ do you observe that the problem gets difficult? Explore the above experimental setup by:\n", 405 | "\n", 406 | "- modifying the neural network architecture (e.g., adding more layers, increasing/decreasing the hidden dimension, changing the activation function, etc);\n", 407 | "- modifying the optimization algorithm and its hyper-parameters;\n", 408 | "- modifying the dataset size and input dimension parameters.\n", 409 | "\n", 410 | "You may find it helpful to refer to Keras documentation available in the following link https://www.tensorflow.org/api_docs/python/tf/keras." 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": { 416 | "id": "EYxWXMd8pFOq" 417 | }, 418 | "source": [ 419 | "#### Solution" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": { 425 | "id": "MGWQ9S1TQMqH" 426 | }, 427 | "source": [ 428 | "Exploring the above exercise you should have concluded that for large enough support sets $S$ (e.g., $|S|=20$) learning the true parity function becomes difficult regardless of the neural network architecture, choice of the optimizer, its hyper-parameters and other parameters of the problem.\n" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": { 434 | "id": "I4JnSQM7QEQD" 435 | }, 436 | "source": [ 437 | "## Is Learning Parity Functions Hard?" 
438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": { 443 | "id": "gI27L7697IpS" 444 | }, 445 | "source": [ 446 | "In exercise 2, we have empirically concluded that learning parity functions\n", 447 | "is a difficult learning problem for neural networks trained by gradient descent.\n", 448 | "**At this point, it is not clear if learning parities is possible for any algorithm, from either statistical or computational perspective.** An open possibility remains that the problem of learning parities is difficult for any learning algorithm, and perhaps we have not observed anything special in the failure of neural networks trained by gradient descent. This section investigates whether learning parity functions is hard from a statistical perspective (Exercise 3) and whether it is hard from a computational perspective (Exercise 4)." 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": { 454 | "id": "vzAtuIUV42z4" 455 | }, 456 | "source": [ 457 | "### Exercise 3" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": { 463 | "id": "hnYcLep045YU" 464 | }, 465 | "source": [ 466 | "Argue why from an information-theoretic perspective, the problem of learning parity functions is not difficult. To do so, suggest a procedure that correctly learns the true parity function, disregarding any computational constraints (i.e., your algorithm is allowed to run in exponential time)." 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": { 472 | "id": "JpBPz97Z6jpW" 473 | }, 474 | "source": [ 475 | "#### Hint" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": { 481 | "id": "zSAibJwp6qkL" 482 | }, 483 | "source": [ 484 | "There is a finite number of parity functions and the generated labels contain zero noise." 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "metadata": { 490 | "id": "n0CCdbXP5Dib" 491 | }, 492 | "source": [ 493 | "#### Solution" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": { 499 | "id": "pdXbZG3-5Ffz" 500 | }, 501 | "source": [ 502 | "Let $P$ denote the data generating distribution (i.e., the data points $(x_{i}, y_{i})$ are sampled i.i.d. from $P$). For a function $f : \\mathcal{X} \\to \\{-1, +1\\}$ denote its error as\n", 503 | "$$\n", 504 | " \\operatorname{err}(f) = \\mathbb{P}_{(x,y) \\sim P}\\left( f(x) \\neq y\\right)\n", 505 | "$$\n", 506 | "\n", 507 | "\n", 508 | "For any function $f$, probability that it is consistent with the training data (i.e., correctly labels each of the data points) is at most\n", 509 | "$$\n", 510 | " \\mathbb{P}\\left(f \\text{ is consistent with the training data }(X, y) \\right)\n", 511 | " = (1 - \\operatorname{err}(f))^{n}\n", 512 | " \\leq e^{-n \\operatorname{err}(f)},\n", 513 | "$$\n", 514 | "where $n$ is the number of data points.\n", 515 | "\n", 516 | "\n", 517 | "Let $\\mathcal{A} = \\{ f_{S} : S \\subseteq \\{0, \\dots, d-1\\}\\}$ denote the set of all parity functions. Fix any $\\varepsilon > 0$ and let $\\mathcal{A}_{\\mathrm{bad}} \\subseteq \\mathcal{A}$ denote the subset of parity functions whose error is at least $\\varepsilon$. 
It follows via the union bound that\n", 518 | "$$\n", 519 | " \\mathbb{P}\\left( \\text{exists } f \\in \\mathcal{A}_{\\mathrm{bad}} \\text{ that is consistent with the training data (X, y) }\n", 520 | " \\right)\n", 521 | " \\leq |\\mathcal{A}_{\\mathrm{bad}}|e^{-n\\varepsilon}\n", 522 | " \\leq 2^{d}e^{-n \\varepsilon}.\n", 523 | "$$\n", 524 | "For any $\\delta \\in (0,1)$, the above probability is at most $\\delta$ provided that $n \\geq \\frac{d\\log(2) + \\log(\\delta^{-1})}{\\varepsilon}$.\n", 525 | "\n", 526 | "**In particular, the above argument establishes that for large enough sample sizes N, it is enough to output any parity function that correctly labels the training data. Since there is a finite number of such functions, we can try all of them and return any parity function consistent with the training data. Such a parity function is guaranteed to exist since the training data is labeled by one such function.**" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": { 532 | "id": "3bX_i6vl-ydJ" 533 | }, 534 | "source": [ 535 | "### Exercise 4" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": { 541 | "id": "y8NxU0EWBYKO" 542 | }, 543 | "source": [ 544 | "In Exercise 3, we have shown that the noise-free problem of learning parity functions is easy from an information-theoretic perspective; however, to do so we have provided an algorithm that requires exponential computational resources. This leaves an open possibility that no polynomial time algorithm can identify the correct parity function.\n", 545 | "\n", 546 | "In this exercise, you are asked to find a polynomial-time algorithm that learns the correct parity function, provided that the training data set is large enough (e.g., $n \\geq d$)." 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "metadata": { 552 | "id": "14kGigS_ByQj" 553 | }, 554 | "source": [ 555 | "def learn_the_correct_parity(X, y):\n", 556 | " \"\"\" :X: A matrix in {-1, +1}^{n \\times d} of covariate vectors.\n", 557 | " :y: A vector in {-1, +1}^{n} generated by some parity function f_{S}\n", 558 | " applied to the rows of $X$.\n", 559 | " :returns: A support set $S$ of the parity function f_{S} that generated\n", 560 | " the labels.\n", 561 | " \"\"\"\n", 562 | " ##############################################################################\n", 563 | " # Exercise 4. 
Implement a polynomial time algorithm that learns the true\n", 564 | "  # parity function.\n", 565 | "  \n", 566 | "  ##############################################################################\n", 567 | "\n", 568 | "\n", 569 | "# The below code tests our implementation.\n", 570 | "np.random.seed(0)\n", 571 | "d = 100\n", 572 | "S = np.random.choice(d, size = 10, replace=False)\n", 573 | "X = generate_uniform_covariates(n=d, d=d)\n", 574 | "print(X)\n", 575 | "print()\n", 576 | "y = ParityFunction(S=S)(X)\n", 577 | "print(\"True S =\\n\", np.sort(S))\n", 578 | "computed_S = learn_the_correct_parity(X, y)\n", 579 | "print(\"Computed S =\\n\", computed_S)\n", 580 | "print(\"Consistent output:\", np.allclose(y, ParityFunction(computed_S)(X)))" 581 | ], 582 | "execution_count": null, 583 | "outputs": [] 584 | }, 585 | { 586 | "cell_type": "markdown", 587 | "metadata": { 588 | "id": "uOzCt1SfCV2s" 589 | }, 590 | "source": [ 591 | "#### Hint" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": { 597 | "id": "sl5FUhzwCj-T" 598 | }, 599 | "source": [ 600 | "Relabel $+1 \mapsto 0$ and $-1 \mapsto 1$ and think of parity functions as addition modulo 2 on $\{0, 1\}^{d}$." 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": { 606 | "id": "2jt1ycLE-z8f" 607 | }, 608 | "source": [ 609 | "#### Solution" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": { 615 | "id": "0mkEXpCQUoy8" 616 | }, 617 | "source": [ 618 | "**The key observation is that the problem we are trying to solve is just a linear system of equations over the finite field $\text{GF}(2)$.**\n", 619 | "Hence, we can recover the true parity function by simply solving the linear system (e.g., by performing Gaussian elimination). A sample implementation is provided below." 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "metadata": { 625 | "id": "au71IEiUEVCk" 626 | }, 627 | "source": [ 628 | "```\n", 629 | "  # First convert the data so that the elements of X,y are in {0,1} and we \n", 630 | "  # have y = Xw_{S} (where w_S represents the true parity function) and the\n", 631 | "  # addition is performed modulo 2.\n", 632 | "  X = (X - 1.0)/(-2.0) # -1 becomes 1, 1 becomes 0.\n", 633 | "  y = (y - 1.0)/(-2.0) # same as above.\n", 634 | "  X = X.astype(int)\n", 635 | "  y = y.astype(int)\n", 636 | "  # We will now implement Gaussian elimination to find the solution.\n", 637 | "  d = X.shape[1]\n", 638 | "  y = y.reshape(-1, 1)\n", 639 | "\n", 640 | "\n", 641 | "  def switch_rows(X, y, i, j):\n", 642 | "    \"\"\" Switches the i-th and j-th rows in the matrix X and the vector y (viewed\n", 643 | "    as an (n \times 1) matrix).
\"\"\"\n", 644 | " X[[i,j],:] = X[[j,i],:]\n", 645 | " y[[i,j],0] = y[[j,i],0]\n", 646 | "\n", 647 | " for i in range(d):\n", 648 | " # Switch some rows to make X[i,i]=1\n", 649 | " if np.alltrue(X[i:, i] == 0):\n", 650 | " # Skip this column, as all elements are equal to 0 below the i-th row.\n", 651 | " continue\n", 652 | " non_zero_idx = np.argmin(X[i:,i] == 0)+i\n", 653 | " switch_rows(X, y, i, non_zero_idx)\n", 654 | " # Right now we have X[i,i] = 1.0\n", 655 | " # Perform all the row reduction operations for the i-th column, to make\n", 656 | " # X[j,i] = 0 for all j not equal to i.\n", 657 | " mask = X[:,i].copy().reshape(-1, 1)\n", 658 | " mask[i,0] = 0 # Do not add the i-th row to itselt.\n", 659 | " X += mask @ X[i,:].reshape(1, -1) # Perform the row operations on X\n", 660 | " X %= 2\n", 661 | " y += y[i,0] * mask # Perform the row operations on y\n", 662 | " y %= 2\n", 663 | " \n", 664 | " S = []\n", 665 | " for i in range(d):\n", 666 | " if X[i,i] == 1 and y[i,0] == 1:\n", 667 | " S.append(i)\n", 668 | " return np.array(S)\n", 669 | "```" 670 | ] 671 | }, 672 | { 673 | "cell_type": "markdown", 674 | "metadata": { 675 | "id": "1oPnWb_tQcmT" 676 | }, 677 | "source": [ 678 | "## Understanding Why Neural Networks Fail" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": { 684 | "id": "wMNcZhiXj-yq" 685 | }, 686 | "source": [ 687 | "One possible reason for failure to learn the correct parity function in our neural network experiments is that our chosen architecture was not expressive enough to contain the target parity functions. The below exercise asks us to rule out this possibility." 688 | ] 689 | }, 690 | { 691 | "cell_type": "markdown", 692 | "metadata": { 693 | "id": "ZqjBzsmbkTXO" 694 | }, 695 | "source": [ 696 | "### Exercise 5\n" 697 | ] 698 | }, 699 | { 700 | "cell_type": "markdown", 701 | "metadata": { 702 | "id": "Dyy8v8MUblRR" 703 | }, 704 | "source": [ 705 | "Complete the below code cell, which asks to find a configuration of ReLU network weights that realize *exactly* a given parity function $f_{S}$." 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "metadata": { 711 | "id": "dT9wt20Fksgi" 712 | }, 713 | "source": [ 714 | "# Fill in the missing code in the following function.\n", 715 | "def get_parity_relu_network(d, S):\n", 716 | " \"\"\"\n", 717 | " :d: The input dimension.\n", 718 | " :S: An array specifying the indexes in {0, 1, ..., d-1} of the parity function\n", 719 | " to be implemented by this function.\n", 720 | " \n", 721 | " :returns: A relu network with one hidden layer that implements the parity\n", 722 | " function indexed by S.\n", 723 | " \"\"\"\n", 724 | "\n", 725 | " ##############################################################################\n", 726 | " # Exercise 5.1. 
Set the hidden layer dimension required by your construction.\n", 727 | "\n", 728 | " ##############################################################################\n", 729 | "\n", 730 | " # Create a relu network with the specified dimensions.\n", 731 | " model = tf.keras.Sequential([\n", 732 | " tf.keras.layers.InputLayer(input_shape=(d,)),\n", 733 | " tf.keras.layers.Dense(hidden_d, activation='relu'),\n", 734 | " tf.keras.layers.Dense(1, activation=None),\n", 735 | " ])\n", 736 | "\n", 737 | " # Convert model parameters to numpy arrays.\n", 738 | " W1, b1 = model.layers[0].get_weights()\n", 739 | " W1 = np.array(W1).transpose()\n", 740 | " b1 = np.array(b1)\n", 741 | " W2, b2 = model.layers[1].get_weights()\n", 742 | " W2 = np.array(W2).transpose()\n", 743 | " b2 = np.array(b2)\n", 744 | "\n", 745 | " # Recall that the above relu network implements the function\n", 746 | " # x \\in \\R^{d} --> (W2) relu((W1) x + b1) + b2.\n", 747 | " \n", 748 | " ##############################################################################\n", 749 | " # Exercise 5.2. Your code for setting W1, b1, W2, b2 goes below.\n", 750 | " \n", 751 | " ##############################################################################\n", 752 | "\n", 753 | " # Set the model parameters with the above implemented construction.\n", 754 | " model.layers[0].set_weights([W1.transpose(), b1])\n", 755 | " model.layers[1].set_weights([W2.transpose(), b2])\n", 756 | "\n", 757 | " return model\n", 758 | "\n", 759 | "\n", 760 | "# The below code verifies if the implemented \"get_parity_relu_network\" function\n", 761 | "# agrees with the data generated via the ParitiesData class implemented earlier.\n", 762 | "d = 100\n", 763 | "# Sample a random support set S for the parity function f_S.\n", 764 | "coin_flips = np.random.binomial(n=1, p=0.5, size=d)\n", 765 | "S = np.arange(d)[coin_flips >= 1]\n", 766 | "f_S = ParityFunction(S=S)\n", 767 | "X = generate_uniform_covariates(n=2*d, d=d)\n", 768 | "y = f_S(X)\n", 769 | "\n", 770 | "parity_network = get_parity_relu_network(d, S)\n", 771 | "y_pred = np.array(parity_network(X)).flatten()\n", 772 | "\n", 773 | "# Check if the y_pred vector equals element-wise to the vector y.\n", 774 | "assert (y_pred == y).all()\n", 775 | "print(\"Assertion passed.\")" 776 | ], 777 | "execution_count": null, 778 | "outputs": [] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": { 783 | "id": "YKpn7FcNkZTZ" 784 | }, 785 | "source": [ 786 | "#### Solution" 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": { 792 | "id": "JxA1usBrYSt8" 793 | }, 794 | "source": [ 795 | "It is enough to set the hidden unit dimension to |S| + 1.\n", 796 | "```\n", 797 | "# Exercise 5.1.\n", 798 | "\n", 799 | "S = np.array(S)\n", 800 | "hidden_d = len(S)+1 # The hidden layer dimension of your construction. \n", 801 | "```\n", 802 | "Without loss of generality, assume that $S = \\{0, 1, \\dots, d-1\\}$, since otherwise the coordinates outside of the support set $S$ can be ignored. Suppose that an input vector $x \\in \\{-1, +1\\}^{d}$ contains $u$ entries equal to $+1$ and $v$ entries equal to $-1$. Hence, we have $u + v = d$.\n", 803 | "The output of the parity function $f_{S}$ is then equal to $+1$ if $v$ is even and $-1$ if $v$ is odd. Letting $\\mathbf{1} \\in \\mathbb{R}^{d}$ denote the vector of ones, we have $\\langle \\mathbf{1}, x \\rangle = u - v$. 
Therefore, we also have $\langle \mathbf{1}, x \rangle = d - 2v$.\n", 804 | "\n", 805 | "**Setting the weights of the first layer.**\n", 806 | "\n", 807 | "Notice that if we set the incoming weights of the first $d$ hidden units (i.e., the first $d$ rows of the matrix $W_1$) to $-\frac{1}{2} \mathbf{1}$, and if we set the first $d$ bias terms to $b_{i} = \frac{d}{2} - i$ (with indexing starting from $0$), then the value computed by the $i$-th hidden unit is equal to $(W_{1}x)_{i} + b_{i} = v - i$. Finally, set the last, $(d+1)$-th hidden unit (i.e., the last row of the matrix $W_1$) to $-\frac{1}{4}\mathbf{1}$ and set the last bias term to $\frac{d}{4}$. This results in the $(d+1)$-th activation being equal to $\frac{v}{2}$.\n", 808 | "\n", 809 | "**Setting the weights of the second layer.**\n", 810 | "The input received by the second layer is equal to\n", 811 | "$$\n", 812 | " \left(\text{ReLU}(v), \text{ReLU}(v-1), \text{ReLU}(v-2), \dots, \text{ReLU}(v-d+1), \frac{v}{2}\right)^{\mathsf{T}}.\n", 813 | "$$\n", 814 | "Setting the bias term $b_{2} = 1$, the weights on the first $d$ coordinates to the alternating pattern $-4, +4, -4, +4, \dots$, and the weight on the last coordinate (the one receiving $\frac{v}{2}$) to $+4$ realizes the desired parity function.\n", 815 | "\n", 816 | "\n", 817 | "A sample implementation of the above-described construction is provided below.\n", 818 | "```\n", 819 | "  # Exercise 5.2.\n", 820 | "\n", 821 | "  W1 *= 0\n", 822 | "  b1 *= 0\n", 823 | "  W2 *= 0\n", 824 | "  b2 *= 0\n", 825 | "  \n", 826 | "  # Set the weights of the first layer.\n", 827 | "  W1[:-1, S] = -1/2\n", 828 | "  b1[:-1] = len(S)/2 - np.arange(len(S))\n", 829 | "  W1[-1, S] = -1/4\n", 830 | "  b1[-1] = len(S)/4\n", 831 | "\n", 832 | "  # We now turn to the weights of the second layer.\n", 833 | "  W2[:] = 1\n", 834 | "  W2[0, ::2] = -1\n", 835 | "  W2[0, -1] = 1\n", 836 | "  W2 *= 4.\n", 837 | "  b2[:] = 1\n", 838 | "```" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": { 844 | "id": "S1qLY9HGHXqL" 845 | }, 846 | "source": [ 847 | "### Intuitive Reasons for Failure to Learn the True Parity" 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "metadata": { 853 | "id": "QdRQ2zwvc98O" 854 | }, 855 | "source": [ 856 | "Let us summarize what we have discussed up to this point:\n", 857 | "- we found it challenging to learn the true parity function by training various architectures of neural networks;\n", 858 | "- we understood that the problem of learning parity functions is not hard from the statistical perspective;\n", 859 | "- we also understood that the problem is not difficult from a computational perspective;\n", 860 | "- we have also found that ReLU neural networks are expressive enough to realize parity functions with reasonably small weights.\n", 861 | "\n", 862 | "So why do gradient methods fail to find the correct configuration of model weights? **It turns out that this problem has nothing to do with neural networks; rather, it is known to be difficult (from a computational perspective) for a large class of algorithms known as statistical query algorithms.** Such algorithms are, informally, procedures that do not inspect the individual data points but are instead based on some aggregate statistics of the observed data. Gradient descent can be shown to fall into this category (see the bibliographic remarks section)." 863 | ] 864 | }, 865 | { 866 | "cell_type": "markdown", 867 | "metadata": { 868 | "id": "cDvoODkOe9GA" 869 | }, 870 | "source": [ 871 | "Let us now sketch the argument showing why gradient methods find it challenging to solve the parities problem.
Let $\\mathcal{A} = \\{ a_{w} : \\{0,1\\}^{d} \\to \\mathbb{R} \\mid w \\in \\mathbb{R}^{m}\\}$ denote some class of parametric functions (such as neural networks considered above). Let $P_{S}$ denote the data generating distribution assuming that the true parity function is given by $f_{S}$ and as before, the covariates are sampled uniformly from the boolean hypercube. Let\n", 872 | "$$\n", 873 | " r_{S}(a_w) = \\mathbf{E}_{(X,Y) \\sim P_{S}}[(a_w(X) - Y)^{2}]\n", 874 | " = \\mathbf{E}_{(X,Y) \\sim P_{S}}[(a_w(X) - f_{S}(X))^{2}].\n", 875 | "$$\n", 876 | "Let $R_{S}$ denote the empirical risk function for a dataset of size $n$ sampled from the distribution $P_S$.\n", 877 | "When the dataset size $n$ is large enough, for any fixed point $w \\in \\mathbb{R}^{m}$ we have $\\nabla_{w} R_{S}(a_w) \\approx \\nabla_{w} r_{S}(a_w)$. **We will now attempt to quantify the amount of information contained in the true gradient $\\nabla_{w} r_{S}(a_w)$.**\n", 878 | "In order to do that, we will try to upper bound the variance of the gradient at an arbitrary point $w$, with the randomness coming from sampling $S$ uniformly over all possible subsets of $\\{0, 1, \\dots, d-1\\}$:\n", 879 | "$$\n", 880 | " \\mathrm{Var}(\\nabla r, w)\n", 881 | " = \\mathbf{E}_{S \\sim \\text{Uniform}} \\| \\nabla_{w} r_{S}(w) - \\mathbf{E}_{S' \\sim \\text{Uniform}}[\\nabla_{w} r_{S'}(w)]\\|_{2}^{2}.\n", 882 | "$$\n", 883 | "**If the above quantity is small, it means that the gradient\n", 884 | "$\\nabla_{w} r_{S}(w)$ does not depend too much on $S$, and hence the gradient does not depend on the signal.** We will now proceed to show that the below quantity is indeed exponentially small in the dimension $d$." 885 | ] 886 | }, 887 | { 888 | "cell_type": "markdown", 889 | "metadata": { 890 | "id": "cXAS3WXgpOpj" 891 | }, 892 | "source": [ 893 | "First, notice that\n", 894 | "$$\n", 895 | " \\nabla_{w}r_{S}(w) = \\mathbf{E}_{(X,Y)\\sim P_{S}}[\\nabla_{w} (a_w(X) - f_{S}(X))]\n", 896 | " = \\mathbf{E}_{X}[2\\nabla_{w}a_{w}(X)(a_{w}(X) - f_{S}(X))].\n", 897 | "$$\n", 898 | "We hence have\n", 899 | "\\begin{align*}\n", 900 | " \\frac{1}{4}\\mathrm{Var}(\\nabla r, w)\n", 901 | " &= \\frac{1}{4}\\mathbf{E}_{S} \\| \\nabla_{w} r_{S}(w) - \\mathbf{E}_{S'}[\n", 902 | " \\nabla_{w} r_{S'}(w)]\\|_{2}^{2}\n", 903 | " \\\\\n", 904 | " &\\leq \\frac{1}{4}\\mathbf{E}_{S} \\| \\nabla_{w} r_{S}(w) -\n", 905 | " 2\\mathbf{E}_{X}[\\nabla_{w}a_{w}(X)a_{w}(X)] \\|_{2}^{2}\n", 906 | " \\\\\n", 907 | " &= \\mathbf{E}_{S} \\|\\, \\mathbf{E}_{X}[\\nabla_{w}a_{w}(X)f_{S}(X)]\\, \\|_{2}^{2}.\n", 908 | "\\end{align*}\n", 909 | "To simplify the notation, let us now write $g(X) = \\nabla_{w}a_{w}(X)$.\n", 910 | "Further, let $\\langle f, g \\rangle_{P_{X}} = \\mathbf{E}_{X}[f(X)g(X)]$, where $X$ is distributed uniformly over the boolean hypercube. 
Using the fact that the $2^{d}$ parity functions form a basis over the Hilbert space $L_{2}(P_{X})$ of real-valued functions with the domain $\\{-1,1\\}^{d}$, we have\n", 911 | "\\begin{align*}\n", 912 | " \\mathbf{E}_{S} \\sum_{i=1}^{m}\\left(\n", 913 | " \\mathbf{E}_{X}[g_{i}(X)f_{S}(X)]\n", 914 | " \\right)^{2} \n", 915 | " &= \\mathbf{E}_{S} \\sum_{i=1}^{m}\\langle g_{i}, f_{S} \\rangle^{2}_{P_{X}} \\\\\n", 916 | " &= \\frac{1}{2^{d}} \\sum_{S}\\sum_{i=1}^{m}\\langle g_{i}, f_{S} \\rangle^{2}_{P_{X}} \\\\\n", 917 | " &= \\frac{1}{2^{d}} \\sum_{i=1}^{m} \\sum_{S} \\langle g_{i}, f_{S} \\rangle^{2}_{P_{X}} \\\\\n", 918 | " &= \\frac{1}{2^{d}} \\sum_{i=1}^{m} \\|g_{i}\\|_{P_{X}}^{2} \\\\\n", 919 | " &= \\frac{1}{2^{d}} \\mathbf{E}_{X}[\\|g(X)\\|_{2}^{2}].\n", 920 | "\\end{align*}\n", 921 | "To sum up the above derivations, we have shown that\n", 922 | "$$\n", 923 | " \\mathrm{Var}(\\nabla r, w) \\leq \\frac{\\mathbf{E}_{X}[\\|g(X)\\|_{2}^{2}]}\n", 924 | " {2^{d-2}}.\n", 925 | "$$\n", 926 | "**In particular, the variance of the gradient with respect to uniform draw of a target parity function is exponentially small in the dimension $d$. This establishes that gradients for the parity problem are strongly concentrated in directions independent of the true signal.**" 927 | ] 928 | }, 929 | { 930 | "cell_type": "markdown", 931 | "metadata": { 932 | "id": "8wdK4ezsQhKH" 933 | }, 934 | "source": [ 935 | "## Bibliographic Remarks" 936 | ] 937 | }, 938 | { 939 | "cell_type": "markdown", 940 | "metadata": { 941 | "id": "SHXqhJUSQpgX" 942 | }, 943 | "source": [ 944 | "The study of learning theory from the computational complexity theory point of view was pioneered by *Valiant [1984]*, where the so-called PAC learning framework was introduced. Various tweaks of the PAC learning models were subsequently studied. One such model motivated by the need to develop noise-tolerant learning algorithms, called the statistical\n", 945 | "query model, was introduced by *Kearns [1998]*, where it was also shown that the problem of learning parity functions cannot be solved using a polynomial number of queries; see also the textbook by *Kearns, Vazirani, and Vazirani [1994]*. Whether there exists an algorithm that can learn the underlying parity function in the presence of label noise is a long-standing\n", 946 | "open problem; see *Blum, Kalai, and Wasserman [2003]*. This practical session is based on Section 2 of *Shalev-Shwartz, Shamir, and Shammah [2017]*, where other limitations of gradient-based learning are also presented. Most of the learning algorithms used in practice can be implemented using statistical queries. For further background reading, see the discussions and references in the papers by *Feldman, Grigorescu, Reyzin, Vempala, and Xiao [2017]*, *Feldman [2017]*." 947 | ] 948 | }, 949 | { 950 | "cell_type": "markdown", 951 | "metadata": { 952 | "id": "H6WgMbjIEiLD" 953 | }, 954 | "source": [ 955 | "**References**\n", 956 | "\n", 957 | "A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM), 50(4):506–519, 2003.\n", 958 | "\n", 959 | "V. Feldman. A general characterization of the statistical query complexity. In Conference on Learning Theory, pages 785–830. PMLR, 2017.\n", 960 | "\n", 961 | "V. Feldman, E. Grigorescu, L. Reyzin, S. S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2): 1–37, 2017.\n", 962 | "\n", 963 | "M. Kearns. 
Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.\n", 964 | "\n", 965 | "M. J. Kearns, U. V. Vazirani, and U. Vazirani. An introduction to computational learning theory. MIT press, 1994.\n", 966 | "\n", 967 | "S. Shalev-Shwartz, O. Shamir, and S. Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pages 3067–3075. PMLR, 2017.\n", 968 | "\n", 969 | "L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984." 970 | ] 971 | } 972 | ] 973 | } -------------------------------------------------------------------------------- /practicals/restricted_eigenvalue.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Copy of lasso_and_restricted_eigenvalue.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "language_info": { 16 | "name": "python" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "emglUHu3XTEg" 24 | }, 25 | "source": [ 26 | "# Restricted Eigenvalue Condition for the Lasso\n", 27 | "\n", 28 | " In this practical session, we introduce the restricted eigenvalue condition and its role in sparse linear prediction and estimation. Our main objectives are the following:\n", 29 | "\n", 30 | "- revisiting the basic inequality proof technique assuming the restricted eigenvalue condition;\n", 31 | "- implementing the proximal gradient method for the lasso objective and understanding how the restricted eigenvalue constant affects the convergence speed;\n", 32 | "- understanding why the lasso prediction performance can be sensitive to the restricted eigenvalue constant;\n", 33 | "- introducing an open problem regarding the existence of a computational-statistical gap for the problem of optimal in-sample sparse linear prediction.\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "id": "sh7NNRLRbaSj" 40 | }, 41 | "source": [ 42 | "We will consider the design matrix $X \\in \\mathbb{R}^{n \\times d}$ to be fixed and the observations generated via the following model:\n", 43 | "$$\n", 44 | " y = Xw^{\\star} + \\xi\\text{, where }w^{\\star}\\in\\mathbb{R}^{d}\\text{ is a }k\\text{-sparse target vector and }\\xi \\sim N(0, \\sigma^{2}I_{n \\times n})\\text{ is the noise random variable.}\n", 45 | "$$\n", 46 | "Let $S \\subseteq \\{1, \\dots, d\\}$ denote the support set of non-zero coordinates of the signal vector $w^{\\star}$ (thus, $|S| = k$); we also denote $S^{c} = \\{1, \\dots, d\\} \\backslash S$.\n", 47 | "\n", 48 | "Define the cone\n", 49 | "$$\n", 50 | " \\mathcal{C}(S) = \\{ \\Delta \\in \\mathbb{R}^{d} : \\|\\Delta_{S^{c}}\\|_{1} \\leq 3\\|\\Delta_{S}\\|_{1}\\}.\n", 51 | "$$\n", 52 | "Notice that the above cone is almost the same as the cone used in the definition of the restricted nullspace property (cf. the compressed sensing notebook). The only difference is that above we have a factor $3$ instead of $1$. Why this is the case will become clear shortly, when we obtain performance bounds for the lasso. 
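As a quick illustration of the cone $\mathcal{C}(S)$, the following minimal sketch (it assumes `numpy`; the helpers `in_cone` and `random_cone_direction` are ours and not part of the notebook) generates directions lying in $\mathcal{C}(S)$ and evaluates the ratio $\frac{1}{n}\|X\Delta\|_{2}^{2} / \|\Delta\|_{2}^{2}$ that the restricted eigenvalue condition defined next bounds from below. Note that a random search of this kind only gives an optimistic (upper) estimate of the smallest ratio over the whole cone.
```
import numpy as np

def in_cone(delta, S, factor=3.0):
    # Checks the cone condition ||delta_{S^c}||_1 <= factor * ||delta_S||_1.
    mask = np.zeros(delta.shape[0], dtype=bool)
    mask[S] = True
    return np.abs(delta[~mask]).sum() <= factor * np.abs(delta[mask]).sum()

def random_cone_direction(d, S, rng, factor=3.0):
    # Draws a Gaussian direction and shrinks its off-support part into C(S).
    delta = rng.standard_normal(d)
    mask = np.zeros(d, dtype=bool)
    mask[S] = True
    on, off = np.abs(delta[mask]).sum(), np.abs(delta[~mask]).sum()
    if off > factor * on:
        delta[~mask] *= factor * on / off
    return delta

rng = np.random.default_rng(0)
n, d, S = 100, 200, np.arange(5)
X = rng.standard_normal((n, d))
ratios = []
for _ in range(1000):
    delta = random_cone_direction(d, S, rng)
    assert in_cone(delta, S)
    ratios.append(np.sum((X @ delta) ** 2) / (n * np.sum(delta ** 2)))
print(min(ratios))  # An optimistic estimate of the restricted eigenvalue constant.
```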
Let us now define the restricted eigenvalue condition.\n", 53 | "\n", 54 | "---\n", 55 | "\n", 56 | "**Restricted Eigenvalue Condition**\n", 57 | "\n", 58 | "\n", 59 | "A matrix $X \in \mathbb{R}^{n \times d}$ satisfies the $\gamma$-restricted eigenvalue condition with respect to the support set $S \subseteq \{1, \dots, d\}$ if \n", 60 | "$$\n", 61 | " \text{for any } \Delta \in \mathcal{C}(S)\text{ we have }\n", 62 | " \frac{1}{n}\|X\Delta\|_{2}^{2} \geq \gamma \|\Delta\|_{2}^{2}.\n", 63 | "$$\n", 64 | "\n", 65 | "---\n", 66 | "\n", 67 | "**We remark that the restricted eigenvalue condition is significantly weaker than the restricted isometry assumption.**\n", 68 | "Obtaining the latter requires sampling the rows of $X$ from isotropic distributions. On the other hand, the restricted eigenvalue condition can be satisfied when the rows are sampled from distributions with ill-conditioned covariance structures; refer to the bibliographic remarks section for further details.\n" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "id": "Ebkp92mPlILV" 75 | }, 76 | "source": [ 77 | "Denote the empirical risk functional by $R(w) = \frac{1}{n}\|Xw-y\|_{2}^{2}$ and let the lasso objective be denoted by $R_{\lambda}(w) = R(w) + \lambda\|w\|_{1}$. Let $w_{\lambda}$ denote any solution to the lasso objective:\n", 78 | "$$\n", 79 | " w_{\lambda} \in \text{argmin}_{w \in \mathbb{R}^{d}} R_{\lambda}(w).\n", 80 | "$$\n", 81 | "Let us now see how the restricted eigenvalue condition allows us to obtain *fast rate* convergence bounds on:\n", 82 | "1. the *estimation error* defined as $\|w_{\lambda} - w^{\star}\|_{2}^{2}$;\n", 83 | "2. the *in-sample prediction error* defined as $\frac{1}{n} \|Xw_{\lambda} - Xw^{\star}\|_{2}^{2}$.\n", 84 | "\n", 85 | "**As it turns out, obtaining upper bounds on both performance measures defined above can be done in just a few lines of analysis via the basic-inequality proof technique.**\n", 86 | "\n", 87 | "---\n", 88 | "\n", 89 | "**Obtaining Performance Bounds via the Basic-Inequality Proof Technique**\n", 90 | "\n", 91 | "In what follows, we work under the following two conditions:\n", 92 | "- we suppose that our fixed design matrix $X$ satisfies the $\gamma$-restricted eigenvalue condition with respect to the support set $S$ of the true parameter $w^{\star}$;\n", 93 | "- we suppose that the parameter $\lambda$ defining the lasso objective satisfies $\lambda \geq 4\|\frac{1}{n}X^{\mathsf{T}}\xi\|_{\infty}$.\n", 94 | "\n", 95 | "The idea of the basic-inequality proof technique is to combine the facts that $w_{\lambda}$ is an *optimal* point for the objective $R_{\lambda}$, while $w^{\star}$ is a *feasible* point. Hence, we have\n", 96 | "$$\n", 97 | " R_{\lambda}(w_{\lambda}) \leq R_{\lambda}(w^{\star}).\n", 98 | "$$\n", 99 | "Denote $\Delta = w^{\star} - w_{\lambda}$.
Then, rearranging the above inequality yields\n", 100 | "\\begin{align*}\n", 101 | " \\underbrace{\\frac{1}{n}\\|X\\Delta\\|_{2}^{2}}_{\\text{in-sample prediction error}}\n", 102 | " &\\leq\n", 103 | " 2\\langle \\Delta, \\frac{1}{n}X^{\\mathsf{T}}\\xi\\rangle\n", 104 | " + \\lambda(\\|w^{\\star}\\|_{1} - \\|w_{\\lambda}\\|_{1}) \\\\\n", 105 | " &\\leq\n", 106 | " 2\\|\\Delta\\|_{1}\\|\\frac{1}{n}X^{\\mathsf{T}}\\xi\\|_{\\infty} \n", 107 | " + \\lambda(\\|w^{\\star}\\|_{1} - \\|w_{\\lambda}\\|_{1}) \\\\\n", 108 | " &=\n", 109 | " 2\\|\\Delta\\|_{1}\\|\\frac{1}{n}X^{\\mathsf{T}}\\xi\\|_{\\infty} \n", 110 | " + \\lambda(\\|\\Delta_{S} + (w_{\\lambda})_{S}\\|_{1} - \\|w_{\\lambda}\\|_{1}) \\\\\n", 111 | " &=\n", 112 | " 2\\|\\Delta\\|_{1}\\|\\frac{1}{n}X^{\\mathsf{T}}\\xi\\|_{\\infty} \n", 113 | " + \\lambda(\\|\\Delta_{S} + (w_{\\lambda})_{S}\\|_{1} - \\|(w_{\\lambda})_{S}\\|_{1} - \\|\\Delta_{S^{c}}\\|_{1}) \\\\\n", 114 | " &\\leq\n", 115 | " 2\\|\\Delta\\|_{1}\\|\\frac{1}{n}X^{\\mathsf{T}}\\xi\\|_{\\infty} \n", 116 | " + \\lambda(\\|\\Delta_{S}\\|_{1} - \\|\\Delta_{S^{c}}\\|_{1}) \\\\\n", 117 | " &\\leq\n", 118 | " \\frac{\\lambda}{2}\\|\\Delta\\|_{1}\n", 119 | " + \\lambda(\\|\\Delta_{S}\\|_{1} - \\|\\Delta_{S^{c}}\\|_{1}) \\\\\n", 120 | " &=\n", 121 | " \\frac{\\lambda}{2}(3\\|\\Delta_{S}\\|_{1} - \\|\\Delta_{S^{c}}\\|_{1}).\n", 122 | "\\end{align*}\n", 123 | "Since the left-hand side is non-negative, the above inequality implies that $\\Delta \\in \\mathcal{C}(S)$; thus, we may apply the $\\gamma$-restricted eigenvalue condition.\n", 124 | "\n", 125 | "We can thus obtain an upper bound on the estimation error as follows:\n", 126 | "\\begin{align*}\n", 127 | " &\\gamma \\|\\Delta\\|_{2}^{2}\n", 128 | " \\leq \\frac{1}{n}\\|X\\Delta\\|_{2}^{2}\n", 129 | " \\leq \\frac{\\lambda}{2}\\cdot 3\\sqrt{k}\\|\\Delta\\|_{2} \\\\\n", 130 | " \\implies&\n", 131 | " \\|\\Delta\\|_{2}^{2} \\leq \\frac{1}{\\gamma^{2}}\\frac{9}{4} k \\lambda^{2}.\n", 132 | "\\end{align*}\n", 133 | "\n", 134 | "Similarly, an upper bound on the in-sample prediction error can be obtained by noting that\n", 135 | "\\begin{align*}\n", 136 | " &\\frac{1}{n}\\|X\\Delta\\|_{2}^{2}\n", 137 | " \\leq \n", 138 | " \\frac{\\lambda}{2}\\cdot 3\\sqrt{k}\\frac{1}{\\sqrt{\\gamma}}\\cdot \\sqrt{\\gamma}\\|\\Delta\\|_{2}\n", 139 | " \\leq\n", 140 | " \\frac{\\lambda}{2}\\cdot 3\\sqrt{k}\\frac{1}{\\sqrt{\\gamma}}\\cdot \\frac{1}{\\sqrt{n}}\\|X\\Delta\\|_{2}\n", 141 | " \\\\\n", 142 | " \\implies&\n", 143 | " \\frac{1}{n}\\|X\\Delta\\|_{2}^{2} \\leq \\frac{1}{\\gamma}\\frac{9}{4}k \\lambda^{2}.\n", 144 | "\\end{align*}\n", 145 | "\n", 146 | "Finally, regarding the choice of the parameter $\\lambda$, recall that we have imposed a condition $\\lambda \\geq 4\\|\\frac{1}{n}X^{\\mathsf{T}}\\xi\\|_{\\infty}$.\n", 147 | "*Suppose that the $\\ell_{2}$ norms of the columns of $\\frac{1}{\\sqrt{n}}X$ are at most equal to $1$*. 
The right-hand side of the previous equation can then be shown to be at most $8\\sigma\\sqrt{\\log d}/\\sqrt{n}$ with overwhelming probability and thus we will perform our simulations with the choice:\n", 148 | "$$\n", 149 | " \\lambda = \\frac{8\\sigma\\sqrt{\\log d}}{\\sqrt{n}}.\n", 150 | "$$\n", 151 | "**Note that the above choice of $\\lambda$ gives the \"fast rates\" for our above upper bounds on the estimation and prediction errors.**\n", 152 | "\n", 153 | "---" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": { 159 | "id": "dWuoJCKJXdn5" 160 | }, 161 | "source": [ 162 | "The rest of this notebook is organized as follows:\n", 163 | "- in the next section, we import some of the code developed in the \"optimization\" notebook;\n", 164 | "- we then implement a data generating distribution (sampling from the spiked identity model) that violates the restricted isometry property but satisfies the restricted eigenvalue property;\n", 165 | "- in the \"Computational Considerations\" section, we introduce an exercise that asks us to explore the convergence properties of proximal gradient descent when applied to a lasso objective satisfying restricted eigenvalue condition.\n", 166 | "- in the \"Statistical Considerations\" section, we introduce an exercise that asks us to find a design matrix for which the lasso incurs a sub-optimal convergence rate for the in-sample prediction error rate." 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "id": "ZcofGQEGSz74" 173 | }, 174 | "source": [ 175 | "## Reusing Code From the \"Optimization\" Practical Session" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": { 181 | "id": "uSMFXaALezsz" 182 | }, 183 | "source": [ 184 | "We import the abstract `Optimizer` class and the implementation of `GradientDescent` subclass." 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "metadata": { 190 | "id": "C-_3MDVzZiMY" 191 | }, 192 | "source": [ 193 | "import numpy as np\n", 194 | "from matplotlib import pyplot as plt\n", 195 | "import tensorflow as tf\n", 196 | "import tensorflow.experimental.numpy as tnp\n", 197 | "tnp.experimental_enable_numpy_behavior()" 198 | ], 199 | "execution_count": null, 200 | "outputs": [] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "metadata": { 205 | "id": "VIwW-uDaSx8R" 206 | }, 207 | "source": [ 208 | "class Optimizer(object):\n", 209 | " \"\"\" A base class for optimizers. \"\"\"\n", 210 | "\n", 211 | " def __init__(self, eta):\n", 212 | " \"\"\" :eta_t: A function taking as argument the current iteration t and\n", 213 | " returning the step size eta_t to be used in the current iteration. 
\"\"\"\n", 214 | " super().__init__()\n", 215 | " self.eta = eta\n", 216 | " self.t = 0 # Set iterations counter.\n", 217 | "\n", 218 | " def apply_gradient(self, x_t, g_t):\n", 219 | " \"\"\" Given the current iterate x_t and gradient g_t, updates the value\n", 220 | " of x_t to x_(t+1) by performing one iterative update.\n", 221 | " :x_t: A tf.Variable which value is to be updated.\n", 222 | " :g_t: The gradient value, to be used for performing the update.\n", 223 | " \"\"\"\n", 224 | " raise NotImplementedError(\"To be implemented by subclasses.\")\n", 225 | "\n", 226 | " def step(self, f, x_t):\n", 227 | " \"\"\" Updates the variable x_t by performing one optimization iteration.\n", 228 | " :f: A function which is being minimized.\n", 229 | " :x_t: A tf.Variable with respect to which the function is being\n", 230 | " minimized and which value is to be updated\n", 231 | ".\n", 232 | " \"\"\"\n", 233 | " with tf.GradientTape() as tape:\n", 234 | " fx = f(x_t)\n", 235 | " g_t = tape.gradient(fx, x_t)\n", 236 | " self.apply_gradient(x_t, g_t)\n", 237 | " # Update the iterations counter.\n", 238 | " self.t += 1\n", 239 | "\n", 240 | " def optimize(self, f, x_t, n_iterations):\n", 241 | " \"\"\" Applies the function step n_iterations number of times, starting from\n", 242 | " the iterate x_t. Note: the number of iterations member self.t is not\n", 243 | " restarted to 0, which may affects the computed step sizes. \n", 244 | " :f: Function to optimize.\n", 245 | " :x_t: Current iterate x_t.\n", 246 | " :returns: A list of length n_iterations+1, containing the iterates\n", 247 | " [x_t, x_{t+1}, ..., x_{t+n_iterations}].\n", 248 | " \"\"\"\n", 249 | " x = tf.Variable(x_t)\n", 250 | " iterates = []\n", 251 | " iterates.append(x.numpy().reshape(-1,1))\n", 252 | " for _ in range(n_iterations):\n", 253 | " self.step(f, x)\n", 254 | " iterates.append(x.numpy().reshape(-1,1))\n", 255 | " return iterates" 256 | ], 257 | "execution_count": null, 258 | "outputs": [] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "metadata": { 263 | "id": "iOnOpK4AUBAt" 264 | }, 265 | "source": [ 266 | "class GradientDescent(Optimizer):\n", 267 | " \"\"\" An implementation of gradient descent uptades. \"\"\"\n", 268 | "\n", 269 | " def apply_gradient(self, x_t, g_t):\n", 270 | " eta_t = self.eta(self.t)\n", 271 | " x_t.assign_add(-eta_t * g_t)" 272 | ], 273 | "execution_count": null, 274 | "outputs": [] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": { 279 | "id": "P-WFT2FkGG11" 280 | }, 281 | "source": [ 282 | "## Sampling Data From the Spiked Identity Model" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": { 288 | "id": "EEBL7HzP-Y8r" 289 | }, 290 | "source": [ 291 | "In this section, we implement a class `Problem` that implements various methods concerning the fixed-design linear regression setup, such as computing the empirical risk, computing the lasso parameter value $\\lambda$ suggested in the introduction, and resampling of the labels. Notice that the design matrix $X$ is assumed to be fixed, and the only source of randomness comes from the noise random variables $\\xi$." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "metadata": { 297 | "id": "-hZawsRTUDi0" 298 | }, 299 | "source": [ 300 | "class Problem(object):\n", 301 | " \"\"\" A class for representing a dataset (X, y) and providing methods for\n", 302 | " computing the empirical risk function and the in-sample prediction error. 
\"\"\"\n", 303 | " \n", 304 | " def __init__(self, X, w_star, noise_std):\n", 305 | " \"\"\" :n: An n \\times d matrix of covariates.\n", 306 | " :w_star: A d-dimensional vector used to generate noisy observations\n", 307 | " y_i = + N(0, noise_std**2). In this practical session\n", 308 | " w_star should be a sparse vector.\n", 309 | " :noise_std: Standard deviation of the zero-mean Gaussian label noise.\n", 310 | " \"\"\"\n", 311 | " self.X = X\n", 312 | " self.n = X.shape[0]\n", 313 | " self.d = X.shape[1]\n", 314 | " self.w_star = w_star.reshape(self.d, 1)\n", 315 | " self.noise_std = noise_std\n", 316 | " self.generate_labels(seed=0)\n", 317 | "\n", 318 | " def generate_labels(self, seed):\n", 319 | " \"\"\" Resamples new labels y = Xw* + noise_std*N(0, I) and stores it as a\n", 320 | " class member self.y. \"\"\"\n", 321 | " np.random.seed(seed)\n", 322 | " self.xi = np.random.normal(loc=0.0, scale=1.0, size=(self.n, 1))\n", 323 | " self.y = self.X @ self.w_star + self.xi\n", 324 | "\n", 325 | " def get_lasso_lambda(self):\n", 326 | " \"\"\" Returns the lasso lambda computed in the introduction. \"\"\"\n", 327 | " return 8.0*self.noise_std*np.sqrt(np.log(self.d))/np.sqrt(self.n)\n", 328 | "\n", 329 | " def compute_in_sample_prediction_error(self, w):\n", 330 | " \"\"\" For a d-dimensional vector w outputs 1/n ||X(w - w*)||_{2}^{2}. \"\"\"\n", 331 | " return tnp.average((self.X @ (w - self.w_star))**2)\n", 332 | "\n", 333 | " def compute_empirical_risk(self, w):\n", 334 | " \"\"\" For a d-dimensional vector w, outputs R(w) = 1/n ||Xw - y||_{2}^{2}. \"\"\"\n", 335 | " return tnp.average((self.X @ w - self.y)**2)" 336 | ], 337 | "execution_count": null, 338 | "outputs": [] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": { 343 | "id": "YmsXPgWD_cbR" 344 | }, 345 | "source": [ 346 | "Next, we implement a function for obtaining a design matrix $X$, whose rows are randomly sampled from the spiked identity model:\n", 347 | "$$\n", 348 | " X_{i} \\sim N(0, \\gamma I_{d} + (1-\\gamma)\\mathbf{1}\\mathbf{1}^{\\mathsf{T}}).\n", 349 | "$$\n", 350 | "For $\\gamma > 0$ bounded away from $1$, random matrices with rows sampled from the above distribution do not satisfy the restricted isometry property but satisfy the restricted eigenvalue condition. See Example 7.18 in the textbook by *Wainwright [2019]* for details." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "metadata": { 356 | "id": "iBl5k0C5BM2o" 357 | }, 358 | "source": [ 359 | "def get_spiked_identity_X(n, d, gamma):\n", 360 | " \"\"\" See the discussion in the above text cell. 
\"\"\"\n", 361 | " X_id = np.random.normal(0, 1, size=(n,d)) # Identity covariance component.\n", 362 | " X_spiked = np.ones((n, d)) * np.random.normal(0,1,size=(n,1)) # Spiked component.\n", 363 | " return gamma*X_id + (1.0-gamma)*X_spiked" 364 | ], 365 | "execution_count": null, 366 | "outputs": [] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": { 371 | "id": "ljYyWeP_b5h6" 372 | }, 373 | "source": [ 374 | "## Computational Considerations" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": { 380 | "id": "-mrP2HWMb9Bv" 381 | }, 382 | "source": [ 383 | "Recall that the lasso objective can be written as a sum of a smooth and non-smooth terms\n", 384 | "$$\n", 385 | " R_{\\lambda}(w) = \\underbrace{\\frac{1}{n}\\|Xw - y\\|_{2}^{2}}_{\\text{smooth}} + \n", 386 | " \\underbrace{\\lambda\\|w\\|_{1}}_{\\text{non-smooth}}.\n", 387 | "$$\n", 388 | "Due to the $\\ell_{1}$ penalty, the overall objective $R_{\\lambda}$ is non-smooth. Based on the results introduced in the optimization lectures, it would appear that only the slow convergence rate of order $O(1/\\sqrt{t})$ is possible. To circumvent this issue, in Lecture 10 we have introduced **proximal algorithms that allow us to obtain the smooth convergence rate of order $O(1/t)$ for non-smooth problems whenever we can explicitly compute the proximal operator**. In the following exercise, we compare the usual gradient descent updates with the smooth gradient descent updates." 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "metadata": { 394 | "id": "8RhVJ5_KB9CG" 395 | }, 396 | "source": [ 397 | "np.random.seed(0)\n", 398 | "\n", 399 | "n=250\n", 400 | "d=1000\n", 401 | "w_star = np.zeros((d,1))\n", 402 | "k = 10\n", 403 | "w_star[:k,0] = 10.0\n", 404 | "X = get_spiked_identity_X(n, d, gamma=1.0)\n", 405 | "\n", 406 | "problem = Problem(\n", 407 | " X=X,\n", 408 | " w_star=w_star,\n", 409 | " noise_std=1.0)" 410 | ], 411 | "execution_count": null, 412 | "outputs": [] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": { 417 | "id": "tWyqjXGRsvXc" 418 | }, 419 | "source": [ 420 | "### Exercise 1" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": { 426 | "id": "FbPus8WwsxC5" 427 | }, 428 | "source": [ 429 | "- In you simulations, use the python variable `problem` defined in the preceding cell.\n", 430 | "- Use the lasso penalty lamba given by `problem.get_lasso_lambda()`.\n", 431 | "- Implement the ISTA algorithm described in Lecture 10.\n", 432 | "- Run ISTA and gradient descent with a constant step size $\\eta = 0.1$ for $T=500$ iterations each. Plot the value of $\\|w_{t} - w^{\\star}\\|_{2}$ computed by both algorithms against the number of iterations $t$. Commend on your observations.\n", 433 | "- Repeat the above simulations with different choices of $\\gamma$ parameter. What happens for values of $\\gamma$ close to $0$?" 434 | ] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "metadata": { 439 | "id": "zm1IDPeLhGuS" 440 | }, 441 | "source": [ 442 | "## Statistical Considerations" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": { 448 | "id": "kGx_8rLmWoIT" 449 | }, 450 | "source": [ 451 | "Recall the upper bound on the in-sample prediction error proved at the beginning of this notebook. 
With the regularization parameter $\\lambda = \\frac{8\\sigma\\sqrt{\\log d}}{\\sqrt{n}}$, we obtained\n", 452 | "$$\n", 453 | " \\underbrace{\\frac{1}{n}\\|Xw_{\\lambda} - Xw^{\\star}\\|_{2}^{2}}_{\\text{in-sample prediction error}}\n", 454 | " \\leq \\frac{144}{\\gamma}\\frac{k\\sigma^{2}\\log d}{n}.\n", 455 | "$$\n", 456 | "While the above bound gives the fast rate, the issue is in the presence of the parameter $1/\\gamma$. While the dependence on the conditioning of the design matrix $X$ is perfectly natural for the estimation problem, it is less natural in the prediction setting. As an example, consider duplicating some columns in the design matrix of $X$ corresponding to indexes on the support set of the sparse signal vector $w^{\\star}$. Intuitively, duplicating columns makes the estimation problem impossible, as the model becomes unidentifiable, yet the prediction problem becomes easier, as there is no difference which of the duplicate columns the learning algorithm selects (i.e., the learning algorithm gets more freedom if some columns get duplicated).\n", 457 | "\n", 458 | "**At the same time, it is known that the computationally intractable $\\ell_{0}$ estimator satisfies the fast rate in-sample prediction error bound $O(\\frac{k\\sigma^{2}\\log d}{n})$ without any dependence on the restricted eigenvalue constant $\\gamma$.** This leaves two open possibilities: either the lasso estimator is sub-optimal or the analysis presented above is too loose.\n", 459 | "\n", 460 | "**In the following exercise, you are asked to (informally) show that the lasso estimator is indeed sub-optimal by constructing a design matrix $X \\in \\mathbb{R}^{n \\times d}$ such that the lasso incurs in-sample prediction error of order $\\Omega(1/\\sqrt{n})$ with high probability.**" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": { 466 | "id": "XMumAeQ6yelD" 467 | }, 468 | "source": [ 469 | "### Exercise 2" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": { 475 | "id": "Gs5zSa4HywA9" 476 | }, 477 | "source": [ 478 | "Let the signal vector $w^{\\star} \\in \\mathbb{R}^{d}$ be $2$-sparse with the first two coordinates equal to $1/2$. Construct a design matrix $X \\in \\mathbb{R}^{n \\times d}$ such that the lasso estimator $w_{\\lambda}$ with the usual choice of parameter\n", 479 | "$\\lambda = \\Theta(\\frac{\\sigma\\sqrt{\\log d}}{\\sqrt{n}})$ satisfies the following lower bound with high probability:\n", 480 | "$$\n", 481 | " \\frac{1}{n}\\|X w_{\\lambda} - Xw^{\\star}\\|_{2}^{2} = \\Omega\\left(1/\\sqrt{n}\\right).\n", 482 | "$$\n", 483 | "You are only asked to establish the above result informally. Verify your construction empirically." 
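For the empirical verification, a harness along the following lines may be useful. This is only a sketch: it assumes `scikit-learn` is available, and the helpers `in_sample_error` and `make_candidate_X` are placeholders of ours rather than part of the notebook. Note that `sklearn.linear_model.Lasso` minimizes $\frac{1}{2n}\|y - Xw\|_{2}^{2} + \alpha\|w\|_{1}$, which is proportional to $R_{\lambda}$ when $\alpha = \lambda/2$.
```
import numpy as np
from sklearn.linear_model import Lasso

def in_sample_error(X, w_star, sigma=1.0, seed=0):
    # Fits the lasso with lambda of order sigma*sqrt(log d)/sqrt(n) and returns
    # the in-sample prediction error (1/n)*||X(w_lambda - w_star)||_2^2.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    y = X @ w_star + sigma * rng.standard_normal(n)
    lam = 8.0 * sigma * np.sqrt(np.log(d)) / np.sqrt(n)
    # alpha = lambda / 2 converts our objective R_lambda to sklearn's objective.
    model = Lasso(alpha=lam / 2.0, fit_intercept=False, max_iter=100000)
    model.fit(X, y)
    return np.mean((X @ (model.coef_ - w_star)) ** 2)

def make_candidate_X(n, d):
    # Placeholder design; replace with your block-diagonal construction.
    return np.random.default_rng(1).standard_normal((n, d))

for n in [100, 400, 1600]:
    d = n
    w_star = np.zeros(d)
    w_star[:2] = 0.5
    print(n, in_sample_error(make_candidate_X(n, d), w_star))
```
Plotting the reported error against $n$ (on a log-log scale) makes it easy to read off whether the decay is of order $1/n$ or only $1/\sqrt{n}$.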
484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": { 489 | "id": "lwLur0LggPA7" 490 | }, 491 | "source": [ 492 | "### Hint" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": { 498 | "id": "57UijKVigZhF" 499 | }, 500 | "source": [ 501 | "To gain some intuition as to what a bad matrix would look like, try answering the following questions:\n", 502 | "- What is the smallest value $\lambda = \lambda_{\text{min}}$ that guarantees $(w_{\lambda})_{S^{c}} = 0$ (i.e., the lasso output does not fit any coordinates outside of the true support $S$)?\n", 503 | "- What happens for values of $\lambda$ significantly smaller than $\lambda_{\text{min}}$?\n", 504 | "- Consider the case $n=d$ and let $X$ be a block-diagonal matrix constructed from $2 \times 2$ matrices $A \in \mathbb{R}^{2 \times 2}$ (i.e., each block is the same matrix repeated). Choose the matrix $A$ such that $(w_{\lambda})_{1} \neq 0$ and $(w_{\lambda})_{2} \neq 0$ if and only if $\lambda \ll \lambda_{\text{min}}$.\n", 505 | "- Deduce the desired slow rate lower bound." 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": { 511 | "id": "XnxGsJRf1_OS" 512 | }, 513 | "source": [ 514 | "## Bibliographic Remarks" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": { 520 | "id": "q6zkTcbi2Bi3" 521 | }, 522 | "source": [ 523 | "A slightly weaker version of the restricted eigenvalue condition than the one presented in this practical session is due to *Bickel, Ritov, and Tsybakov [2009]*. Our presentation of the lasso estimation error and in-sample prediction error upper bounds is based on [*Wainwright, 2019, Chapter 7*], which in turn credits *Bickel, Ritov, and Tsybakov [2009]*. Many variations of different conditions used to establish statistical performance bounds for the lasso exist; the\n", 524 | "interplay between these different conditions is explored by *van de Geer and Bühlmann [2009]*. The fact that non-isotropic random ensembles can satisfy the restricted eigenvalue condition is shown in [*Raskutti, Wainwright, and Yu, 2010, Rudelson and Zhou, 2012, Oliveira, 2016*]. *Agarwal, Negahban, and Wainwright [2012]* establish the geometric convergence of gradient methods under restricted notions of strong convexity and smoothness, as seen in Exercise 1 of this practical session. Slow rate lower bounds for the lasso in-sample prediction error were shown by *Foygel and Srebro [2011]* and *Dalalyan, Hebiri, and Lederer [2017]*. The block-diagonal construction of the ill-behaved design matrix presented in this notebook is due to *Candès and Plan [2009]*. This construction was later refined by *Zhang, Wainwright, and Jordan [2017]* to show that the sub-optimal slow rate is intrinsic for regularization paths of a large class of penalized estimators, including the lasso. The same authors have also previously shown that any polynomial-time algorithm constrained to output a sparse vector cannot achieve the fast rate without imposing strong restrictions on the design [*Zhang, Wainwright, and Jordan, 2014*]. In contrast, we recall that the computationally intractable $\ell_{0}$ estimator achieves the fast rate without any restrictions on the design matrix other than column normalization. Whether there exists a polynomial-time algorithm achieving the fast rate guarantee under the same conditions as the $\ell_{0}$ estimator remains an open problem.
525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": { 530 | "id": "uLx7Knxw72W6" 531 | }, 532 | "source": [ 533 | "**References**\n", 534 | "\n", 535 | "A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, pages 2452–2482, 2012.\n", 536 | "\n", 537 | "P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of statistics, 37(4):1705–1732, 2009.\n", 538 | "\n", 539 | "E. J. Candès and Y. Plan. Near-ideal model selection by l1 minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.\n", 540 | "\n", 541 | "A. S. Dalalyan, M. Hebiri, and J. Lederer. On the prediction performance of the lasso. Bernoulli, 23(1):552–581, 2017.\n", 542 | "\n", 543 | "R. Foygel and N. Srebro. Fast-rate and optimistic-rate error bounds for l1-regularized regression. arXiv preprint arXiv:1108.0373, 2011.\n", 544 | "\n", 545 | "R. Oliveira. The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4):1175–1194, 2016.\n", 546 | "\n", 547 | "G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.\n", 548 | "\n", 549 | "M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. In\n", 550 | "Conference on Learning Theory, pages 10–1. JMLR Workshop and Conference Proceedings, 2012.\n", 551 | "\n", 552 | "S. A. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.\n", 553 | "\n", 554 | "M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.\n", 555 | "\n", 556 | "Y. Zhang, M. J. Wainwright, and M. I. Jordan. Lower bounds on the performance of\n", 557 | "polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948, 2014.\n", 558 | "\n", 559 | "Y. Zhang, M. J. Wainwright, and M. I. Jordan. Optimal prediction for sparse linear models? lower bounds for coordinate-separable m-estimators. Electronic Journal of Statistics, 11 (1):752–799, 2017." 560 | ] 561 | } 562 | ] 563 | } -------------------------------------------------------------------------------- /practicals/solved_practicals/README.md: -------------------------------------------------------------------------------- 1 | This repository contains already executed practical session notebooks with 2 | solution code snippets plugged into the correct places, and all the simulation 3 | outputs already generated. 4 | --------------------------------------------------------------------------------