├── README.md
└── bibliography.csv

/README.md:
--------------------------------------------------------------------------------

# Annotated reading list for ML theory

An annotated reference list of ML theory. I didn’t compile this list! All the credit goes to Aditi Raghunathan and her reading list for the course [Theoretical and Empirical Foundations of Modern Machine Learning (2022)](https://www.cs.cmu.edu/~aditirag/teaching/15-884F22.html). I simply generated summaries via ChatGPT-4 with the Link Reader plugin and assembled the summaries into this list. You can also download a CSV with the references to import into your citation manager.

## Generalization

**[The Tradeoffs of Large Scale Learning](https://proceedings.neurips.cc/paper/2007/file/0d3180d672e08b4c5312dcdafdf6ef36-Paper.pdf)**
Authors: Léon Bottou, Olivier Bousquet (2007)
Publication: NIPS

- The paper investigates the trade-offs in large-scale learning, specifically the balance between computational cost and statistical accuracy.
- It introduces a theoretical framework for analyzing the trade-offs between computation and statistics in machine learning.
- The authors argue that in large-scale learning, computational cost becomes a crucial factor, and the traditional statistical view of learning is not sufficient.
- They propose that the optimal strategy in such scenarios is to perform a small amount of computation on many examples rather than a large amount of computation on a few examples.
- The paper concludes that understanding these trade-offs can lead to more efficient learning algorithms, particularly in the context of large-scale learning.

**[The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/pdf/1803.03635.pdf)**
Authors: Jonathan Frankle, Michael Carbin (2018)
Publication: arXiv

- This paper investigates the "lottery ticket hypothesis," which posits that randomly-initialized, dense neural networks contain subnetworks ("winning tickets") that - when trained in isolation - can match the test accuracy of the original network.
- The authors provide empirical evidence supporting this hypothesis, demonstrating that such subnetworks exist and can be identified through iterative magnitude pruning (sketched below).
- They further explore the properties of these "winning tickets," finding that their initial, random weights are conducive to successful optimization.
- The paper suggests that these findings could have significant implications for neural network initialization and for understanding why large, over-parameterized networks are easier to train.
- The authors conclude by proposing future research directions, including whether these principles apply to other kinds of networks and tasks, and how "winning tickets" can be found more efficiently.
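Since the iterative pruning procedure is central to the paper, here is a minimal sketch of it in PyTorch. This is an illustration of the idea rather than the authors' code: the `train` function, the number of rounds, and the per-round pruning fraction are placeholder assumptions.

```python
import copy
import torch

def find_winning_ticket(model, train, rounds=5, frac=0.2):
    """Sketch of iterative magnitude pruning: train, prune the
    smallest-magnitude surviving weights, rewind the survivors to
    their original initialization, and repeat."""
    init_state = copy.deepcopy(model.state_dict())  # save theta_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)  # assumed to train `model` with `masks` applied
        with torch.no_grad():
            for name, p in model.named_parameters():
                survivors = p[masks[name].bool()].abs()
                cutoff = survivors.quantile(frac)  # drop lowest frac per round
                masks[name] *= (p.abs() > cutoff).float()
                # Rewind: surviving weights go back to their initial values.
                p.copy_(init_state[name] * masks[name])
    return masks  # the mask defines the "winning ticket" subnetwork
```

The control experiment in the paper is to keep the mask but re-randomize the surviving weights; those subnetworks train markedly worse, which is what makes the original initialization part of the "ticket."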
**[Exploring Generalization in Deep Learning](https://proceedings.neurips.cc/paper/2017/file/10ce03a1ed01077e3e289f3e53c72813-Paper.pdf)**
Authors: Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro (2017)
Publication: NeurIPS 2017

* The paper investigates what drives generalization in deep networks, considering several recently suggested explanations, including norm-based control, sharpness, and robustness.
* The authors highlight the importance of scale normalization and make a connection between sharpness and PAC-Bayes theory.
* The paper explores the bias introduced by algorithmic choices for neural networks and what ensures generalization in neural networks. It also discusses the relevant notion of complexity or capacity control.
* The authors examine complexity measures that have recently been suggested, or could be considered, in explaining generalization in deep learning. They evaluate the measures based on their ability to theoretically guarantee generalization and their empirical ability to explain several recently observed phenomena.
* The paper concludes that studying how each measure can guarantee generalization allows for a better understanding of how it should be computed and compared in order to explain the empirical phenomena. The authors also emphasize the importance of relating the scale of the parameters and the scale of the output of the network, e.g., by relating norm and margin.

**[The Implicit Bias of Gradient Descent on Separable Data](https://www.jmlr.org/papers/volume19/18-188/18-188.pdf)**
Authors: Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (2018)
Publication: Journal of Machine Learning Research

- This paper investigates the implicit bias of gradient descent (GD) when applied to linearly separable data with exponential-tailed losses such as the logistic loss.
- The authors show that for any initialization and any sufficiently small learning rate, the direction of the GD iterates converges to the maximum margin (hard-margin SVM) separator, even though the loss itself has no finite minimizer.
- They further show that this convergence in direction is very slow (logarithmic in the number of iterations), which helps explain why test performance can keep improving long after the training error reaches zero, and they discuss extensions to other exponential-tailed loss functions.
- The paper concludes that the implicit bias of GD and related optimization methods plays a crucial role in the generalization ability of deep learning models, providing a fresh perspective on the behavior of these models.

## Double descent, bias-variance tradeoff, kernel methods

**[Neural Tangent Kernel: convergence and generalization in Neural Networks](https://arxiv.org/pdf/1806.07572.pdf)**
Authors: Arthur Jacot, Franck Gabriel, Clément Hongler (2018)
Publication: arXiv

- The paper introduces the Neural Tangent Kernel (NTK), a tool for analyzing the behavior of neural networks in the infinite-width limit.
- The authors show that in this limit, the NTK becomes deterministic at initialization and stays constant during training.
- The paper demonstrates that the function implemented by the network then evolves according to a linear differential equation during training: gradient descent on the network is kernel gradient descent with respect to the (fixed) NTK.
- The paper concludes that the NTK gives a new understanding of the dynamics of gradient descent over neural networks and provides a new set of tools for analyzing deep learning. A sketch of how to compute the empirical NTK follows below.
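To make the kernel concrete: for a scalar-output network f(x; θ), the NTK is the Gram matrix of parameter gradients, Θ(x, x′) = ⟨∇_θ f(x; θ), ∇_θ f(x′; θ)⟩. At finite width this "empirical NTK" depends on θ and moves during training; the paper's result is that in the infinite-width limit it becomes deterministic and constant. A minimal sketch of the finite-width computation with autograd, where the toy architecture is an arbitrary choice:

```python
import torch

def empirical_ntk(f, xs):
    """Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> for a
    scalar-output module f; xs has shape (n, input_dim)."""
    params = [p for p in f.parameters() if p.requires_grad]
    rows = []
    for x in xs:
        grads = torch.autograd.grad(f(x).squeeze(), params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    g_matrix = torch.stack(rows)   # (n, num_params)
    return g_matrix @ g_matrix.T   # (n, n) kernel matrix

net = torch.nn.Sequential(
    torch.nn.Linear(3, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1))
print(empirical_ntk(net, torch.randn(5, 3)).shape)  # torch.Size([5, 5])
```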
**[Benign Overfitting in Linear Regression](https://arxiv.org/pdf/1906.11300.pdf)**
Authors: Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler (2020)
Publication: Proceedings of the National Academy of Sciences

- The paper investigates the phenomenon of benign overfitting, where deep neural networks predict well even with a perfect fit to noisy training data, in the tractable setting of linear regression.
- The authors characterize the linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy, in terms of two notions of the effective rank of the data covariance.
- The paper shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
- The authors find that the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of data distributions when the data lies in an infinite-dimensional space than when it lies in a finite-dimensional space whose dimension grows faster than the sample size.
- The paper concludes that understanding prediction rules that fit the training data perfectly is a central challenge in arriving at a scientific understanding of the success of deep learning methods. (A small sketch of the minimum norm interpolator appears below.)
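The "minimum norm interpolating prediction rule" is concrete enough to write in a few lines of numpy. In the overparameterized regime there are infinitely many weight vectors that fit the training data exactly; the Moore-Penrose pseudoinverse selects the one of smallest Euclidean norm. The dimensions and noise level below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 1000                          # heavily overparameterized: d >> n
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.5 * rng.normal(size=n)   # signal in one direction, plus noise

# Minimum-norm interpolator: w = X^T (X X^T)^{-1} y, i.e., pinv(X) @ y
w = np.linalg.pinv(X) @ y

print(np.allclose(X @ w, y))   # True: the noisy training data is fit exactly
```

The paper's question is when this perfect fit to noise is nonetheless "benign," i.e., when x ↦ x·w still predicts near-optimally on fresh data; the answer is stated in terms of effective ranks of the data covariance.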
## Robustness

**[A Universal Law of Robustness via Isoperimetry](https://arxiv.org/pdf/2105.12806.pdf)**
Authors: Sébastien Bubeck, Mark Sellke (2023)
Publication: Journal of the ACM

* The paper proposes a theoretical explanation for the phenomenon in deep learning where models are trained with many more parameters than classical theory would suggest. The authors prove that for a broad class of data distributions and model classes, overparametrization is necessary for smooth interpolation of the data.
* The authors show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. This law is proven for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry (or a mixture thereof).
* The paper suggests that the large size of the models used in deep learning might be a necessity rather than a weakness of the framework. It presents a tradeoff between the size of a model (as measured by the number of parameters) and its “robustness” (as measured by its Lipschitz constant).
* The authors extend a previously conjectured tradeoff for the specific case of two-layer neural networks and Gaussian data to a much more general phenomenon that applies to essentially any parametrized function class and a much broader class of data distributions.
* The paper concludes with the universal law of robustness: for any function class smoothly parametrized by p parameters, and for any d-dimensional dataset of n samples satisfying a natural isoperimetry condition, any function in this class that fits the data below the noise level must have a (Euclidean) Lipschitz constant of order at least √(nd/p).

**[Adversarial Examples Are Not Bugs, They Are Features](https://proceedings.neurips.cc/paper/2019/file/e2c420d928d4bf8ce0ff2ec19b371514-Paper.pdf)**
Authors: Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, Aleksander Madry (2019)
Publication: NeurIPS

- The paper explores the phenomenon of adversarial examples in machine learning, arguing that these are not bugs but rather features of the model.
- The authors propose that adversarial examples can be attributed to the presence of non-robust features in the data, which are highly predictive but also brittle and incomprehensible to humans.
- They demonstrate that these non-robust features are widespread in standard datasets and can lead to adversarial perturbations.
- The paper also discusses adversarial transferability, suggesting that since any two models are likely to learn similar non-robust features, perturbations that manipulate such features will transfer between them.
- The authors conclude that adversarial vulnerability is a human-centric phenomenon, and that non-robust features can be as important as robust ones from the perspective of standard supervised learning.

**[Understanding the Failure Modes of Out-of-Distribution Generalization](https://arxiv.org/pdf/2010.15775.pdf)**
Authors: Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur (2021)
Publication: ICLR 2021

* The paper investigates how machine learning models fail at out-of-distribution (OoD) generalization, particularly when models rely on spurious features that are strongly correlated with labels only during training.
* The authors study the characteristic failure of the Empirical Risk Minimization (ERM) principle during OoD generalization, explaining why ERM tends to fail even on tasks that should be easy to learn from fully predictive invariant features. The study focuses on gradient-descent-trained linear classifiers.
* Two distinct modes of ERM failure are identified, both emerging from how spurious correlations induce skews in the data: one skew is geometric in nature, the other statistical.
* The authors propose a set of constraints that make a task "easy to learn," such as making the invariant feature fully predictive of the label. These constraints establish both a theoretical and an empirical test-bed for reasoning about OoD generalization.
* These theoretical insights are validated experimentally on MNIST- and CIFAR10-based tasks with fully-connected networks and ResNets. The authors show that in any easy-to-learn task devoid of geometric or statistical skews, these models do not rely on spurious features, suggesting the skews are both sufficient and necessary for failure in easy-to-learn tasks.

**[Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization](https://arxiv.org/pdf/2107.04649.pdf)**
Authors: John Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, Ludwig Schmidt (2021)
Publication: arXiv

- This paper empirically demonstrates a strong correlation between in-distribution and out-of-distribution performance for a wide range of models and distribution shifts.
- The authors show that this correlation holds across model architectures, hyperparameters, training set size, and training duration, and is more precise than what is expected from existing domain adaptation theory.
- The paper also investigates cases where the correlation is weaker, such as some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS.
- The authors provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.
- The paper concludes that improving in-distribution performance reliably improves out-of-distribution performance. However, it is currently unclear whether improving in-distribution performance is the only way, or even the best way, to improve out-of-distribution performance.

## Causality

**[On causal and anti-causal learning](https://icml.cc/2012/papers/625.pdf)**
Authors: Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij (2012)
Publication: ICML 2012

- The paper distinguishes causal learning, where we predict an effect from its cause, from anti-causal learning, where we predict a cause from its effect, and asks how this distinction affects standard machine learning problems.
- The key assumption is the independence of mechanism: the distribution of the cause, P(cause), and the mechanism mapping cause to effect, P(effect | cause), are independent modules that contain no information about each other.
- From this assumption the authors predict, among other things, that semi-supervised learning should only help in the anti-causal direction (unlabeled inputs carry information about P(cause), which is useful only when the input is the effect), whereas predictors in the causal direction should be more robust to covariate shift.
- The authors support these predictions with a meta-analysis of published semi-supervised learning results and discuss implications for transfer learning and domain adaptation.
- The paper concludes that knowing the causal direction of a learning problem provides practical guidance for when techniques such as semi-supervised learning and covariate-shift adaptation can be expected to work.

**[Invariant Risk Minimization](https://arxiv.org/pdf/1907.02893.pdf)**
Authors: Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz (2020)
Publication: arXiv

- The paper addresses the fundamental problem in machine learning where machines inherit biases from the data they are trained on, leading to spurious correlations and poor generalization to new test distributions.
- The authors propose Invariant Risk Minimization (IRM), a novel learning paradigm that estimates nonlinear, invariant, causal predictors from multiple training environments, enabling out-of-distribution (OOD) generalization.
- The paper presents a mathematical formulation of IRM together with a practical relaxation (IRMv1) whose objective can be estimated with mini-batch stochastic gradient descent; a sketch of the penalty is given below.
- The authors also explore the relationship between invariance, causality, and OOD generalization, showing how the invariances learned by IRM relate to the causal structure governing the data.
- The paper concludes with a discussion of future research directions, including the benefits of enforcing non-linear invariances and constructing invariance penalties for non-linear settings.
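For concreteness, here is a minimal sketch of the IRMv1 penalty in PyTorch, following the form given in the paper: the gradient of each environment's risk with respect to a fixed scalar "dummy" classifier multiplying the logits is driven to zero. The binary-classification setup and the value of λ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    """IRMv1 penalty: squared gradient of the risk with respect to a
    frozen scalar classifier w = 1.0 placed on top of the logits."""
    w = torch.tensor(1.0, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * w, y)
    (grad,) = torch.autograd.grad(risk, [w], create_graph=True)
    return grad.pow(2)

def irm_loss(model, envs, lam=100.0):
    """Sum over environments of (risk + lambda * invariance penalty).
    `envs` is a list of (inputs, float_labels) pairs, one per environment."""
    total = 0.0
    for x, y in envs:
        logits = model(x).squeeze(-1)
        total = total + F.binary_cross_entropy_with_logits(logits, y)
        total = total + lam * irm_penalty(logits, y)
    return total
```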
## Unsupervised learning

**[Realistic Evaluation of Deep Semi-Supervised Learning Algorithms](https://arxiv.org/pdf/1804.09170.pdf)**
Authors: Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, Ian Goodfellow (2018)
Publication: arXiv preprint arXiv:1804.09170

- The paper investigates the performance of deep semi-supervised learning (SSL) algorithms under realistic conditions. It argues that previous evaluations of these algorithms may have been overly optimistic due to certain experimental design choices.
- The authors propose a new evaluation methodology that accounts for factors such as the presence of out-of-distribution examples in the unlabeled dataset, the use of data augmentation, and the variability in performance due to different model initializations and architectures.
- The paper finds that under these more realistic conditions, the performance of deep SSL algorithms is significantly worse than previously reported. In particular, the presence of out-of-distribution examples in the unlabeled dataset can severely degrade performance.
- The authors also find that the choice of data augmentation strategy, as well as variability across model initializations and architectures, can significantly affect the measured performance of deep SSL algorithms.
- The paper concludes by calling for more rigorous evaluation methodologies in future SSL research, so that reported performance is representative of realistic conditions.

**[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/pdf/2111.06377.pdf)**
Authors: Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021)
Publication: arXiv preprint arXiv:2111.06377

- The paper presents an approach to self-supervised learning for computer vision, Masked Autoencoders (MAE), based on masking random patches of the input image and reconstructing the missing pixels (a sketch of the masking step follows below).
- The authors propose an asymmetric encoder-decoder architecture, where the encoder operates only on the visible subset of patches, and a lightweight decoder reconstructs the original image from the latent representation and mask tokens.
- The paper finds that masking a high proportion of the input image (e.g., 75%) yields a nontrivial and meaningful self-supervisory task, which enables efficient and effective training of large models.
- The authors demonstrate that their approach achieves high accuracy (87.8% with a ViT-Huge) on the ImageNet-1K dataset, outperforming previous methods that use only ImageNet-1K data.
- The paper concludes that the MAE approach allows for learning high-capacity models that generalize well, and shows promising scaling behavior in downstream tasks.
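The masking step at the core of MAE fits in a few lines. Below is a sketch of per-sample random patch masking in PyTorch, in the spirit of the description above; the shapes and implementation details are assumptions, with only the 75% ratio taken from the paper.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim). Keep a random 25% of patches
    per sample; only the kept patches are passed to the encoder."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    scores = torch.rand(b, n)                     # random score per patch
    ids_keep = scores.argsort(dim=1)[:, :n_keep]  # a random subset of indices
    kept = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n)                       # 1 = masked, to reconstruct
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask

kept, mask = random_masking(torch.randn(2, 196, 768))  # 14x14 ViT patches
print(kept.shape)  # torch.Size([2, 49, 768]): the encoder sees 49 of 196
```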
**[Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/pdf/2104.14294.pdf)**
Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin (2021)
Publication: arXiv

- The paper investigates the properties of self-supervised learning in Vision Transformers (ViTs) and how they compare to convolutional neural networks (CNNs).
- The authors ask whether self-attention in ViTs can learn to localize objects in an image without explicit supervision: self-supervised ViT features are found to contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs or with CNNs.
- The proposed method, DINO, is interpreted as a form of self-distillation with no labels; the learned features are also excellent k-NN classifiers on ImageNet.
- The authors find that the self-attention maps of these ViTs can be read as object saliency maps, providing a form of interpretability for these models.
- The paper concludes that self-supervised learning is a powerful tool for training ViTs, and that these models have promising properties for tasks such as object detection and image segmentation.

**[Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss](https://arxiv.org/pdf/2106.04156.pdf)**
Authors: Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, Tengyu Ma (2022)
Publication: arXiv

- This paper presents a theoretical framework for self-supervised learning that does not require conditional independence of positive pairs, a common assumption in previous analyses.
- The authors introduce a novel concept of the augmentation graph on data, where edges connect augmentations of the same datapoint, and ground-truth classes naturally form connected sub-graphs.
- They propose a loss function, the spectral contrastive loss, that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations.
- The paper proves that, under a simple and realistic data assumption, linear classification using representations learned on a polynomial number of unlabeled data samples can recover the ground-truth labels of the data with high accuracy.
- Empirically, the features learned by the proposed objective can match or outperform several strong baselines on benchmark vision datasets.

## Distribution shifts

**[Domain-adversarial training of neural networks](https://arxiv.org/pdf/1505.07818.pdf)**
Authors: Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky (2016)
Publication: The Journal of Machine Learning Research

- The paper introduces a new approach for domain adaptation in deep networks, the Domain-Adversarial Neural Network (DANN). The approach aims to learn a feature representation that is useful for the learning task and is also invariant to the change of domains.
- The authors propose a regularization approach that encourages the learned features to be domain-invariant. This is achieved by adding a domain classifier to the network and training it adversarially against the feature learner via a gradient reversal layer (sketched below).
- The paper demonstrates that the proposed approach can significantly reduce the error rate in domain adaptation tasks. The authors show that DANN outperforms standard neural networks and other domain adaptation methods on several benchmark datasets.
- The authors also provide a theoretical analysis of their method, showing that the domain-adversarial training process can be interpreted as minimizing an upper bound on the expected risk on the target domain.
- The paper concludes that DANN is a promising direction for domain adaptation in deep networks. The authors suggest that future work could explore other types of domain-invariant representations and investigate the use of DANN in other types of learning tasks.
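The adversarial part of DANN is implemented with the paper's gradient reversal layer: an identity map on the forward pass that multiplies the gradient by -λ on the backward pass, so the feature extractor is updated to make the domains indistinguishable while the domain classifier tries to tell them apart. A minimal PyTorch sketch of the idea; the surrounding heads in the usage comment are hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales the gradient by -lam backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical usage: features feed the label head directly, and feed the
# domain head through the reversal layer, so a single backward pass trains
# the domain classifier while anti-training the features:
#   label_logits  = label_head(features)
#   domain_logits = domain_head(grad_reverse(features, lam))
```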
**[Test-Time Training with Self-Supervision for Generalization under Distribution Shifts](https://arxiv.org/pdf/1909.13231.pdf)**
Authors: Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, Moritz Hardt (2020)
Publication: Proceedings of the 37th International Conference on Machine Learning

- The paper proposes Test-Time Training (TTT), a method for improving the performance of predictive models when training and test data come from different distributions. The approach turns a single unlabeled test sample into a self-supervised learning problem, updating the model parameters before making a prediction.
- The authors argue that supervised learning struggles to generalize under distribution shifts. They propose to learn from the shift at test time, allowing the model parameters to depend on the test sample but not on its unknown label.
- Concretely, the authors use rotation prediction as the auxiliary task: each input image is rotated by a multiple of 90 degrees and the model must predict the angle; the shared feature extractor is updated on this task before the main prediction is made.
- The paper demonstrates that TTT leads to substantial improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts, while maintaining the same performance on the original distribution.
- The authors conclude that TTT is a promising approach for dealing with distribution shifts, and suggest that future work could explore other self-supervised tasks and other types of learning problems.

## Foundation models

**[Model-agnostic meta-learning for fast adaptation of deep networks](https://arxiv.org/pdf/1703.03400.pdf)**
Authors: Chelsea Finn, Pieter Abbeel, Sergey Levine (2017)
Publication: Proceedings of the 34th International Conference on Machine Learning

- The paper introduces Model-Agnostic Meta-Learning (MAML), a method designed to help deep learning models adapt quickly to new tasks.
- The key idea of MAML is to train a model's initial parameters on a variety of learning tasks, such that it can solve new tasks using only a small number of gradient steps (see the sketch after this list).
- The authors demonstrate that MAML is applicable to any model trained with gradient descent and to any machine learning problem that can be cast as learning a function, including classification, regression, and reinforcement learning problems.
- The experiments show that the approach is effective for few-shot learning in image recognition and for reinforcement learning tasks.
- The paper concludes that MAML provides a promising approach for few-shot learning and rapid adaptation to new tasks, and notes interesting directions for future work, including exploring other forms of prior knowledge and meta-objectives.
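Since the nested optimization is the whole method, here is a minimal second-order MAML sketch in PyTorch using `torch.func.functional_call` (PyTorch 2.x). The regression loss, the single inner step, and the task format are illustrative assumptions.

```python
import torch
from torch.func import functional_call

def maml_step(model, tasks, meta_opt, inner_lr=0.01):
    """One meta-update. Each task is ((x_s, y_s), (x_q, y_q)): a support
    set for the inner adaptation step and a query set for the meta-loss."""
    loss_fn = torch.nn.MSELoss()
    params = dict(model.named_parameters())
    meta_loss = 0.0
    for (x_s, y_s), (x_q, y_q) in tasks:
        # Inner loop: one gradient step on the support set.
        support_loss = loss_fn(functional_call(model, params, (x_s,)), y_s)
        grads = torch.autograd.grad(
            support_loss, list(params.values()), create_graph=True)
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(params.items(), grads)}
        # Outer loss: how well the adapted parameters do on the query set.
        meta_loss = meta_loss + loss_fn(
            functional_call(model, adapted, (x_q,)), y_q)
    meta_opt.zero_grad()
    meta_loss.backward()  # backprops through the inner gradient step
    meta_opt.step()
    return float(meta_loss)
```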
**[The power of scale for parameter-efficient prompt tuning](https://arxiv.org/pdf/2104.08691.pdf)**
Authors: Brian Lester, Rami Al-Rfou, Noah Constant (2021)
Publication: arXiv preprint

- This paper explores "prompt tuning," a mechanism for learning "soft prompts" that condition frozen language models to perform specific downstream tasks.
- The authors show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, the method matches the strong performance of model tuning (where all model weights are tuned).
- The paper demonstrates that this end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin.
- The authors also show that conditioning a frozen model with soft prompts confers robustness to domain transfer and enables efficient "prompt ensembling."
- The paper concludes that prompt tuning is a promising method for adapting large language models, offering a balance between performance and efficiency, and opening up several avenues for future research.

**[Scaling laws for neural language models](https://arxiv.org/pdf/2001.08361.pdf)**
Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei (2020)
Publication: arXiv

- The paper investigates the relationship between the performance of neural language models and their scale, in terms of model size, dataset size, and the amount of training compute.
- The authors find that increasing these factors leads to improved performance, even beyond the scales that were common in the field at the time.
- The cross-entropy loss falls off as a power law in each of model size, dataset size, and compute, with trends spanning more than seven orders of magnitude: returns diminish as scale increases, but they do not vanish.
- The authors suggest that these findings have significant implications for the future of AI research: simply scaling up existing models and techniques should lead to continued improvements, and larger models are significantly more sample-efficient.
- However, they also note that this approach could have significant costs, both in the computational resources required and the potential environmental impact.

**[What Can Transformers Learn In-Context? A Case Study of Simple Function Classes](https://arxiv.org/pdf/2208.01066.pdf)**
Authors: Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant (2023)
Publication: arXiv

- The paper explores in-context learning, where a model produces outputs for a new task from a prompt of input-output examples, without any parameter updates.
- The authors focus on the ability of Transformer models to learn simple function classes in context: linear functions, sparse linear functions, two-layer neural networks, and decision trees (the sketch below shows how such prompts are constructed).
- They find that Transformers can be trained from scratch to perform in-context learning of these function classes, with performance comparable to or exceeding that of task-specific learning algorithms such as least squares.
- The authors also find that the trained models are robust to distribution shifts between the training data and inference-time prompts, as well as between the in-context examples and the query input during inference.
- The study suggests that Transformers can encode complex learning algorithms in a single forward pass, and that increasing the model's capacity can significantly improve its performance.
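The data generation behind these experiments is easy to sketch: every training sequence corresponds to a freshly drawn function, and the model is trained to predict each output from the pairs preceding it. A sketch for the linear-function class, where the dimensions are illustrative and the paper's exact tokenization differs in details:

```python
import torch

def sample_linear_prompt(n_points=10, dim=5):
    """One in-context sequence for the linear function class: y_i = <w, x_i>
    for a freshly drawn w. The model must infer the task from the prompt
    alone -- no parameter update is ever made."""
    w = torch.randn(dim)                  # a new "task" for every sequence
    xs = torch.randn(n_points, dim)
    ys = xs @ w
    # The prompt interleaves (x_1, y_1, ..., x_k, y_k, x_{k+1}); the model
    # is trained to predict the next y at every position k.
    return xs, ys, w

xs, ys, w = sample_linear_prompt()
# The baseline the trained Transformer is compared against: a least squares
# fit to the k pairs seen so far.
k = 8
w_hat = torch.linalg.lstsq(xs[:k], ys[:k].unsqueeze(-1)).solution.squeeze()
```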
## Benchmarking LLMs

**[Beyond the imitation game: quantifying and extrapolating the capabilities of language models](https://arxiv.org/pdf/2206.04615.pdf)**
Authors: Aarohi Srivastava and several hundred collaborators (2022)
Publication: arXiv

- The paper introduces BIG-bench (the Beyond the Imitation Game benchmark), a large collaborative benchmark of more than 200 diverse tasks designed to quantify and extrapolate the capabilities of language models.
- The tasks cover problems believed to be beyond the reach of then-current models, spanning reasoning, mathematics, linguistics, and social understanding; dense and sparse models from OpenAI and Google are evaluated alongside human raters.
- The authors find that aggregate performance improves with model size but remains poor in absolute terms relative to human raters.
- Performance on some tasks improves smoothly and predictably with scale, while other tasks exhibit abrupt "breakthrough" improvements at a particular scale, making capabilities difficult to extrapolate.
- The paper concludes that while large language models have impressive capabilities, there are still many tasks where they fall short, indicating the need for further research and development in this field.

**[On the dangers of stochastic parrots: can language models be too big?](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)**
Authors: Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell (2021)
Publication: [FAccT '21](https://dl.acm.org/doi/proceedings/10.1145/3442188)

- The paper critically examines the trend in Natural Language Processing (NLP) of developing and deploying increasingly larger language models, such as BERT, GPT-2/3, and Switch-C. These models have advanced the state of the art on many tasks, largely through the methodology of pretraining on large datasets and fine-tuning for specific tasks.
- The authors pose the question: "How big is too big?" They explore the potential risks associated with these large language models, including environmental, financial, and ethical considerations.
- The paper highlights the environmental and financial costs of training large language models and suggests that these costs should be carefully weighed before deciding to develop such models.
- The authors recommend investing resources into curating and carefully documenting datasets, rather than indiscriminately ingesting all available web data. This approach could help mitigate some of the risks associated with large language models.
- The paper encourages pre-development exercises that evaluate how a planned approach aligns with research and development goals and supports stakeholder values, and advocates for research directions beyond simply creating larger and larger language models.

# Prompt

```
Fetch the following papers. Based on their abstracts and the content in the introduction, write 3-5 key points about these papers. Make sure to highlight the key questions investigated by the paper, and the core conclusions. Target a mathematically knowledgeable professor of neuroscience who is outside of this field. Write the results in the format:

[Paper 1 name](https://link)
Authors (date)
Publication name

* point 1
* Point 2
* Point 3, etc.

[Paper 2 name]...
```

--------------------------------------------------------------------------------
/bibliography.csv:
--------------------------------------------------------------------------------
1 | "Key","Item Type","Publication Year","Author","Title","Publication Title","ISBN","ISSN","DOI","Url","Abstract Note","Date","Date Added","Date Modified","Access Date","Pages","Num Pages","Issue","Volume","Number Of Volumes","Journal Abbreviation","Short Title","Series","Series Number","Series Text","Series Title","Publisher","Place","Language","Rights","Type","Archive","Archive Location","Library Catalog","Call Number","Extra","Notes","File Attachments","Link Attachments","Manual Tags","Automatic Tags","Editor","Series Editor","Translator","Contributor","Attorney Agent","Book Author","Cast Member","Commenter","Composer","Cosponsor","Counsel","Interviewer","Producer","Recipient","Reviewed Author","Scriptwriter","Words By","Guest","Number","Edition","Running Time","Scale","Medium","Artwork Size","Filing Date","Application Number","Assignee","Issuing Authority","Country","Meeting Name","Conference Name","Court","References","Reporter","Legal Status","Priority Numbers","Programming Language","Version","System","Code","Code Number","Section","Session","Committee","History","Legislative Body" 2 | "BHA9Y28U","journalArticle","2015","Ganin, Yaroslav; Ustinova, Evgeniya; Ajakan, Hana; Germain, Pascal; Larochelle, Hugo; Laviolette, François; Marchand, Mario; Lempitsky, Victor","Domain-Adversarial Training of Neural Networks","","","","10.48550/arXiv.1505.07818","https://arxiv.org/abs/1505.07818v4","We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.","2015-05-28","2022-06-07 01:12:13","2023-05-14 16:47:31","2022-06-07 01:12:13","","","","","","","","","","","","","","en","","","","","arxiv.org","","","","/Users/patrickmineault/Zotero/storage/X5N5UKQH/Ganin et al.
- 2016 - Domain-adversarial training of neural networks.pdf; /Users/patrickmineault/Zotero/storage/JS7GWRX7/1505.html","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","" 3 | "YKG67LQA","preprint","2020","Arjovsky, Martin; Bottou, Léon; Gulrajani, Ishaan; Lopez-Paz, David","Invariant Risk Minimization","","","","10.48550/arXiv.1907.02893","http://arxiv.org/abs/1907.02893","We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.","2020-03-27","2023-03-12 16:45:36","2023-03-12 16:45:36","2023-03-12 16:45:36","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1907.02893 [cs, stat]","","/Users/patrickmineault/Zotero/storage/N8VVTUB4/Arjovsky et al. - 2020 - Invariant Risk Minimization.pdf; /Users/patrickmineault/Zotero/storage/XMUBEJ5H/1907.html","","","Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:1907.02893","","","","","","","","","","","","","","","","","","","","","","","","","","","" 4 | "ZQZE6GRY","conferencePaper","2007","Bottou, Léon","The tradeoffs of large scale learning","NeurIPS","","","","https://proceedings.neurips.cc/paper/2007/file/0d3180d672e08b4c5312dcdafdf6ef36-Paper.pdf","This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.","2007","2023-04-18 21:47:34","2023-04-19 00:58:30","","","","","","","","","","","","","","","","","","","","","","","","/Users/patrickmineault/Zotero/storage/EMWXNCDF/Bottou - 2007 - The tradeoffs of large scale learning.pdf","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","" 5 | "5NDBTBAA","preprint","2023","Garg, Shivam; Tsipras, Dimitris; Liang, Percy; Valiant, Gregory","What Can Transformers Learn In-Context? A Case Study of Simple Function Classes","","","","10.48550/arXiv.2208.01066","http://arxiv.org/abs/2208.01066","In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. 
To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn ""most"" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning .","2023-01-14","2023-04-19 05:08:06","2023-05-14 16:50:07","2023-04-19 05:08:06","","","","","","","What Can Transformers Learn In-Context?","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2208.01066 [cs]","","/Users/patrickmineault/Zotero/storage/B4JUV7EC/Garg et al. - 2023 - What Can Transformers Learn In-Context A Case Stu.pdf; /Users/patrickmineault/Zotero/storage/BU45KQEE/2208.html","","","Computer Science - Computation and Language; Computer Science - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:2208.01066","","","","","","","","","","","","","","","","","","","","","","","","","","","" 6 | "7NLMRHGT","preprint","2022","HaoChen, Jeff Z.; Wei, Colin; Gaidon, Adrien; Ma, Tengyu","Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss","","","","10.48550/arXiv.2106.04156","http://arxiv.org/abs/2106.04156","Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while keeping negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image). Our work analyzes contrastive learning without assuming conditional independence of positive pairs using a novel concept of the augmentation graph on data. Edges in this graph connect augmentations of the same data, and ground-truth classes naturally form connected sub-graphs. We propose a loss that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations. Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss. Empirically, the features learned by our objective can match or outperform several strong baselines on benchmark vision datasets. 
In all, this work provides the first provable analysis for contrastive learning where guarantees for linear probe evaluation can apply to realistic empirical settings.","2022-06-23","2023-04-19 05:09:33","2023-04-19 05:09:33","2023-04-19 05:09:33","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2106.04156 [cs, stat]","","/Users/patrickmineault/Zotero/storage/JVFCMJ5M/HaoChen et al. - 2022 - Provable Guarantees for Self-Supervised Deep Learn.pdf; /Users/patrickmineault/Zotero/storage/P7CYBTNG/2106.html","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:2106.04156","","","","","","","","","","","","","","","","","","","","","","","","","","","" 7 | "F5HXNRL3","preprint","2021","Lester, Brian; Al-Rfou, Rami; Constant, Noah","The Power of Scale for Parameter-Efficient Prompt Tuning","","","","10.48550/arXiv.2104.08691","http://arxiv.org/abs/2104.08691","In this work, we explore ""prompt tuning"", a simple yet effective mechanism for learning ""soft prompts"" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's ""few-shot"" learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method ""closes the gap"" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed ""prefix tuning"" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.","2021-09-02","2023-04-19 05:09:56","2023-04-19 05:09:56","2023-04-19 05:09:56","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2104.08691 [cs]","","/Users/patrickmineault/Zotero/storage/S37S22N3/Lester et al. - 2021 - The Power of Scale for Parameter-Efficient Prompt .pdf; /Users/patrickmineault/Zotero/storage/5WPHPW94/2104.html","","","Computer Science - Computation and Language","","","","","","","","","","","","","","","","","","","arXiv:2104.08691","","","","","","","","","","","","","","","","","","","","","","","","","","","" 8 | "KIJY3AU6","preprint","2021","Nagarajan, Vaishnavh; Andreassen, Anders; Neyshabur, Behnam","Understanding the Failure Modes of Out-of-Distribution Generalization","","","","10.48550/arXiv.2010.15775","http://arxiv.org/abs/2010.15775","Empirical studies suggest that machine learning models often rely on features, such as the background, that may be spuriously correlated with the label only during training time, resulting in poor accuracy during test-time. In this work, we identify the fundamental factors that give rise to this behavior, by explaining why models fail this way {\em even} in easy-to-learn tasks where one would expect these models to succeed. 
In particular, through a theoretical study of gradient-descent-trained linear classifiers on some easy-to-learn tasks, we uncover two complementary failure modes. These modes arise from how spurious correlations induce two kinds of skews in the data: one geometric in nature, and another, statistical in nature. Finally, we construct natural modifications of image classification datasets to understand when these failure modes can arise in practice. We also design experiments to isolate the two failure modes when training modern neural networks on these datasets.","2021-04-29","2023-04-19 05:11:04","2023-04-19 05:11:04","2023-04-19 05:11:04","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2010.15775 [cs, stat]","","/Users/patrickmineault/Zotero/storage/2IH2HHVF/Nagarajan et al. - 2021 - Understanding the Failure Modes of Out-of-Distribu.pdf; /Users/patrickmineault/Zotero/storage/ID2SRCHK/2010.html","","","Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:2010.15775","","","","","","","","","","","","","","","","","","","","","","","","","","","" 9 | "ACK5HXZF","preprint","2022","Bubeck, Sébastien; Sellke, Mark","A Universal Law of Robustness via Isoperimetry","","","","10.48550/arXiv.2105.12806","http://arxiv.org/abs/2105.12806","Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a partial theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. 
We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.","2022-12-23","2023-04-19 05:11:25","2023-04-19 05:11:25","2023-04-19 05:11:25","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2105.12806 [cs, stat]","","/Users/patrickmineault/Zotero/storage/V3DFGHK5/Bubeck and Sellke - 2022 - A Universal Law of Robustness via Isoperimetry.pdf; /Users/patrickmineault/Zotero/storage/4KEVS4TU/2105.html","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:2105.12806","","","","","","","","","","","","","","","","","","","","","","","","","","","" 10 | "X5YQUH8Z","preprint","2021","Caron, Mathilde; Touvron, Hugo; Misra, Ishan; Jégou, Hervé; Mairal, Julien; Bojanowski, Piotr; Joulin, Armand","Emerging Properties in Self-Supervised Vision Transformers","","","","10.48550/arXiv.2104.14294","http://arxiv.org/abs/2104.14294","In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.","2021-05-24","2023-04-19 05:11:52","2023-04-19 05:11:52","2023-04-19 05:11:52","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2104.14294 [cs]","","/Users/patrickmineault/Zotero/storage/DISN9VCJ/Caron et al. - 2021 - Emerging Properties in Self-Supervised Vision Tran.pdf; /Users/patrickmineault/Zotero/storage/CLQJF45P/2104.html","","","Computer Science - Computer Vision and Pattern Recognition","","","","","","","","","","","","","","","","","","","arXiv:2104.14294","","","","","","","","","","","","","","","","","","","","","","","","","","","" 11 | "CWQJ7GBH","preprint","2021","He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, Piotr; Girshick, Ross","Masked Autoencoders Are Scalable Vision Learners","","","","10.48550/arXiv.2111.06377","http://arxiv.org/abs/2111.06377","This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. 
Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.","2021-12-19","2023-04-19 05:12:10","2023-04-19 05:12:10","2023-04-19 05:12:10","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2111.06377 [cs]","","/Users/patrickmineault/Zotero/storage/VILLQ59S/He et al. - 2021 - Masked Autoencoders Are Scalable Vision Learners.pdf; /Users/patrickmineault/Zotero/storage/PT7R2QT4/2111.html","","","Computer Science - Computer Vision and Pattern Recognition","","","","","","","","","","","","","","","","","","","arXiv:2111.06377","","","","","","","","","","","","","","","","","","","","","","","","","","","" 12 | "VDMKIPK8","conferencePaper","2021","Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret","On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜","Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency","978-1-4503-8309-7","","10.1145/3442188.3445922","https://dl.acm.org/doi/10.1145/3442188.3445922","The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.","2021-03-01","2023-04-19 05:12:31","2023-04-19 05:12:31","2023-04-18","610–623","","","","","","On the Dangers of Stochastic Parrots","FAccT '21","","","","Association for Computing Machinery","New York, NY, USA","","","","","","ACM Digital Library","","","","/Users/patrickmineault/Zotero/storage/VEFRN9ZN/Bender et al. 
- 2021 - On the Dangers of Stochastic Parrots Can Language.pdf","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","" 13 | "GSZ6HHA2","preprint","2021","Miller, John; Taori, Rohan; Raghunathan, Aditi; Sagawa, Shiori; Koh, Pang Wei; Shankar, Vaishaal; Liang, Percy; Carmon, Yair; Schmidt, Ludwig","Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization","","","","10.48550/arXiv.2107.04649","http://arxiv.org/abs/2107.04649","For machine learning systems to be reliable, we must understand their performance in unseen, out-of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet, a synthetic pose estimation task derived from YCB objects, satellite imagery classification in FMoW-WILDS, and wildlife classification in iWildCam-WILDS. The strong correlations hold across model architectures, hyperparameters, training set size, and training duration, and are more precise than what is expected from existing domain adaptation theory. To complete the picture, we also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS. Finally, we provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.","2021-10-07","2023-04-19 05:12:51","2023-04-19 05:12:51","2023-04-19 05:12:51","","","","","","","Accuracy on the Line","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2107.04649 [cs, stat]","","/Users/patrickmineault/Zotero/storage/JKPI99X2/Miller et al. - 2021 - Accuracy on the Line On the Strong Correlation Be.pdf; /Users/patrickmineault/Zotero/storage/PRICILC9/2107.html","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:2107.04649","","","","","","","","","","","","","","","","","","","","","","","","","","","" 14 | "3T44X2UZ","preprint","2020","Sun, Yu; Wang, Xiaolong; Liu, Zhuang; Miller, John; Efros, Alexei A.; Hardt, Moritz","Test-Time Training with Self-Supervision for Generalization under Distribution Shifts","","","","10.48550/arXiv.1909.13231","http://arxiv.org/abs/1909.13231","In this paper, we propose Test-Time Training, a general approach for improving the performance of predictive models when training and test data come from different distributions. We turn a single unlabeled test sample into a self-supervised learning problem, on which we update the model parameters before making a prediction. This also extends naturally to data in an online stream. Our simple approach leads to improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts.","2020-07-01","2023-04-19 05:13:12","2023-04-19 05:13:12","2023-04-19 05:13:12","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1909.13231 [cs, stat]","","/Users/patrickmineault/Zotero/storage/H4T2CQM7/Sun et al. 
- 2020 - Test-Time Training with Self-Supervision for Gener.pdf; /Users/patrickmineault/Zotero/storage/7FZ942H7/1909.html","","","Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:1909.13231","","","","","","","","","","","","","","","","","","","","","","","","","","","" 15 | "3GU34VYB","preprint","2020","Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario","Scaling Laws for Neural Language Models","","","","10.48550/arXiv.2001.08361","http://arxiv.org/abs/2001.08361","We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.","2020-01-22","2023-04-19 05:13:42","2023-04-19 05:13:42","2023-04-19 05:13:42","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:2001.08361 [cs, stat]","","/Users/patrickmineault/Zotero/storage/SJ6CTB5T/Kaplan et al. - 2020 - Scaling Laws for Neural Language Models.pdf; /Users/patrickmineault/Zotero/storage/2VYW26J9/2001.html","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:2001.08361","","","","","","","","","","","","","","","","","","","","","","","","","","","" 16 | "XIQMB67U","preprint","2019","Ilyas, Andrew; Santurkar, Shibani; Tsipras, Dimitris; Engstrom, Logan; Tran, Brandon; Madry, Aleksander","Adversarial Examples Are Not Bugs, They Are Features","","","","10.48550/arXiv.1905.02175","http://arxiv.org/abs/1905.02175","Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data.","2019-08-12","2023-04-19 05:14:02","2023-04-19 05:14:02","2023-04-19 05:14:02","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1905.02175 [cs, stat]","","/Users/patrickmineault/Zotero/storage/MJ8DECHD/Ilyas et al. 
- 2019 - Adversarial Examples Are Not Bugs, They Are Featur.pdf; /Users/patrickmineault/Zotero/storage/2L8MJJFN/1905.html","","","Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning; Statistics - Machine Learning; Computer Science - Cryptography and Security","","","","","","","","","","","","","","","","","","","arXiv:1905.02175","","","","","","","","","","","","","","","","","","","","","","","","","","","" 17 | "6GN6BI5S","journalArticle","2020","Bartlett, Peter L.; Long, Philip M.; Lugosi, Gábor; Tsigler, Alexander","Benign Overfitting in Linear Regression","Proceedings of the National Academy of Sciences","","0027-8424, 1091-6490","10.1073/pnas.1907378117","http://arxiv.org/abs/1906.11300","The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lies in an infinite dimensional space versus when the data lies in a finite dimensional space whose dimension grows faster than the sample size.","2020-12","2023-04-19 05:14:23","2023-04-19 05:14:23","2023-04-19 05:14:23","30063-30070","","48","117","","Proc. Natl. Acad. Sci. U.S.A.","","","","","","","","","","","","","arXiv.org","","arXiv:1906.11300 [cs, math, stat]","","/Users/patrickmineault/Zotero/storage/JX6H3HEP/Bartlett et al. - 2020 - Benign Overfitting in Linear Regression.pdf; /Users/patrickmineault/Zotero/storage/QYZ3VBUV/1906.html","","","Computer Science - Machine Learning; Statistics - Machine Learning; Mathematics - Statistics Theory","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","" 18 | "4GEY573Z","preprint","2019","Oliver, Avital; Odena, Augustus; Raffel, Colin; Cubuk, Ekin D.; Goodfellow, Ian J.","Realistic Evaluation of Deep Semi-Supervised Learning Algorithms","","","","10.48550/arXiv.1804.09170","http://arxiv.org/abs/1804.09170","Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. SSL algorithms based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that these algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. 
We find that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples. To help guide SSL research towards real-world applicability, we make our unified reimplementation and evaluation platform publicly available.","2019-06-17","2023-04-19 05:15:26","2023-04-19 05:15:26","2023-04-19 05:15:26","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1804.09170 [cs, stat]","","/Users/patrickmineault/Zotero/storage/GN4JEBYC/Oliver et al. - 2019 - Realistic Evaluation of Deep Semi-Supervised Learn.pdf; /Users/patrickmineault/Zotero/storage/BE8B877V/1804.html","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:1804.09170","","","","","","","","","","","","","","","","","","","","","","","","","","","" 19 | "BB6STB8D","preprint","2020","Jacot, Arthur; Gabriel, Franck; Hongler, Clément","Neural Tangent Kernel: Convergence and Generalization in Neural Networks","","","","10.48550/arXiv.1806.07572","http://arxiv.org/abs/1806.07572","At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_\theta$ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function $f_\theta$ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.","2020-02-10","2023-04-19 05:15:48","2023-04-19 05:15:48","2023-04-19 05:15:48","","","","","","","Neural Tangent Kernel","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1806.07572 [cs, math, stat]","","/Users/patrickmineault/Zotero/storage/FY96IFZ5/Jacot et al.
- 2020 - Neural Tangent Kernel Convergence and Generalizat.pdf; /Users/patrickmineault/Zotero/storage/D8KL2D7Q/1806.html","","","Computer Science - Machine Learning; Statistics - Machine Learning; Computer Science - Neural and Evolutionary Computing; Mathematics - Probability","","","","","","","","","","","","","","","","","","","arXiv:1806.07572","","","","","","","","","","","","","","","","","","","","","","","","","","","" 20 | "FQKC3JPL","preprint","2022","Soudry, Daniel; Hoffer, Elad; Nacson, Mor Shpigel; Gunasekar, Suriya; Srebro, Nathan","The Implicit Bias of Gradient Descent on Separable Data","","","","10.48550/arXiv.1710.10345","http://arxiv.org/abs/1710.10345","We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.","2022-07-19","2023-04-19 05:16:11","2023-04-19 05:16:11","2023-04-19 05:16:11","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1710.10345 [cs, stat]","","/Users/patrickmineault/Zotero/storage/AG9RZPAK/Soudry et al. - 2022 - The Implicit Bias of Gradient Descent on Separable.pdf; /Users/patrickmineault/Zotero/storage/723DKFE7/1710.html","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:1710.10345","","","","","","","","","","","","","","","","","","","","","","","","","","","" 21 | "U7F779J2","preprint","2019","Frankle, Jonathan; Carbin, Michael","The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks","","","","10.48550/arXiv.1803.03635","http://arxiv.org/abs/1803.03635","Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the ""lottery ticket hypothesis:"" dense, randomly-initialized, feed-forward networks contain subnetworks (""winning tickets"") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.","2019-03-04","2023-04-19 05:17:01","2023-04-19 05:17:01","2023-04-19 05:17:01","","","","","","","The Lottery Ticket Hypothesis","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1803.03635 [cs]","","/Users/patrickmineault/Zotero/storage/6TJ6KYMG/Frankle and Carbin - 2019 - The Lottery Ticket Hypothesis Finding Sparse, Tra.pdf; /Users/patrickmineault/Zotero/storage/QTET8VW5/1803.html","","","Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Computer Science - Neural and Evolutionary Computing","","","","","","","","","","","","","","","","","","","arXiv:1803.03635","","","","","","","","","","","","","","","","","","","","","","","","","","","" 22 | "XAKELA2H","preprint","2017","Neyshabur, Behnam; Bhojanapalli, Srinadh; McAllester, David; Srebro, Nathan","Exploring Generalization in Deep Learning","","","","10.48550/arXiv.1706.08947","http://arxiv.org/abs/1706.08947","With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.","2017-07-06","2023-04-19 05:17:20","2023-04-19 05:17:20","2023-04-19 05:17:20","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1706.08947 [cs]","","/Users/patrickmineault/Zotero/storage/J3EH3A63/Neyshabur et al. - 2017 - Exploring Generalization in Deep Learning.pdf; /Users/patrickmineault/Zotero/storage/RJGX2NCK/1706.html","","","Computer Science - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:1706.08947","","","","","","","","","","","","","","","","","","","","","","","","","","","" 23 | "3EH3XM6H","preprint","2017","Finn, Chelsea; Abbeel, Pieter; Levine, Sergey","Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks","","","","10.48550/arXiv.1703.03400","http://arxiv.org/abs/1703.03400","We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. 
We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.","2017-07-18","2023-04-19 05:17:45","2023-04-19 05:17:45","2023-04-19 05:17:45","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1703.03400 [cs]","","/Users/patrickmineault/Zotero/storage/4UZEAG3T/Finn et al. - 2017 - Model-Agnostic Meta-Learning for Fast Adaptation o.pdf; /Users/patrickmineault/Zotero/storage/59WSRNHL/1703.html","","","Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Computer Science - Neural and Evolutionary Computing","","","","","","","","","","","","","","","","","","","arXiv:1703.03400","","","","","","","","","","","","","","","","","","","","","","","","","","","" 24 | "VCQP4E2S","preprint","2012","Schoelkopf, Bernhard; Janzing, Dominik; Peters, Jonas; Sgouritsa, Eleni; Zhang, Kun; Mooij, Joris","On Causal and Anticausal Learning","","","","10.48550/arXiv.1206.6471","http://arxiv.org/abs/1206.6471","We consider the problem of function estimation in the case where an underlying causal model can be inferred. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results.","2012-06-27","2023-04-19 05:19:20","2023-04-19 05:19:20","2023-04-19 05:19:20","","","","","","","","","","","","arXiv","","","","","","","arXiv.org","","arXiv:1206.6471 [cs, stat]","","/Users/patrickmineault/Zotero/storage/TAUY85AC/1206.html; /Users/patrickmineault/Zotero/storage/BLCXFQVW/Schölkopf et al. - 2012 - On causal and anticausal learning.pdf","","","Computer Science - Machine Learning; Statistics - Machine Learning","","","","","","","","","","","","","","","","","","","arXiv:1206.6471","","","","","","","","","","","","","","","","","","","","","","","","","","","" --------------------------------------------------------------------------------